Web ARChive

Filename extension	.warc
Internet media type	application/warc[1]
Extended from	ARC[2]
Standard	ISO 28500:2017[3]
Open format?	Yes
Website	iipc.github.io/warc-specifications/specifications/warc-format/warc-1.1/

The Web ARChive (WARC) archive format specifies a method for combining multiple digital resources into an aggregate archive file together with related information. WARC is a revision of the Internet Archive's ARC_IA File Format,[4] which has traditionally been used to store "web crawls" as sequences of content blocks harvested from the World Wide Web. WARC generalizes the older format to better support the harvesting, access, and exchange needs of archiving organizations. Besides the primary content that ARC already recorded, the revision accommodates related secondary content, such as assigned metadata, abbreviated duplicate detection events, and later-date transformations.[5] The record layout is inspired by HTTP/1.0 streams: each record has a similar header and uses CRLFs as delimiters, which makes the format well suited to crawler implementations.
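
The HTTP-like layout described above can be illustrated with a short Python sketch. It is not taken from the article or from any WARC library. It assumes the record structure as laid out in the WARC 1.1 specification: a version line, "Name: value" header fields, a blank line, a content block of Content-Length bytes, and two trailing CRLFs. The function names build_record and read_records are hypothetical, and the sketch does not validate records against the specification.

```python
import io
import uuid
from datetime import datetime, timezone
from typing import Optional

CRLF = b"\r\n"


def build_record(record_type: str, payload: bytes, content_type: str,
                 target_uri: Optional[str] = None) -> bytes:
    """Serialize one record: version line, header fields, blank line, block."""
    fields = [
        ("WARC-Type", record_type),
        ("WARC-Record-ID", f"<urn:uuid:{uuid.uuid4()}>"),
        ("WARC-Date", datetime.now(timezone.utc).strftime("%Y-%m-%dT%H:%M:%SZ")),
        ("Content-Type", content_type),
        ("Content-Length", str(len(payload))),
    ]
    if target_uri is not None:
        fields.append(("WARC-Target-URI", target_uri))
    head = b"WARC/1.1" + CRLF
    for name, value in fields:
        head += f"{name}: {value}".encode("utf-8") + CRLF
    # A blank line ends the header; two CRLFs follow the content block.
    return head + CRLF + payload + CRLF + CRLF


def read_records(stream):
    """Yield (version, headers, block) for each record in a binary stream."""
    while True:
        line = stream.readline()
        if not line:          # end of stream
            return
        if line == CRLF:      # separator lines between records
            continue
        version = line.rstrip(b"\r\n").decode("utf-8")
        headers = {}
        while True:
            line = stream.readline()
            if line in (CRLF, b""):   # blank line ends the header
                break
            name, _, value = line.rstrip(b"\r\n").decode("utf-8").partition(":")
            headers[name.strip()] = value.strip()
        # The block is read by length, so it may itself contain CRLFs.
        block = stream.read(int(headers["Content-Length"]))
        yield version, headers, block


if __name__ == "__main__":
    archive = (
        build_record("resource", b"<html>hello</html>", "text/html",
                     "http://example.com/")
        + build_record("metadata", b"note: example", "text/plain")
    )
    for version, headers, block in read_records(io.BytesIO(archive)):
        print(version, headers["WARC-Type"], len(block))
```

Because each record carries its own Content-Length, a reader can skip from one record to the next without interpreting the content block, much as an HTTP client reads a response body by length.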

WARC is now recognised by most national library systems as the standard to follow for web archiving.[6]

Software

References

[1] "application/warc". Retrieved 17 March 2018.
[2] "Introduction". Retrieved 5 March 2015.
[3] "Information and documentation -- WARC file format". Retrieved 16 March 2018.
[4] "ARC_IA, Internet Archive ARC file format". www.digitalpreservation.gov. Retrieved 9 May 2015.
[5] "WARC, Web ARChive file format". www.digitalpreservation.gov. Retrieved 9 May 2015.
[6] http://digitalia.sbn.it/article/view/1473
[7] Scrivano, Giuseppe (6 August 2012). "GNU wget 1.14 released". Free Software Foundation, Inc. Retrieved 25 February 2016.

External links

This article is issued from Wikipedia. The text is licensed under Creative Commons - Attribution - Sharealike. Additional terms may apply for the media files.
