warc: Tools to Work with the Web Archive Ecosystem

WARC files (and the metadata files that usually accompany them) are the de facto standard for archiving web content.

There are tools in Python & Java for working with this data, and many "big data" tools make it straightforward to work with large-scale data from sources like Common Crawl and The Internet Archive.

Now there are tools to create and work with the WARC ecosystem in R.

Possible use-cases:

Scraping data from many URLs when you want the analyses on that data to be reproducible, are concerned that the sites may change format or go offline, and don't want to manage individual HTML (etc.) files
Analyzing Common Crawl data (etc.) natively in R
Saving the entire state of an httr request (warc can turn httr responses into WARC files and turn WARC response records into httr::response objects)

warc can work with WARC files that are composed of individual gzip streams or with plaintext WARC files, and can also read & generate CDX files. Support for more file types (e.g. WET, WAT, etc.) is planned.
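
For orientation, a plaintext WARC record consists of a version line, named header fields, a blank line, and then a payload whose size is given by Content-Length. The sketch below uses base R only and does not use the warc package's own functions. The helper parse_warc_record() is hypothetical, the record contents (URI, date, payload) are made up for illustration, and the parsing assumes a single uncompressed record with an ASCII payload.

```r
# A hand-written, uncompressed WARC record. The URI, date, and payload
# are made up for illustration only.
payload <- "HTTP/1.1 200 OK\r\nContent-Type: text/plain\r\n\r\nhello"
record <- paste0(
  "WARC/1.0\r\n",
  "WARC-Type: response\r\n",
  "WARC-Target-URI: http://example.com/\r\n",
  "WARC-Date: 2019-01-01T00:00:00Z\r\n",
  "Content-Length: ", nchar(payload, type = "bytes"), "\r\n",
  "\r\n",
  payload,
  "\r\n\r\n"
)

# parse_warc_record() is a hypothetical helper, NOT part of the warc package.
# Assumes one plaintext record with an ASCII payload (bytes == characters).
parse_warc_record <- function(x) {
  # The WARC header block ends at the first blank line (CRLF CRLF).
  split_at <- as.integer(regexpr("\r\n\r\n", x, fixed = TRUE))
  header_lines <- strsplit(substr(x, 1, split_at - 1), "\r\n", fixed = TRUE)[[1]]

  # First line is the version; the rest are "Name: value" fields.
  version <- header_lines[1]
  fields <- header_lines[-1]
  colon <- as.integer(regexpr(":", fields, fixed = TRUE))
  headers <- trimws(substring(fields, colon + 1))
  names(headers) <- substr(fields, 1, colon - 1)

  # The payload starts after the blank line and runs for Content-Length bytes.
  len <- as.integer(headers[["Content-Length"]])
  body_start <- split_at + 4
  body <- substr(x, body_start, body_start + len - 1)

  list(version = version, headers = headers, body = body)
}

rec <- parse_warc_record(record)
rec$headers[["WARC-Target-URI"]]
```

In a gzip-per-record WARC file, each such record is compressed as its own gzip member, which is what allows a CDX index to point at a single record by byte offset.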

Bob Rudis (@hrbrmstr)

hrbrmstr/warc documentation built on May 17, 2019, 5:53 p.m.