warc: Tools to Work with the Web Archive Ecosystem


Description

WARC files (and the metadata files that usually accompany them) are the de facto format for archiving web content.

Details

Tools for working with this data already exist in Python and Java, and many "big data" tools make it straightforward to work with large-scale data from sources such as Common Crawl and the Internet Archive.

Now there are tools to create and work with the WARC ecosystem in R.

Possible use-cases:

warc can work with WARC files that are composed of individual gzip streams or with plaintext WARC files, and it can also read and generate CDX files. Support for more file types (e.g. WET, WAT) is planned.
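
To make the plaintext record layout concrete, here is a minimal sketch in base R that splits the header block of a single WARC record into named fields. It does not use the warc package's own functions; `parse_warc_header` and the sample `record` are hypothetical and exist only for illustration. It assumes a record starts with a version line, followed by `Name: value` header lines, then a blank line before the payload.

```r
# Hypothetical helper (not part of the warc package): parse the header
# block of one plaintext WARC record supplied as a character vector of lines.
parse_warc_header <- function(lines) {
  lines <- sub("\r$", "", lines)             # WARC records use CRLF line endings
  blank <- which(lines == "")                # a blank line ends the header block
  end <- if (length(blank) > 0) blank[1] - 1 else length(lines)
  version <- lines[1]                        # e.g. "WARC/1.0"
  fields <- lines[seq_len(end)][-1]          # header lines after the version line
  sep <- regexpr(":", fields, fixed = TRUE)  # position of the first colon
  keys <- substr(fields, 1, sep - 1)
  values <- trimws(substr(fields, sep + 1, nchar(fields)))
  c(list(version = version), setNames(as.list(values), keys))
}

# Made-up sample record, for illustration only
record <- c(
  "WARC/1.0",
  "WARC-Type: response",
  "WARC-Target-URI: http://example.com/",
  "Content-Length: 12",
  "",
  "hello, world"
)

hdr <- parse_warc_header(record)
hdr[["WARC-Type"]]
hdr[["WARC-Target-URI"]]
```

In a file made of individual gzip streams, each record would first need to be decompressed before its header could be read this way; the sketch covers only the plaintext case.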

Author(s)

Bob Rudis (@hrbrmstr)


hrbrmstr/warc documentation built on May 17, 2019, 5:53 p.m.