warc : Tools to Work with the Web Archive Ecosystem

WARC files (and the metadata files that usually accompany them) are the de facto method of archiving web content. Tools for working with this data exist in Python & Java, and many "big data" tools make it very straightforward to work with large-scale data from sources like Common Crawl and The Internet Archive.

Now there are tools to create and work with the WARC ecosystem in R.

Possible use-cases:

warc can work with WARC files that are composed of individual gzip streams as well as plaintext WARC files, and it can also read & generate CDX files. Support for more file types (e.g. WET, WAT, etc.) is planned.

Since I ended up making some gz file functions for this package, it only seemed appropriate to expose them.

The following functions are implemented:

Installation

You need wget on your system PATH. Folks on real operating systems can do the apt-get, yum install or brew install (et al) dance for their particular system. Version 1.18+ is recommended, but any version with support for WARC extensions should do.

Windows folks will need to grab the statically linked 32-bit or 64-bit binaries from here and put them on your system PATH somewhere if you want to create WARC files in bulk using wget.
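Before installing, it can help to confirm from R that wget is reachable and that its help text mentions the WARC options. The following is a minimal sketch using base R only. The variable names are illustrative, and searching the help output for `--warc-file` is an assumed heuristic rather than a check the package itself performs.

```r
# Locate wget on the PATH; Sys.which() returns "" when nothing is found
wget_path <- unname(Sys.which("wget"))

if (!nzchar(wget_path)) {
  message("wget was not found on the PATH")
} else {
  # Assumption: a WARC-capable wget lists --warc-file in its help text
  help_text <- system2(wget_path, "--help", stdout = TRUE, stderr = TRUE)
  has_warc <- any(grepl("--warc-file", help_text, fixed = TRUE))
  message("wget found at ", wget_path, "; WARC options listed: ", has_warc)
}
```

If the first message appears, adjust the PATH (or install wget) before trying to create WARC files in bulk.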

devtools::install_git("https://gitlab.com/hrbrmstr/warc.git")
options(width=120)

Usage

library(warc)
library(httr)

# current version
packageVersion("warc")

cdx <- read_cdx(system.file("extdata", "20160901.cdx", package="warc"))

i <- 5

path <- file.path(cdx$warc_path[i], cdx$file_name[i])
start <- cdx$compressed_arc_file_offset[i]

entry <- read_warc_entry(path, start)

print(entry)

print(warc_headers(entry))

print(status_code(entry))

print(http_type(entry))

Creating + reading

library(warc)
library(purrr)
library(rvest)
library(httr) # provides content(), used below

warc_dir <- file.path(tempdir(), "rfolks")
dir.create(warc_dir)

urls <- c("http://rud.is/",
          "http://hadley.nz/",
          "http://dirk.eddelbuettel.com/",
          "https://jeroenooms.github.io/",
          "https://ironholds.org/")

create_warc_wget(urls, warc_dir, warc_file="rfolks-warc")

cdx <- read_cdx(file.path(warc_dir, "rfolks-warc.cdx"))

sites <- map(seq_len(nrow(cdx)),
             ~read_warc_entry(file.path(cdx$warc_path[.],
                                        cdx$file_name[.]), 
                              cdx$compressed_arc_file_offset[.]))

map(sites, ~read_html(content(., as="text", encoding="UTF-8"))) %>% 
  map_chr(~html_text(html_nodes(., "title")))

# recursive=TRUE is required for unlink() to remove a directory
unlink(warc_dir, recursive=TRUE)

Test Results

library(warc)
library(testthat)

date()

test_dir("tests/")


hrbrmstr/warc documentation built on May 17, 2019, 5:53 p.m.