read_warc: Read a WARC file (compressed or uncompressed)

Description

NOTE: The API for this function is likely to change, since it is a work in progress. Optimizations will also be made at the Java level.

Usage

read_warc(path, warc_types = NULL, include_payload = FALSE)

Arguments

path

path to the WARC file

warc_types

if not NULL, one or more of warcinfo, request, response, resource, metadata, revisit, conversion; the returned WARC records are filtered to include only the specified record types.

include_payload

if TRUE, the payload of each WARC record is included.
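
A minimal sketch of how the arguments combine. It assumes that jwatr is installed and that warc_types accepts a character vector when more than one record type is wanted (the "one or more of" wording suggests this, but it is not confirmed here). The variable names path and recs are illustrative only.

```r
# Sketch only: runs the read if jwatr is available, otherwise does nothing.
if (requireNamespace("jwatr", quietly = TRUE)) {
  # Sample file shipped with the package (same one used in the Examples)
  path <- system.file("extdata/sample.warc.gz", package = "jwatr")

  # Assumption: several record types can be passed as a character vector.
  # Keep only request and response records and include each record's payload.
  recs <- jwatr::read_warc(path,
                           warc_types = c("request", "response"),
                           include_payload = TRUE)

  # Inspect the top-level structure of the returned data frame
  str(recs, max.level = 1)
}
```

Leaving include_payload at its default of FALSE keeps the result smaller when only the record metadata is needed.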

Reading Large WARC files

Typical WARC files from sources like Common Crawl (http://commoncrawl.org/the-data/) are between 100 MB and ~1 GB in size (compressed). Because read_warc() reads an entire WARC file into an R data frame, the resulting data frame can become quite large (its size is proportional to that of the WARC file). You may need to set the following at the top of a script or at the start of an R session so that the JVM has enough memory for the vectors used to build the data frame:

options(java.parameters = "-Xmx2g")

The 2g value may need to be higher in specific use cases.
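
As a rough sketch of the ordering involved: rJava reads the java.parameters option when the JVM is initialized, so the option has to be set before jwatr is loaded. The 4g value and the names heap and xdf below are illustrative assumptions, not recommendations from the package.

```r
# Set the JVM heap size first; once the JVM has started, changing this
# option has no effect for the rest of the R session.
heap <- "-Xmx4g"  # illustrative value; tune to the size of your WARC files
options(java.parameters = heap)

# Only now load jwatr (which starts the JVM through rJava).
if (requireNamespace("jwatr", quietly = TRUE)) {
  library(jwatr)
  xdf <- read_warc(system.file("extdata/sample.warc.gz", package = "jwatr"))

  # Rough in-memory size of the resulting data frame on the R side
  print(format(object.size(xdf), units = "MB"))
}
```

If the JVM still runs out of memory, restart R and raise the value before loading the package again.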

Functions will eventually be provided to "stream process" WARC files rather than read them entirely into memory.

Examples

read_warc(system.file("extdata/bbc.warc", package="jwatr"))

read_warc(system.file("extdata/sample.warc.gz", package="jwatr"),
          warc_types = "response", include_payload = FALSE)

hrbrmstr/jwatr documentation built on May 31, 2019, 1:15 p.m.