read_warc: Read a WARC file (compressed or uncompressed)
In hrbrmstr/jwatr: Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit


NOTE: The API for this function is likely to change, since this is a work in progress. Optimizations will also be made at the Java level.

read_warc(path, warc_types = NULL, include_payload = FALSE)

`path`	path to WARC file
`warc_types`	if not `NULL` and one or more of `warcinfo`, `request`, `response`, `resource`, `metadata`, `revisit`, `conversion` then returned WARC records will be filtered to only include the specified record types.
`include_payload`	if `TRUE` then the payload for each WARC record will be included.
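As a small illustration of the `warc_types` argument, the sketch below checks a requested set of record types against the names listed above before they are passed on. The helper `check_warc_types()` and the vector `valid_warc_types` are hypothetical and not part of jwatr; whether `read_warc()` itself rejects unknown type names is not stated here, so treat this as an assumption-laden convenience, not documented behavior.

```r
# Record types listed in the argument description above.
valid_warc_types <- c("warcinfo", "request", "response", "resource",
                      "metadata", "revisit", "conversion")

# Hypothetical helper (not part of jwatr): stop early on unknown record types.
check_warc_types <- function(warc_types = NULL) {
  # NULL means "no filtering", so there is nothing to validate.
  if (is.null(warc_types)) return(invisible(NULL))
  bad <- setdiff(warc_types, valid_warc_types)
  if (length(bad) > 0) {
    stop("Unknown WARC record type(s): ", paste(bad, collapse = ", "))
  }
  invisible(warc_types)
}

check_warc_types(c("request", "response"))  # passes silently
# check_warc_types("respose")               # would raise an error (typo)
```

A validated vector could then be supplied as `warc_types` in a call to `read_warc()`.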

Typical WARC files from sources like Common Crawl (http://commoncrawl.org/the-data/) are between 100 MB and ~1 GB in size (compressed). Since the goal of read_warc() is to bring a WARC file into an R data frame, the resulting data frame can become quite large, roughly in proportion to the size of the WARC file. You may need to do the following at the top of scripts or at the start of an R session to ensure the JVM has enough room to accommodate the vectors used in the data frame creation:

options(java.parameters = "-Xmx2g")

The 2g value may need to be higher in specific use cases.
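The ordering matters because the JVM reads `java.parameters` only once, when it is first initialized in the session (typically when rJava is loaded). The sketch below sets the option before attaching the package. The `4g` value is purely illustrative, and the guard around `library(jwatr)` is only there so the snippet runs on machines where the package is not installed.

```r
# Set the JVM heap size BEFORE anything loads rJava; changing this option
# after the JVM has started has no effect for the rest of the session.
options(java.parameters = "-Xmx4g")  # illustrative value, adjust as needed

if (requireNamespace("jwatr", quietly = TRUE)) {
  library(jwatr)

  # Sample file shipped with the package (same path as in the Examples).
  warc_path <- system.file("extdata/sample.warc.gz", package = "jwatr")

  # Skipping payloads and filtering record types keeps the data frame smaller.
  xdf <- read_warc(warc_path, warc_types = "response", include_payload = FALSE)
  print(dim(xdf))
}
```

If the JVM was already started earlier in the session, restarting R before setting the option is the usual remedy.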

Functions will eventually be provided to "stream process" WARC files rather than reading them entirely into memory.

read_warc(system.file("extdata/bbc.warc", package="jwatr"))

read_warc(system.file("extdata/sample.warc.gz", package="jwatr"),
          warc_types = "response", include_payload = FALSE)

hrbrmstr/jwatr documentation built on May 31, 2019, 1:15 p.m.
