Description Usage Arguments Reading Large WARC files Examples
NOTE: The API for this function is likely to change since this is a work in progress (WIP). Optimizations will be occurring at the Java level as well.
path | path to WARC file
warc_types | if not
include_payload | if
Typical WARC files from sources like Common Crawl (http://commoncrawl.org/the-data/)
are between 100 MB and ~1 GB in size (compressed). Since the goal of read_warc() is
to bring a WARC file into an R data frame, the resulting data frames can become quite
large (proportional to the size of the WARC file). You may need to do the following at the
top of scripts or at the start of an R session, before the JVM is started, to ensure it
has enough room to accommodate the vectors used in the data frame creation:
    options(java.parameters = "-Xmx2g")
The 2g value may need to be higher in specific use cases.
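The ordering matters here, so a minimal sketch of a script header may help. It assumes jwatr starts its JVM through rJava, which reads `java.parameters` only once, when the JVM is initialised. The `4g` value is purely illustrative, and the `requireNamespace()` guard is an addition so the snippet still runs where jwatr is not installed.

```r
# Set the heap size first: the option is read only when the JVM starts,
# so changing it after jwatr (and its JVM) is loaded has no effect.
options(java.parameters = "-Xmx4g")  # "4g" is an illustrative value, not a recommendation

# Guarded so the script still runs where jwatr is not installed.
if (requireNamespace("jwatr", quietly = TRUE)) {
  library(jwatr)

  # Same call as in the Examples: keep only response records and skip
  # payloads, which also reduces the size of the resulting data frame.
  xdf <- read_warc(
    system.file("extdata/sample.warc.gz", package = "jwatr"),
    warc_types = "response",
    include_payload = FALSE
  )

  print(dim(xdf))
}
```

If the heap still proves too small, restart the R session before raising the value; a running JVM keeps the limit it was started with.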
Functions will eventually be provided to "stream process" WARC files rather than read them entirely into memory.
    read_warc(system.file("extdata/bbc.warc", package="jwatr"))
    read_warc(system.file("extdata/sample.warc.gz", package="jwatr"),
              warc_types = "response", include_payload = FALSE)