jwatr
: Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit
The Java Web Archive Toolkit ('JWAT') https://sbforge.org/display/JWAT/Overview is a library of Java objects and methods which enables reading, writing and validating web archive files.
WIP!!! Reading & writing need some optimization and edge case checking. There's also a chance I'll change the name to warc
but some folks are using that package now and I dinna want to cause pain there yet.
The following functions are implemented:
Reading
read_warc
: Read a WARC file (compressed or uncompressed)
warc_stream_in
: Stream in records from a WARC file
Writing
warc_file
: Create a new WARC file
warc_write_warcinfo
: Write a 'warcinfo' record to a WARC file
warc_write_response
: Write simple httr::GET requests or full httr response objects to a WARC file
close_warc_file
: Close a WARC file
httr Wrappers
warc_GET
: WARC-ify an httr::GET request
warc_POST
: WARC-ify an httr::POST request
Utility
response_list_to_warc_file
: Turns a list of 'httr' 'response' objects into a WARC file
payload_content
: Helper function to convert WARC raw headers+payload into something useful
is_compressed
: Test if a raw vector is gzip compressed
NOTE: To read in typical (~800MB-1GB) gzip'd WARC files, you should consider doing the following (in order) in your scripts:
options(java.parameters = "-Xmx2g")
library(rJava)
library(jwatjars)
library(jwatr)
That idiom generally provides enough heap space, but you may need to adjust the heap size if you've got larger payloads.
Alternatively, you can set the same option in your R startup scripts, but that will likely come back to bite you when moving workloads around.
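The ordering in that idiom matters because the heap option is only read when the JVM starts. Here is a minimal sketch of a guard you could drop at the top of a script. `set_java_heap()` is a hypothetical helper written for this example, not something jwatr or rJava provides, and the check on loaded namespaces is only a rough proxy for whether the JVM is already running:

```r
# Hypothetical helper (not part of jwatr or rJava): set the JVM heap option
# only while it still has a chance of taking effect.
set_java_heap <- function(size = "2g") {
  if ("rJava" %in% loadedNamespaces()) {
    # Assumption: once rJava is loaded the JVM has probably been started,
    # so changing the option now is unlikely to help in this session.
    warning("rJava is already loaded; a new heap size may not take effect")
    return(invisible(FALSE))
  }
  options(java.parameters = paste0("-Xmx", size))
  invisible(TRUE)
}

set_java_heap("2g")
getOption("java.parameters")
```

Call it before any `library()` line that pulls in rJava, the same way the plain `options()` call is used above.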
devtools::install_github("hrbrmstr/jwatr")
options(width=120)
library(rJava)
library(jwatr)
library(magick)
library(tidyverse)

# current version
packageVersion("jwatr")
# small, uncompressed WARC file
glimpse(read_warc(system.file("extdata/bbc.warc", package="jwatr")))

# larger example
xdf <- read_warc(system.file("extdata/sample.warc.gz", package="jwatr"),
                 warc_types = "response", include_payload = TRUE)
glimpse(xdf)

# get the payload content
payload_content(url = xdf$target_uri[279], ctype = xdf$http_protocol_content_type[279],
                xdf$http_raw_headers[[279]], xdf$payload[[279]])

# or ingest the raw bits yourself
imgs <- filter(xdf, grepl("(png|gif|jpeg)$", http_protocol_content_type))
imgs
image_read(imgs$payload[[1]])
library(jwatr)
library(httr)
library(magick)
library(tidyverse)

tf <- tempfile("test")
wf <- warc_file(tf)

warc_write_response(wf, "https://rud.is/b/") # store a simple httr::GET request
warc_write_response(wf, GET("https://rud.is/b/"))
warc_write_response(wf, "https://www.rstudio.com/")
warc_write_response(wf, "https://www.r-project.org/")

# all valid content types work, like this PDF
warc_write_response(wf, "http://che.org.il/wp-content/uploads/2016/12/pdf-sample.pdf")

# complex API calls can be made and the results stored in the WARC file as well
# this API call returns a JSON object
POST(
  url = "https://data.police.uk/api/crimes-street/all-crime",
  query = list(
    lat = "52.629729",
    lng = "-1.131592",
    date = "2017-01")
) -> uk_res

warc_write_response(wf, uk_res)
warc_write_response(wf, "https://journal.r-project.org/RLogo.png")

close_warc_file(wf)

xdf <- read_warc(sprintf("%s.warc.gz", tf), include_payload = TRUE)
glimpse(xdf)

# decode the WARC stored JSON response from the UK Crimes API
glimpse(jsonlite::fromJSON(rawToChar(xdf[6,]$payload[[1]]), flatten=TRUE))

select(xdf, content_length, http_protocol_content_type)

image_read(xdf$payload[[5]])
unlink(tf)
The warc_stream_in() function provides a pure-R method for stream processing WARC files through the use of an R callback handler. One way of using this is to build a data frame. The following example builds a data frame of WARC response records. Space is reserved for a 10,000-element list which will get truncated or expanded as necessary:
# reserve 10,000 empty slots (list(10000) would create a one-element list)
xdf <- vector("list", 10000)
xdf_i <- 0

# callback: called once per record with its headers and raw payload
myfun <- function(headers, payload, ...) {
  headers <- setNames(headers, gsub("-", "_", names(headers)))
  xdf_i <<- xdf_i + 1
  headers$payload <- list(payload)
  xdf[xdf_i] <<- list(headers)
}

(n <- warc_stream_in(
  system.file("extdata/sample.warc.gz", package="jwatr"),
  myfun,
  warc_types = "response"
))

xdf <- bind_rows(xdf)
glimpse(xdf)

count(xdf, content_type)

cat(rawToChar(xdf$payload[[1]]))
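The reserve-then-trim pattern behind that handler can be tried without a WARC file or a JVM. Below is a minimal base-R sketch; `fake_stream()` and the two records are made up for illustration and only mimic the shape of a callback-driven reader, they are not part of jwatr:

```r
# Hypothetical stand-in for a streaming reader (NOT a jwatr function):
# calls `handler` once per record and returns the number of records seen.
fake_stream <- function(records, handler) {
  for (rec in records) handler(rec$headers, rec$payload)
  length(records)
}

out <- vector("list", 4)  # reserve more slots than will be used
out_i <- 0

# callback: store each record in the next free slot
collect <- function(headers, payload, ...) {
  out_i <<- out_i + 1
  headers$payload <- list(payload)  # keep the raw payload as a list element
  out[out_i] <<- list(headers)      # assigning past the end would grow the list
}

# made-up records for illustration only
records <- list(
  list(headers = list(content_type = "text/html"), payload = charToRaw("<p>a</p>")),
  list(headers = list(content_type = "image/png"), payload = as.raw(c(0x89, 0x50)))
)

n <- fake_stream(records, collect)
out <- out[seq_len(out_i)]  # drop the slots that were never filled
length(out)
```

Trimming with `seq_len(out_i)` makes the truncation explicit, which is handy if you want to use the list directly rather than passing it to `bind_rows()`.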
library(jwatr)
library(testthat)

date()

test_dir("tests/")
Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.