jwatr : Tools to Query and Create Web Archive Files Using the Java Web Archive Toolkit

The Java Web Archive Toolkit ('JWAT', https://sbforge.org/display/JWAT/Overview) is a library of Java objects and methods that enables reading, writing, and validating web archive files.

WIP!!! Reading & writing need some optimization and edge case checking. There's also a chance I'll change the name to warc but some folks are using that package now and I dinna want to cause pain there yet.

The following functions are implemented:

Reading

Writing

httr Wrappers

Utility

NOTE: To read in typical (~800MB-1GB) gzip'd WARC files, you should consider doing the following (in order) in your scripts:

options(java.parameters = "-Xmx2g")

library(rJava)
library(jwatjars)
library(jwatr)

That idiom generally provides enough heap space, but you may need to adjust the heap size if you've got larger payloads.

Alternatively, you can set the same option in your R startup scripts, but that will likely come back to bite you when moving workloads around.
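
To confirm that the heap setting actually took effect, you can ask the running JVM for its maximum heap size. The snippet below is a minimal sketch that assumes rJava and a working Java installation are available; the variable names (runtime, max_heap_mb) are illustrative and not part of jwatr.

```r
# Must be set before the JVM starts, i.e. before rJava (or any package
# that loads it) is attached.
options(java.parameters = "-Xmx2g")

library(rJava)
.jinit()  # starts the JVM if it is not already running

# Ask the JVM for its maximum heap size (in bytes) and convert to MiB.
runtime <- .jcall("java/lang/Runtime", "Ljava/lang/Runtime;", "getRuntime")
max_heap_mb <- .jcall(runtime, "J", "maxMemory") / 1024^2

message(sprintf("JVM max heap: %.0f MiB", max_heap_mb))
```

If the reported value is far below what you requested, the JVM was most likely started before the option was set, and the R session needs to be restarted.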

Installation

devtools::install_github("hrbrmstr/jwatr")
options(width=120)

Reading WARC Files

library(rJava)
library(jwatr)
library(magick)
library(tidyverse)

# current version
packageVersion("jwatr")
# small, uncompressed WARC file
glimpse(read_warc(system.file("extdata/bbc.warc", package="jwatr")))

# larger example
xdf <- read_warc(system.file("extdata/sample.warc.gz", package="jwatr"),
                 warc_types = "response", include_payload = TRUE)

glimpse(xdf)

# get the payload content
payload_content(url = xdf$target_uri[279], ctype = xdf$http_protocol_content_type[279], 
                xdf$http_raw_headers[[279]], xdf$payload[[279]])

# or ingest the raw bits yourself
imgs <- filter(xdf, grepl("(png|gif|jpeg)$", http_protocol_content_type))

imgs

image_read(imgs$payload[[1]])
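
Textual payloads can be handled in the same spirit as the images above. The following is a rough sketch using a small mock data frame in place of real read_warc() output, so it runs without jwatr; only the column names (target_uri, http_protocol_content_type, payload) come from the example above, and everything else (xdf_mock, is_text, text_bodies) is made up for illustration.

```r
# Mock stand-in for a read_warc(..., include_payload = TRUE) result:
# one HTML record and one (truncated) PNG record.
xdf_mock <- data.frame(
  target_uri = c("http://example.com/", "http://example.com/logo.png"),
  http_protocol_content_type = c("text/html; charset=UTF-8", "image/png"),
  stringsAsFactors = FALSE
)
xdf_mock$payload <- list(
  charToRaw("<html><body>Hello</body></html>"),
  as.raw(c(0x89, 0x50, 0x4e, 0x47))
)

# Select records whose content type marks them as text.
is_text <- grepl("^text/", xdf_mock$http_protocol_content_type)

# rawToChar() fails on embedded NUL bytes, which binary payloads often
# contain, so only the textual records are decoded.
text_bodies <- vapply(xdf_mock$payload[is_text], rawToChar, character(1))
names(text_bodies) <- xdf_mock$target_uri[is_text]

text_bodies
```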

Writing WARC Files

library(jwatr)
library(httr)
library(magick)
library(tidyverse)

tf <- tempfile("test")
wf <- warc_file(tf)

warc_write_response(wf, "https://rud.is/b/")

# store a simple httr::GET request
warc_write_response(wf, GET("https://rud.is/b/"))

warc_write_response(wf, "https://www.rstudio.com/")
warc_write_response(wf, "https://www.r-project.org/")

# all valid content types work, like this PDF
warc_write_response(wf, "http://che.org.il/wp-content/uploads/2016/12/pdf-sample.pdf")

# complex API calls can be made and the results stored in the WARC file as well
# this API call returns a JSON object
POST(
 url = "https://data.police.uk/api/crimes-street/all-crime",
 query = list( lat = "52.629729", lng = "-1.131592", date = "2017-01")
) -> uk_res

warc_write_response(wf, uk_res)
warc_write_response(wf, "https://journal.r-project.org/RLogo.png")

close_warc_file(wf)

xdf <- read_warc(sprintf("%s.warc.gz", tf), include_payload = TRUE)

glimpse(xdf)

# decode the WARC stored JSON response from the UK Crimes API
glimpse(jsonlite::fromJSON(rawToChar(xdf[6,]$payload[[1]]), flatten=TRUE))

select(xdf, content_length, http_protocol_content_type)

image_read(xdf$payload[[5]])

# remove the archive that was read back above
unlink(sprintf("%s.warc.gz", tf))

Streaming

The warc_stream_in() function provides a pure-R method for stream processing WARC files through the use of an R callback handler. One way of using this is to build a data frame. The following example builds a data frame of WARC response records. Space is reserved for a 10,000-element list which will get truncated or expanded as necessary:

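# pre-allocate a 10,000-slot list; list(10000) would create a one-element list
xdf <- vector("list", 10000)
xdf_i <- 0

# callback invoked once per matching record
myfun <- function(headers, payload, ...) {
  headers <- setNames(headers, gsub("-", "_", names(headers))) # syntactic column names
  xdf_i <<- xdf_i + 1
  headers$payload <- list(payload) # keep the raw payload as a list-column
  xdf[xdf_i] <<- list(headers)
}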

(n <- warc_stream_in(
  system.file("extdata/sample.warc.gz", package="jwatr"),
  myfun,
  warc_types = "response"
))

# keep only the slots that were actually filled
xdf <- bind_rows(xdf[seq_len(xdf_i)])

glimpse(xdf)

count(xdf, content_type)

cat(rawToChar(xdf$payload[[1]]))
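
The callback pattern above relies on global variables. One possible alternative is to keep the accumulator state inside a closure. The sketch below is not part of jwatr: make_collector() and its handler/result members are hypothetical names, and the loop only simulates a stream by calling the handler directly, assuming the same (headers, payload, ...) callback signature as myfun above.

```r
# Hypothetical helper: returns a handler plus an accessor for the results.
make_collector <- function(initial_size = 10000) {
  records <- vector("list", initial_size)  # pre-allocated storage
  n <- 0                                   # number of records stored so far

  handler <- function(headers, payload, ...) {
    n <<- n + 1
    # Grow the storage (roughly doubling it) when it is full.
    if (n > length(records)) {
      records <<- c(records, vector("list", max(1, length(records))))
    }
    names(headers) <- gsub("-", "_", names(headers))
    headers$payload <- list(payload)
    records[[n]] <<- headers
    invisible(NULL)
  }

  # Drop the unused, still-NULL slots when handing back the results.
  list(handler = handler, result = function() records[seq_len(n)])
}

# Simulated stream: five fake records pushed through the handler.
collector <- make_collector(initial_size = 2)
for (i in 1:5) {
  collector$handler(
    headers = list(`WARC-Type` = "response", `Content-Length` = i),
    payload = charToRaw(sprintf("body %d", i))
  )
}

records <- collector$result()
length(records)
```

With a real file, collector$handler would be passed as the callback to warc_stream_in(), and bind_rows(collector$result()) would then build the data frame as in the example above.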

Test Results

library(jwatr)
library(testthat)

date()

test_dir("tests/")

Code of Conduct

Please note that this project is released with a Contributor Code of Conduct. By participating in this project you agree to abide by its terms.



hrbrmstr/jwatr documentation built on May 31, 2019, 1:15 p.m.