response_list_to_warc_file: Turns a list of 'httr' 'response' objects into a WARC file


Description

You may not want to change your existing workflows to use the httr GET and POST helpers. It is not uncommon to lapply or purrr::map a series of httr verb calls into a list of response objects. Those who have been bitten by the intermittent HTTP errors that cause scraping loops to fail will also likely be using purrr::safely to wrap httr verb calls, so that the loop still succeeds in capturing some information.
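The safely-wrapped workflow described above might look like the following. This is a minimal sketch, not taken from the package itself: the name safe_GET is illustrative, the URLs are the ones used in the Examples section, and the code needs network access plus the httr, purrr and jwatr packages.

```r
library(httr)
library(purrr)
library(jwatr)

urls <- c("https://rud.is/", "https://rud.is/b/")

# purrr::safely() returns a wrapped function that never throws; each call
# yields list(result = <response or NULL>, error = <condition or NULL>)
safe_GET <- purrr::safely(httr::GET)  # illustrative name

# One safely-wrapped result per URL, even if some requests fail
res_list <- purrr::map(urls, safe_GET)

# Write the whole list out as a single WARC file
tf <- tempfile()
response_list_to_warc_file(res_list, tf)
```

Because each element keeps its own error slot, a failed request does not abort the loop.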

Usage

response_list_to_warc_file(httr_response_list, path, gzip = TRUE,
  warc_date = Sys.time(), warc_record_id = NULL,
  warc_info = list(
    software = sprintf("jwatr %s", packageVersion("jwatr")),
    format = "WARC File Format 1.0"))

Arguments

httr_response_list

a list of httr response objects or a list of safely-wrapped httr response objects (i.e. httr::GET was wrapped with purrr::safely).

path

path (dir + base file name) to the created WARC file

gzip

should the WARC file be gzip'd?

warc_date

A supplied POSIXct timestamp to use to timestamp the WARC file. Current time will be used if none supplied.

warc_record_id

A unique identifier for the WARC record. If not provided, one will be generated with UUIDgenerate.

warc_info

a named list of fields to go into the payload of the warcinfo record that will be at the top of the WARC file

Details

This function makes it easy to turn a list of these response objects (wrapped or plain) into a WARC file. Sure, you can save an R list to an R data file, but that won't be usable by folks outside the R ecosystem. Plus, there are scads of tools that can work with WARC files, including those in large-scale data processing environments.

List elements that are not plain or "safe" response objects will be gracefully skipped over.
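To see which elements of a mixed list are likely to be written, a check along these lines can help. It is only a sketch of the idea, not the package's actual logic: is_usable_response is a hypothetical helper, and it assumes that plain responses carry the class "response", that safely-wrapped ones are lists with result and error elements, and that a wrapped failure (NULL result) is skipped as well. The response objects are hand-built stand-ins so the code runs without network access.

```r
# Hypothetical helper (not part of jwatr): TRUE when x is a plain httr
# response or a purrr::safely() result whose `result` is an httr response
is_usable_response <- function(x) {
  if (inherits(x, "response")) return(TRUE)
  is.list(x) &&
    all(c("result", "error") %in% names(x)) &&
    inherits(x$result, "response")
}

# Stand-in for a real httr response, built by hand for illustration only
fake_resp <- structure(list(url = "https://rud.is/"), class = "response")

mixed <- list(
  fake_resp,                                         # plain response
  list(result = fake_resp, error = NULL),            # safely-wrapped success
  list(result = NULL, error = simpleError("boom")),  # safely-wrapped failure
  "not a response"                                   # unrelated element
)

# One logical flag per element of `mixed`
usable <- vapply(mixed, is_usable_response, logical(1))
```

Running such a check before writing the file gives a quick count of how many records to expect.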

Examples

## Not run: 
urls <- c("https://rud.is/", "https://rud.is/b/")

res_list <- lapply(urls, httr::GET)

tf <- tempfile()
response_list_to_warc_file(res_list, tf)
unlink(tf)

## End(Not run)

hrbrmstr/jwatr documentation built on May 31, 2019, 1:15 p.m.