warc_stream_in: Stream in records from a WARC file

Description

This is a pure R function that streams in WARC records and calls a callback function with the WARC headers and payload for each record, optionally filtering by a subset of WARC record types.

Usage

warc_stream_in(path, handler, ..., warc_types = NULL)

Arguments

path

path to WARC file

handler

callback function to call for each record

...

optional arguments to handler

warc_types

if provided, only WARC records whose type matches one of those specified will be streamed in and passed to handler. Valid options are: warcinfo, request, response, resource, metadata, revisit, conversion.
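As a rough illustration of the warc_types filter, the sketch below restricts streaming to request and response records. It assumes that warc_types accepts a character vector and that the record type is exposed to the handler as headers$`warc-type`; neither detail is confirmed on this page. The names valid_types, wanted and show_type are made up for the example.

```r
# Record types listed in the documentation for `warc_types`.
valid_types <- c("warcinfo", "request", "response", "resource",
                 "metadata", "revisit", "conversion")

# Hypothetical selection: keep only request and response records.
wanted <- c("request", "response")
stopifnot(all(wanted %in% valid_types))

# Hypothetical handler: prints the record type and the payload size.
# Assumption: the type is available as headers$`warc-type`.
show_type <- function(headers, payload, ...) {
  cat(headers$`warc-type`, length(payload), "\n")
}

# Only runs when jwatr is installed.
# Assumption: `warc_types` accepts a character vector.
if (requireNamespace("jwatr", quietly = TRUE)) {
  jwatr::warc_stream_in(
    system.file("extdata/sample.warc.gz", package = "jwatr"),
    show_type,
    warc_types = wanted
  )
}
```

Checking the selection against the documented list first turns a misspelled type into an immediate error rather than a possibly empty result.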

Details

The signature of the callback function should be:

function(headers, payload, ...)
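A minimal sketch of a handler that uses the extra arguments: it tallies records and payload bytes into an environment. The names count_bytes and acc are invented for the example, and it assumes that named arguments given through ... are forwarded to the handler unchanged, that headers is a named list, and that payload is a raw vector.

```r
# Hypothetical handler: counts records and payload bytes in an environment
# supplied as the extra argument `acc` (a name chosen for this example).
count_bytes <- function(headers, payload, acc, ...) {
  acc$records <- acc$records + 1L
  acc$bytes <- acc$bytes + length(payload)
  invisible(NULL)
}

acc <- new.env()
acc$records <- 0L
acc$bytes <- 0

# Exercise the handler directly with made-up values
# (assumed shapes: `headers` a named list, `payload` a raw vector).
count_bytes(list(`content-length` = "5"), charToRaw("hello"), acc = acc)
count_bytes(list(`content-length` = "3"), charToRaw("abc"), acc = acc)

# With jwatr installed, the same handler can be passed to warc_stream_in();
# assumption: `acc` is forwarded to the handler through `...`.
if (requireNamespace("jwatr", quietly = TRUE)) {
  jwatr::warc_stream_in(
    system.file("extdata/sample.warc.gz", package = "jwatr"),
    count_bytes,
    acc = acc
  )
}
```

An environment is used because it is modified in place, so the totals survive across handler calls without relying on global variables.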

Value

the number of records processed (invisibly)
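Because the count is returned invisibly, nothing is printed unless the result is assigned or printed explicitly. The sketch below illustrates that convention with count_invisibly, a hypothetical stand-in that is not part of jwatr, followed by a guarded call to the real function.

```r
# Hypothetical stand-in, NOT jwatr code: it mimics only the documented
# return convention (a record count returned invisibly).
count_invisibly <- function(records) invisible(length(records))

count_invisibly(list("a", "b", "c"))       # prints nothing at top level
n <- count_invisibly(list("a", "b", "c"))  # assign to keep the count
print(n)

# The same pattern with the real function, only when jwatr is installed.
if (requireNamespace("jwatr", quietly = TRUE)) {
  n_records <- jwatr::warc_stream_in(
    system.file("extdata/sample.warc.gz", package = "jwatr"),
    function(headers, payload, ...) NULL   # no-op handler
  )
  print(n_records)
}
```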

Examples

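# Handler: check that each record's declared Content-Length matches
# the number of bytes in its payload
myfun <- function(headers, payload, ...) {
  print(as.numeric(headers$`content-length`) == length(payload))
}

# Stream the sample WARC file bundled with the package
warc_stream_in(
  system.file("extdata/sample.warc.gz", package="jwatr"),
  myfun
)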

hrbrmstr/jwatr documentation built on May 31, 2019, 1:15 p.m.