create_cdx: Create a CDX from a WARC file


View source: R/create_cdx.r

Description

Takes an optionally compressed WARC file as input, indexes the records matching warc_record_types with the specified fields (if available), and writes the resulting CDX file to cdx_path. If the WARC file is compressed, the CDX/WARC specification expects each WARC record to be in its own "gzstream" (i.e., you can't just gzip a plaintext WARC file and expect any CDX indexer to work).

Usage

create_cdx(warc_path, warc_record_types = "response",
  field_spec = "abmsrVgu", cdx_path)
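A minimal sketch of a call, assuming the warc package (hrbrmstr/warc) is installed. The file names crawl.warc.gz and crawl.cdx are hypothetical placeholders, and the call is guarded so the snippet does nothing when the package or the input file is missing:

```r
# Hypothetical paths; replace with your own files.
warc_path <- "crawl.warc.gz"
cdx_path  <- "crawl.cdx"

# Only run when the package and the input file are actually available.
if (requireNamespace("warc", quietly = TRUE) && file.exists(warc_path)) {
  warc::create_cdx(
    warc_path,
    warc_record_types = "response",  # the documented default
    field_spec = "abmsrVgu",         # the documented default
    cdx_path = cdx_path
  )
  # Peek at the first few lines of the index that was written.
  print(readLines(cdx_path, n = 5))
}
```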

Arguments

warc_path

path to the WARC file to index

warc_record_types

the WARC record types to index in the CDX file. Should be a character vector of record type names or "all" to index all records. NOTE: Most CDX files index WARC response records.

field_spec

(See Details)

cdx_path

where to output the CDX file

Details

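Use a single string of single-character CDX field specifications in the order you want them in the CDX file. The default value "abmsrVgu" is taken from the defaults used by wget in "WARC mode" and, per the field legend in the CDX specification listed under References, will output the:

a: original URL
b: date
m: MIME type of the original document
s: response code
r: redirect
V: compressed arc file offset
g: file name
u: URN

in that order.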
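To check which columns a given field_spec would produce, the string can be split into its letters and looked up in a legend. This is only a sketch: cdx_legend and describe_field_spec are hypothetical names, not part of the warc package, and the legend is transcribed from the CDX specification listed under References and covers only the default letters.

```r
# Legend for the default letters, transcribed from the CDX format legend.
# This table is an assumption of the sketch; the package does not export it.
cdx_legend <- c(
  a = "original URL",
  b = "date",
  m = "MIME type of original document",
  s = "response code",
  r = "redirect",
  V = "compressed arc file offset",
  g = "file name",
  u = "URN"
)

# Hypothetical helper: split a field_spec into letters and label each one.
describe_field_spec <- function(field_spec, legend = cdx_legend) {
  letters_in_spec <- strsplit(field_spec, "", fixed = TRUE)[[1]]
  labels <- unname(legend[letters_in_spec])
  labels[is.na(labels)] <- "(not in this legend)"
  data.frame(field = letters_in_spec, meaning = labels,
             stringsAsFactors = FALSE)
}

describe_field_spec("abmsrVgu")
```

Because the lookup is case-sensitive, "V" and "v" are treated as different fields, which matches how the CDX legend distinguishes them.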

Note

Currently, "response" is the only supported value for warc_record_types, and only gzip-compressed WARC files are indexed. Hey, it's alpha software.

References

https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/


hrbrmstr/warc documentation built on May 17, 2019, 5:53 p.m.