Takes an optionally compressed WARC file as input, creates a CDX file
of warc_record_types with the specified fields (if available), and
writes it to cdx_path. If the WARC file is compressed, the CDX/WARC
specification expects each WARC record to be in its own "gzstream" (i.e., you
can't just gzip a plaintext WARC file and expect any CDX indexer
to work).
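To illustrate the "one gzstream per record" layout, here is a minimal base R sketch. It assumes that appending to a `gzfile()` connection starts a fresh gzip member on each open/close cycle; the record strings are placeholders, not complete WARC records, and nothing here is part of this package.

```r
# Placeholder strings stand in for complete WARC records (headers + payload).
records <- c(
  "WARC/1.0\r\nWARC-Type: response\r\n\r\n(placeholder record 1)\r\n\r\n",
  "WARC/1.0\r\nWARC-Type: response\r\n\r\n(placeholder record 2)\r\n\r\n"
)

out_path <- tempfile(fileext = ".warc.gz")

# Open, write, and close once per record so that every record is
# compressed separately and the results are concatenated in one file.
for (rec in records) {
  con <- gzfile(out_path, open = "ab")
  writeBin(charToRaw(rec), con)
  close(con)
}

# Read the concatenated streams back in chunks until nothing is left.
con <- gzfile(out_path, open = "rb")
chunks <- list()
repeat {
  chunk <- readBin(con, what = "raw", n = 65536L)
  if (length(chunk) == 0L) break
  chunks[[length(chunks) + 1L]] <- chunk
}
close(con)
round_trip <- rawToChar(do.call(c, chunks))
```

By contrast, running gzip once over a whole plaintext WARC file yields a single stream, so an indexer has no per-record compressed offsets to record.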
create_cdx(warc_path, warc_record_types = "response",
  field_spec = "abmsrVgu", cdx_path)
warc_path | path to the WARC file to index
warc_record_types | the WARC record types to index in
field_spec | (See
cdx_path | where to output the CDX file
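A hedged usage sketch follows. The file names are hypothetical, and the `exists()` / `file.exists()` guard is only there so the snippet runs harmlessly when the package providing `create_cdx()` is not loaded or the input file is missing.

```r
# Hypothetical input path; replace with a real per-record-gzipped WARC file.
warc_path <- "crawl.warc.gz"

# Derive an output name next to the input (an illustrative convention only).
cdx_path <- sub("\\.warc(\\.gz)?$", ".cdx", warc_path)

# Only call the indexer when the function and the input file are available.
if (exists("create_cdx") && file.exists(warc_path)) {
  create_cdx(
    warc_path,
    warc_record_types = "response",
    field_spec = "abmsrVgu",
    cdx_path = cdx_path
  )
}
```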
Use an atomic character vector of single-character CDX field specifications
in the order you want them in the CDX file. The default value
"abmsrVgu" is taken from the defaults used by wget in
"WARC mode" and will output the:
a: original url
b: date
m: mime type of original document
s: response code
r: redirect
V: compressed arc file offset
g: file name
u: URN (warc record id)
in that order.
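The mapping from a field specification string to column meanings can be sketched in base R. The lookup table covers only the eight letters in the default value, and the names `cdx_fields` and `describe_field_spec` are made up for this illustration; they are not part of the package.

```r
# Lookup table for the letters used by the default field_spec.
cdx_fields <- c(
  a = "original url",
  b = "date",
  m = "mime type of original document",
  s = "response code",
  r = "redirect",
  V = "compressed arc file offset",
  g = "file name",
  u = "URN (warc record id)"
)

# Split a spec string into single letters and look each one up, in order.
# Letters missing from the table come back as NA.
describe_field_spec <- function(field_spec) {
  spec_letters <- strsplit(field_spec, "")[[1]]
  unname(cdx_fields[spec_letters])
}

columns <- describe_field_spec("abmsrVgu")
print(columns)
```

Because the lookup follows the order of the letters, a spec such as "ua" would describe the URN column first and the original url second.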
Currently, "response" is the only supported value for
warc_record_types, and only gz WARC files are indexed...hey, it's alpha s/w.
https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/