Takes as input an optionally compressed WARC file, creates a CDX file of warc_record_types with the specified fields (if available), and writes it to cdx_path. If the WARC file is compressed, the CDX/WARC specification expects each WARC record to be in its own "gzstream" (i.e., you can't just gzip a plaintext WARC file and expect any CDX indexer to work).
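The "one gzstream per record" layout can be illustrated with a minimal base R sketch. It is not part of this package: write_gz_members is a hypothetical helper, the record strings are placeholders rather than valid WARC records, and it assumes that reopening a gzfile connection in append mode starts a new gzip member on each open.

```r
# Hypothetical helper (not part of this package): write each record as its
# own gzip member and return the byte offset at which each member starts.
write_gz_members <- function(records, path) {
  if (file.exists(path)) file.remove(path)
  offsets <- numeric(length(records))
  for (i in seq_along(records)) {
    # The next member starts wherever the file currently ends.
    offsets[i] <- if (file.exists(path)) file.size(path) else 0
    con <- gzfile(path, open = "ab")  # assumption: each open/close adds one member
    writeBin(charToRaw(records[[i]]), con)
    close(con)
  }
  offsets
}

# Placeholder payloads standing in for real WARC records.
records <- c("WARC/1.0\r\nWARC-Type: response\r\n\r\nfirst\r\n\r\n",
             "WARC/1.0\r\nWARC-Type: response\r\n\r\nsecond\r\n\r\n")
gz_path <- tempfile(fileext = ".warc.gz")
offsets <- write_gz_members(records, gz_path)

# Check that every recorded offset points at the gzip magic bytes 1f 8b.
bytes <- readBin(gz_path, what = "raw", n = file.size(gz_path))
starts_with_magic <- vapply(offsets, function(o) {
  identical(bytes[o + 1:2], as.raw(c(0x1f, 0x8b)))
}, logical(1))
```

Because every record begins at a known compressed offset, an indexer can seek straight to it and decompress only that record. A file produced by gzipping a whole plaintext WARC in one pass has a single member, so no such per-record offsets exist.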
create_cdx(warc_path, warc_record_types = "response",
  field_spec = "abmsrVgu", cdx_path)
warc_path | path to the WARC file to index
warc_record_types | the WARC record types to index
field_spec | (See the field specification details below)
cdx_path | where to output the CDX file
Use an atomic character vector of single character CDX field specifications in the order you want them in the CDX file. The default value "abmsrVgu" is taken from the defaults used by wget in "WARC mode" and will output the:
original url
date
mime type of original document
response code
redirect
compressed arc file offset
file name
URN (warc record id)
in that order.
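The letter-to-field correspondence can be sketched in base R as follows. The lookup table pairs each letter of the default "abmsrVgu" with the field names listed above, in the same order. describe_field_spec is a hypothetical helper for illustration, not a function from this package, and it only knows the eight letters of the default spec.

```r
# Letters of the default spec paired with the field names listed above.
cdx_field_names <- c(
  a = "original url",
  b = "date",
  m = "mime type of original document",
  s = "response code",
  r = "redirect",
  V = "compressed arc file offset",
  g = "file name",
  u = "URN (warc record id)"
)

# Hypothetical helper: expand a field_spec string into field names,
# preserving the order in which the letters appear.
describe_field_spec <- function(field_spec) {
  spec_letters <- strsplit(field_spec, "")[[1]]
  unknown <- setdiff(spec_letters, names(cdx_field_names))
  if (length(unknown) > 0) {
    stop("unrecognised CDX field letter(s): ", paste(unknown, collapse = ", "))
  }
  cdx_field_names[spec_letters]
}

default_fields <- describe_field_spec("abmsrVgu")
```

Reordering or dropping letters changes the columns accordingly. For example, describe_field_spec("ua") would list the URN before the original url.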
Only "response" is currently supported as a value for warc_record_types, and only gzip-compressed WARC files are indexed... hey, it's alpha s/w.
https://iipc.github.io/warc-specifications/specifications/cdx-format/cdx-2015/