create_warc_wget: Use wget to create a WARC archive for a URL list


View source: R/create_warc.r

Description

Newer versions of wget can capture or mirror a list of URLs into a WARC archive. wget is tailor-made for gathering web content, and if you have to scrape a large number of sites, it is far more efficient, and kinder to the web site owners, to save the content to a WARC archive for future use.

Usage

create_warc_wget(url_list, warc_path = ".", user_agent = "r-warc",
  max_redirects = 5, tries = 2, waitretry = 1, timeout = 5,
  warc_header = "Source: R warc package", warc_cdx = TRUE,
  warc_file = "r-warc", no_warc_keep_log = TRUE, warc_max_size = "1G",
  warc_tempdir = tempdir(), no_output = TRUE, .opts = NULL)

Arguments

url_list

character vector of URLs or a connection to a file with one URL per line

warc_path

path where to store WARC archive output

user_agent, max_redirects, tries, waitretry, timeout, warc_header, warc_cdx

options for wget

warc_file, no_warc_keep_log, warc_max_size, warc_tempdir

options for wget

no_output

should the URL content associated with each URL also be saved in individual files?

.opts

a character vector of other valid options for wget that will be appended to the system2 call args.
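
As a rough illustration of how .opts might be used, the sketch below builds a URL vector from a file and a vector of extra wget flags, then wraps the call in a helper. The helper name archive_with_opts is hypothetical and not part of the package. It assumes the function is exported as warc::create_warc_wget and that each element of .opts becomes one command-line argument, as the description above suggests. --page-requisites and --wait are standard wget options. The helper is defined but not called, since running it needs the warc package, wget, and network access.

```r
# Write a small URL list to a temporary file, one URL per line,
# then read it back as a character vector.
url_file <- tempfile(fileext = ".txt")
writeLines(c("http://rud.is/", "http://had.co.nz/"), url_file)
urls <- readLines(url_file)

# Extra wget flags: one element per command-line argument.
extra_opts <- c("--page-requisites", "--wait=1")

# Hypothetical wrapper (not part of the warc package). Assumes the
# package exports create_warc_wget() with the arguments shown in Usage.
archive_with_opts <- function(url_list, out_dir, opts = extra_opts) {
  warc::create_warc_wget(url_list, warc_path = out_dir, .opts = opts)
}
```

A call such as archive_with_opts(urls, "~/data/webarchive/example") would then pass both flags through to wget in addition to the defaults.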

Details

wget must be available on the system PATH and be compiled with WARC support to use this function. You can find statically linked binaries for 32- and 64-bit systems here. Note that some command-line options are not available in the Windows version of wget.
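
One possible way to check both requirements before calling the function is sketched below. The helper wget_has_warc is hypothetical and not part of the package. It assumes that a WARC-enabled wget lists the --warc-file option in its --help output, which may not hold for every build.

```r
# Hypothetical helper (not part of the warc package): returns TRUE if a
# wget binary is found on the PATH and its help text mentions --warc-file.
wget_has_warc <- function() {
  wget_bin <- Sys.which("wget")
  if (!nzchar(wget_bin)) return(FALSE)  # wget not found on the PATH

  # Capture the help text; treat any failure to run wget as "no support".
  help_txt <- tryCatch(
    suppressWarnings(
      system2(wget_bin, "--help", stdout = TRUE, stderr = TRUE)
    ),
    error = function(e) character(0)
  )

  any(grepl("--warc-file", help_txt, fixed = TRUE))
}

if (!wget_has_warc()) {
  message("No WARC-capable wget found on the PATH.")
}
```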

The defaults for the parameters do not "mirror" web sites but will follow a sane number of redirects and will grab the default content at the URLs in url_list.

Note

"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US)" is a good user agent to use for sites that are expecting a browser.

References

http://www.archiveteam.org/index.php?title=Wget_with_WARC_output

Examples

## Not run: 
create_warc_wget(c("http://rud.is/", "http://had.co.nz/",
                   "http://rstudio.com/", "http://rapid7.com/"),
                   "~/data/webarchive/example")
cdx <- read_cdx("~/data/webarchive/example/r-warc.cdx")

## End(Not run)

hrbrmstr/warc documentation built on May 17, 2019, 5:53 p.m.