Newer versions of wget support capturing or mirroring a list of URLs
into a WARC archive. wget is tailor-made for gathering web content, and
if you have to scrape a large number of sites, it is far more efficient,
and kinder to the web site owners, to save the content to a WARC archive
for future use.
create_warc_wget(url_list, warc_path = ".", user_agent = "r-warc",
  max_redirects = 5, tries = 2, waitretry = 1, timeout = 5,
  warc_header = "Source: R warc package", warc_cdx = TRUE,
  warc_file = "r-warc", no_warc_keep_log = TRUE, warc_max_size = "1G",
  warc_tempdir = tempdir(), no_output = TRUE, .opts = NULL)
url_list |
character vector of URLs or a |
warc_path |
path where to store WARC archive output |
user_agent, max_redirects, tries, waitretry, timeout, warc_header, warc_cdx |
options for |
warc_file, no_warc_keep_log, warc_max_size, warc_tempdir |
options for |
no_output |
should the URL content associated with each URL also be saved in individual files? |
.opts |
a character vector of other valid options for |
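To make the arguments above more concrete, here is a minimal sketch of how the documented defaults could correspond to wget's long command-line options. The helper name build_wget_args is hypothetical and is not part of the package; the mapping is an assumption based on wget's own option names, not a copy of the package's internals. The no_output argument is left out because this page does not say which wget option it corresponds to.

```r
# Hypothetical helper (not part of the package): translate the documented
# defaults into a character vector of wget long options.
build_wget_args <- function(user_agent = "r-warc", max_redirects = 5,
                            tries = 2, waitretry = 1, timeout = 5,
                            warc_header = "Source: R warc package",
                            warc_cdx = TRUE, warc_file = "r-warc",
                            no_warc_keep_log = TRUE, warc_max_size = "1G",
                            warc_tempdir = tempdir(), .opts = NULL) {
  c(
    # Values that may contain spaces are shell-quoted for use with system2()
    sprintf("--user-agent=%s", shQuote(user_agent)),
    sprintf("--max-redirect=%d", as.integer(max_redirects)),
    sprintf("--tries=%d", as.integer(tries)),
    sprintf("--waitretry=%d", as.integer(waitretry)),
    sprintf("--timeout=%d", as.integer(timeout)),
    sprintf("--warc-header=%s", shQuote(warc_header)),
    if (warc_cdx) "--warc-cdx",                 # flag options appear only when TRUE
    sprintf("--warc-file=%s", shQuote(warc_file)),
    if (no_warc_keep_log) "--no-warc-keep-log",
    sprintf("--warc-max-size=%s", warc_max_size),
    sprintf("--warc-tempdir=%s", shQuote(warc_tempdir)),
    .opts                                       # any extra options, passed through as-is
  )
}

args <- build_wget_args(.opts = "--no-check-certificate")
print(args)
```

Logical arguments become bare flags that are either present or absent, while everything else becomes an option=value pair; .opts is appended unchanged so that extra options reach wget untouched.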
wget must be available on the system PATH and be compiled with
WARC support to use this function. You can find statically linked binaries
for 32- and 64-bit systems here.
Note that some command-line options are not available in the Windows version of
wget.
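Before calling the function, it can help to confirm that a suitable wget is actually reachable. The following is a rough sketch using only base R; the helper names help_mentions_warc and wget_has_warc are hypothetical, and the check assumes that a WARC-capable wget lists the --warc-file option in its --help output.

```r
# Hypothetical helper: TRUE if the help text mentions the --warc-file option.
help_mentions_warc <- function(help_text) {
  any(grepl("--warc-file", help_text, fixed = TRUE))
}

# Hypothetical helper: TRUE only when a wget binary is found on the PATH
# and its --help output advertises WARC options.
wget_has_warc <- function() {
  wget <- unname(Sys.which("wget"))
  if (!nzchar(wget)) return(FALSE)   # no wget on the PATH
  help_text <- tryCatch(
    suppressWarnings(system2(wget, "--help", stdout = TRUE, stderr = TRUE)),
    error = function(e) character(0)
  )
  help_mentions_warc(help_text)
}

if (!wget_has_warc()) {
  message("No WARC-capable wget found on the PATH.")
}
```

Keeping the text check separate from the system call makes it possible to test the detection logic without wget installed.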
The defaults for the parameters do not "mirror" web sites but will follow a sane
number of redirects and will grab the default content at the URLs in url_list.
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US)"
is a good user agent to use for sites that are expecting a browser.
http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
## Not run: 
create_warc_wget(c("http://rud.is/", "http://had.co.nz/",
                   "http://rstudio.com/", "http://rapid7.com/"),
                 "~/data/webarchive/example")
cdx <- read_cdx("~/data/webarchive/example/r-warc.cdx")

## End(Not run)