Newer versions of wget are designed to support capturing or mirroring a list of URLs into a WARC archive. wget is tailor-made for gathering web content, and if you have to scrape a large number of sites, it is far more efficient, and kinder to the web site owners, to save the content to a WARC archive for future use.
create_warc_wget(url_list, warc_path = ".", user_agent = "r-warc",
  max_redirects = 5, tries = 2, waitretry = 1, timeout = 5,
  warc_header = "Source: R warc package", warc_cdx = TRUE,
  warc_file = "r-warc", no_warc_keep_log = TRUE, warc_max_size = "1G",
  warc_tempdir = tempdir(), no_output = TRUE, .opts = NULL)
url_list | character vector of URLs or a
warc_path | path where to store WARC archive output
user_agent, max_redirects, tries, waitretry, timeout, warc_header, warc_cdx | options for
warc_file, no_warc_keep_log, warc_max_size, warc_tempdir | options for
no_output | should the URL content associated with each URL also be saved in individual files?
.opts | a character vector of other valid options for
wget must be available on the system PATH and be compiled with WARC support to use this function. You can find statically linked binaries for 32- and 64-bit systems here.
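A quick way to check both requirements from R is sketched below. The helper name has_warc_wget is hypothetical and not part of the warc package; it assumes that a WARC-enabled build lists the --warc-file option in its --help output.

```r
# Hypothetical helper (not part of the warc package).
# Returns TRUE when a wget binary is found on the PATH and its help text
# mentions --warc-file, which is taken here as a sign of WARC support.
has_warc_wget <- function(wget = "wget") {
  path <- unname(Sys.which(wget))
  if (!nzchar(path)) return(FALSE)  # no such binary on the PATH
  help_text <- tryCatch(
    suppressWarnings(system2(path, "--help", stdout = TRUE, stderr = TRUE)),
    error = function(e) character(0)
  )
  any(grepl("--warc-file", help_text, fixed = TRUE))
}

has_warc_wget()
```

If this returns FALSE, create_warc_wget is unlikely to work until a WARC-capable wget is installed and visible on the PATH.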
Note that some command line options are not available in the Windows version of wget.
The defaults for the parameters do not "mirror" web sites but will follow a sane number of redirects and will grab the default content at the URLs in url_list.
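To make the defaults concrete, the sketch below assembles a wget argument vector from the documented default values. It is an illustration only: build_wget_args and its url_file parameter are hypothetical, and the mapping from function parameters to wget flags is an assumption rather than the package's actual implementation. No recursion or mirroring flag is included, which matches the non-mirroring behaviour described above.

```r
# Hypothetical sketch: how the documented defaults might translate into
# wget command line options. The parameter-to-flag mapping is assumed.
build_wget_args <- function(url_file, warc_file = "r-warc",
                            user_agent = "r-warc", max_redirects = 5,
                            tries = 2, waitretry = 1, timeout = 5,
                            warc_header = "Source: R warc package",
                            warc_cdx = TRUE, warc_max_size = "1G") {
  args <- c(
    sprintf("--input-file=%s", url_file),      # file with one URL per line
    sprintf("--user-agent=%s", user_agent),
    sprintf("--max-redirect=%d", as.integer(max_redirects)),
    sprintf("--tries=%d", as.integer(tries)),
    sprintf("--waitretry=%d", as.integer(waitretry)),
    sprintf("--timeout=%d", as.integer(timeout)),
    sprintf("--warc-file=%s", warc_file),
    sprintf("--warc-header=%s", warc_header),
    sprintf("--warc-max-size=%s", warc_max_size)
  )
  if (warc_cdx) args <- c(args, "--warc-cdx")  # also write a CDX index
  args
}

build_wget_args("urls.txt")
```

Values that contain spaces, such as the WARC header, would need quoting (for example with shQuote) before being handed to a shell.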
"Mozilla/5.0 (Windows; U; Windows NT 6.1; en-US)" is a good user agent to use for sites that are expecting a browser.
http://www.archiveteam.org/index.php?title=Wget_with_WARC_output
## Not run:
create_warc_wget(c("http://rud.is/", "http://had.co.nz/",
                   "http://rstudio.com/", "http://rapid7.com/"),
                 "~/data/webarchive/example")
cdx <- read_cdx("~/data/webarchive/example/r-warc.cdx")
## End(Not run)