retrieve_links: Retrieving Links of Lower-level web pages of...

View source: R/retrieve_links.R

retrieve_links    R Documentation

Description

retrieve_links retrieves the URLs of lower-level web pages linked from mementos stored in the Internet Archive.

Usage

retrieve_links(
  ArchiveUrls,
  encoding = "UTF-8",
  ignoreErrors = FALSE,
  filter = TRUE,
  pattern = NULL,
  nonArchive = FALSE
)

Arguments

ArchiveUrls

A string containing the URL of a memento from the Internet Archive.

encoding

Specify an encoding for the homepage. Default is 'UTF-8'.

ignoreErrors

Logical input. If TRUE, errors for individual URLs are ignored and scraping continues. Default is FALSE.

filter

Logical input. Filter links by top-level domain, so that only sub-domains of the homepage's top-level domain are returned. Default is TRUE.

pattern

Filter links by a custom pattern instead of the top-level domain. Default is NULL.

nonArchive

Logical input. Set to TRUE to use archiveRetriever to scrape web pages outside the Internet Archive. Default is FALSE.
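
The difference between the domain filter and a custom pattern can be illustrated with base R alone. The following is only a conceptual sketch: the link vector is made up, and treating pattern as a regular expression matched with grepl() is an assumption, not a description of how the package filters internally.

```r
# Hypothetical links, as they might be found in a memento of spiegel.de.
links <- c(
  "http://web.archive.org/web/20190801001228/https://www.spiegel.de/politik/",
  "http://web.archive.org/web/20190801001228/https://www.spiegel.de/sport/",
  "http://web.archive.org/web/20190801001228/https://www.example.com/ad"
)

# Roughly what filter = TRUE aims for: keep links on the homepage's domain.
same_domain <- links[grepl("spiegel.de", links, fixed = TRUE)]

# Roughly what a custom pattern aims for: keep links matching an expression.
# (Assumption: the pattern behaves like a regular expression.)
politics <- links[grepl("/politik/", links)]

print(same_domain)
print(politics)
```

In this sketch the domain filter keeps the two spiegel.de links, while the pattern keeps only the link containing "/politik/".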

Value

This function retrieves the links of all lower-level web pages of mementos of a homepage available from the Internet Archive. It returns a tibble containing the baseUrl and all links to lower-level web pages. Note that a memento being stored in the Internet Archive does not guarantee that the information from the homepage can actually be scraped. Because the Internet Archive is an online resource, a request can always fail due to connectivity problems. A simple remedy is to run the function again.
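
Re-running a failed request can be automated with a small helper. The sketch below is not part of archiveRetriever: retry(), its arguments max_tries and wait, and the stand-in function flaky() are all hypothetical names used for illustration.

```r
# Hypothetical helper: call fun(...) up to max_tries times,
# pausing wait seconds between failed attempts.
retry <- function(fun, ..., max_tries = 3, wait = 1) {
  last_error <- NULL
  for (i in seq_len(max_tries)) {
    result <- tryCatch(fun(...), error = function(e) e)
    if (!inherits(result, "error")) return(result)
    last_error <- result
    if (i < max_tries) Sys.sleep(wait)
  }
  stop("All ", max_tries, " attempts failed: ", conditionMessage(last_error))
}

# Stand-in for a request that fails twice before it succeeds.
attempts <- 0
flaky <- function(x) {
  attempts <<- attempts + 1
  if (attempts < 3) stop("connection problem")
  toupper(x)
}

out <- retry(flaky, "ok", wait = 0)
print(out)
```

With the package installed, the same helper could wrap the real call, for example retry(archiveRetriever::retrieve_links, "http://web.archive.org/web/20190801001228/https://www.spiegel.de/").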

Examples

## Not run: 
retrieve_links("http://web.archive.org/web/20190801001228/https://www.spiegel.de/")

## End(Not run)
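
The arguments from the Usage section can also be combined. The following sketch assumes that archiveRetriever is installed. The wrapper name links_matching() and the pattern "/politik/" are hypothetical, and nothing is downloaded until the wrapper is called.

```r
# Hypothetical wrapper around retrieve_links(); defining it needs no
# network access, calling it does.
links_matching <- function(memento, pattern) {
  archiveRetriever::retrieve_links(
    ArchiveUrls = memento,
    pattern = pattern,      # custom filter instead of the domain filter
    ignoreErrors = TRUE     # continue if a URL cannot be scraped
  )
}

## Not run:
# links_matching(
#   "http://web.archive.org/web/20190801001228/https://www.spiegel.de/",
#   pattern = "/politik/"
# )
## End(Not run)
```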

archiveRetriever documentation built on June 22, 2024, 10:54 a.m.