Collect.web: Collect hyperlinks from web pages

View source: R/Collect.web.R

Collect.web R Documentation

Collect hyperlinks from web pages

Description

Collects hyperlinks from web pages and structures the data into a dataframe with the class names "datasource" and "web".

Usage

## S3 method for class 'web'
Collect(credential, pages = NULL, writeToFile = FALSE, verbose = FALSE, ...)

collect_web_hyperlinks(pages = NULL, writeToFile = FALSE, verbose = FALSE, ...)

Arguments

credential

A credential object generated from Authenticate with class name "web".

pages

Dataframe. Dataframe of web pages to crawl. The dataframe must have the columns page (character), type (character) and max_depth (integer). Each row is a seed web page to crawl, with the page value being the page URL. The type value is the type of crawl: "int" directs the crawler to follow only internal links, "ext" to follow only external links (links to a different domain than the seed page), and "all" to follow all links. The max_depth value determines how many levels of hyperlinks to follow from the seed site. A minimal sketch of constructing this dataframe is given after this argument list.

writeToFile

Logical. Write collected data to file. Default is FALSE.

verbose

Logical. Output additional information. Default is FALSE.

...

Additional parameters passed to the function. Not used in this method.
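
As referenced in the pages entry above, a minimal sketch of a valid pages dataframe; the URLs are placeholders and the tibble package is assumed to be available:

pages <- tibble::tibble(
  page = c("https://example.org", "https://example.com"),  # seed page URLs (placeholders)
  type = c("int", "ext"),   # crawl type per seed: internal links only, external links only (or "all")
  max_depth = c(2, 1)       # levels of hyperlinks to follow from each seed
)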

Value

A tibble object with class names "datasource" and "web".
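
A quick check of the returned object, a sketch assuming webData holds a result from the example below:

class(webData)
# expected to include "datasource" and "web" alongside the usual
# tibble classes such as "tbl_df", "tbl" and "data.frame"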

Examples

## Not run: 
pages <- tibble::tibble(page = c("http://vosonlab.net",
                                 "https://rsss.cass.anu.edu.au"),
                        type = c("int", "all"),
                        max_depth = c(2, 2))

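# webAuth is a "web" credential created beforehand with Authenticate("web")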
webData <- webAuth |>
  Collect(pages, writeToFile = TRUE)

## End(Not run)
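
For context, a minimal end-to-end sketch of the same collection including the authentication step; it assumes vosonSML is installed and that the seed sites permit crawling:

## Not run: 
library(vosonSML)

webData <- Authenticate("web") |>
  Collect(pages, writeToFile = FALSE, verbose = TRUE)

## End(Not run)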
