generateR: Generate fetch list of URLs from the crawlDB

View source: R/generateR.R

generateR R Documentation

Generate fetch list of URLs from the crawlDB

Description

Queries the crawlDB for URLs matching the given parameters and generates a fetch list.

Usage

generateR(
  out_dir = NULL,
  work_dir = NULL,
  regExOut = NULL,
  regExIn = NULL,
  max_depth = NULL,
  topN = NULL,
  external_site = FALSE,
  max_urls_per_host = 10,
  crawl_delay = NULL,
  log_file = NULL,
  seeds_only = FALSE,
  min_score = 0
)

Arguments

out_dir

(Required) Output directory for this crawl.

work_dir

(Required) Working directory for this crawl.

regExOut

Regular-expression URL filter: links matching this pattern are omitted.

regExIn

Regular-expression URL filter: only links matching this pattern are kept.

max_depth

Maximum crawl depth of selected URLs.

topN

Select at most this many top-scoring links.

external_site

Logical. If FALSE, hosts outside the seed list will NOT be crawled.

max_urls_per_host

Maximum number of URLs to generate per host.

crawl_delay

Crawl delay between successive requests to the same host.

log_file

Name of the log file. If NULL, output is written to stdout().

seeds_only

Logical. If TRUE, generate the fetch list from seed URLs only.

min_score

Minimum score a URL must have to be included in the fetch list.
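
Examples

A minimal usage sketch. The directory paths, regex patterns, and parameter values below are hypothetical and assume an existing crawlDB populated by earlier crawlR steps:

```r
library(crawlR)

## Hypothetical call: generate a fetch list from the crawlDB,
## keeping only links under example.com and skipping common binary files.
generateR(
  out_dir  = "./crawl/out",          # hypothetical output directory
  work_dir = "./crawl/work",         # hypothetical working directory
  regExIn  = "example\\.com",        # keep links matching this pattern
  regExOut = "\\.(pdf|jpg|png)$",    # omit links matching this pattern
  topN     = 1000,                   # select at most the top 1000 links
  max_urls_per_host = 10,            # limit URLs generated per host
  crawl_delay = 5                    # delay between requests to one host
)
```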


barob1n/crawlR documentation built on May 23, 2023, 10:53 a.m.