generateR: Generate fetch list of URLs from the crawlDB

View source: R/generateR.R

generateR R Documentation

Generate fetch list of URLs from the crawlDB

Description

Queries the crawlDB for URLs matching the given parameters and generates a fetch list.

Usage

generateR(
  out_dir = NULL,
  work_dir = NULL,
  regExOut = NULL,
  regExIn = NULL,
  max_depth = NULL,
  topN = NULL,
  external_site = FALSE,
  max_urls_per_host = 10,
  crawl_delay = NULL,
  log_file = NULL,
  seeds_only = FALSE,
  min_score = 0
)

Arguments

out_dir

(Required) Output directory for this crawl.

work_dir

(Required) Working directory for this crawl.

regExOut

Regular-expression URL filter: links matching this pattern are omitted.

regExIn

Regular-expression URL filter: only links matching this pattern are kept.

max_depth

Maximum crawl depth of selected URLs.

topN

Select at most this many top-scoring links.

external_site

Logical. If FALSE, hosts outside the seed list will NOT be crawled.

max_urls_per_host

Maximum number of URLs to generate per host.

crawl_delay

Crawl delay between successive requests to the same host.

log_file

Name of the log file. If NULL, output is written to stdout().

seeds_only

Logical. If TRUE, generate the fetch list from seed URLs only.

min_score

Minimum score a URL must have to be included in the fetch list.
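
Examples

A minimal usage sketch. The directory paths, regex patterns, and parameter values below are hypothetical and assume an existing crawlDB populated by earlier crawlR steps:

```r
library(crawlR)

## Hypothetical call: generate a fetch list from the crawlDB,
## keeping only links under example.com and skipping common binary files.
generateR(
  out_dir  = "./crawl/out",          # hypothetical output directory
  work_dir = "./crawl/work",         # hypothetical working directory
  regExIn  = "example\\.com",        # keep links matching this pattern
  regExOut = "\\.(pdf|jpg|png)$",    # omit links matching this pattern
  topN     = 1000,                   # select at most the top 1000 links
  max_urls_per_host = 10,            # limit URLs generated per host
  crawl_delay = 5                    # delay between requests to one host
)
```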


barob1n/crawlR documentation built on May 23, 2023, 10:53 a.m.