crawlR: CrawlR - Async Web Crawler for R

View source: R/crawlR.R

crawlR R Documentation

CrawlR - Async Web Crawler for R

Description

Batch-based web crawler that uses the asynchronous features of R's curl package to crawl a list of user-supplied websites to a given depth.

Each iteration consists of injecting seeds (if given), generating a fetch list, fetching pages to disk, parsing the fetched pages, and then updating the links in the crawlDB.

After initial seeding, subsequent iterations query the crawlDB to generate a fetch list. Additional seeds can be added at any time. Re-seeding with previously given seeds will re-crawl those seeds.
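
For example, a minimal call might look like the sketch below (the seed URLs, working directory, and limits are illustrative choices, not package defaults):

  library(crawlR)

  ## Illustrative seed list and working directory (adjust to your setup).
  seeds    <- c("https://www.r-project.org/", "https://cran.r-project.org/")
  work_dir <- "~/crawlR_work"

  ## Crawl the seeds plus the links found on them (depth = 2), with
  ## modest concurrency and a polite per-host delay.
  crawlR(
    seeds       = seeds,
    work_dir    = work_dir,
    depth       = 2,
    max_concurr = 20,
    crawl_delay = 30
  )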

Usage

crawlR(
  seeds = NULL,
  work_dir = NULL,
  out_dir = NULL,
  max_concurr = 50,
  max_concurr_host = 1,
  timeout = Inf,
  timeout_request = 30,
  external_site = F,
  crawl_delay = 30,
  max_size = 1e+07,
  regExIn = NULL,
  regExOut = NULL,
  depth = 1,
  max_depth = 3,
  queue_scl = 1,
  topN = NULL,
  max_urls_per_host = 10,
  parser = crawlR:::parse_content,
  score_func = NULL,
  min_score = 0,
  log_file = NULL,
  seeds_only = F,
  readability_content = F,
  overwrite = F
)

Arguments

seeds

Seed URLs. If NULL, then work_dir must contain a linkDB. If additional seeds are provided after initial seeding, the new seed URLs will be added to the linkDB and fetched.

work_dir

(Required) Main working directory.

out_dir

Directory to store crawled and parsed HTML. If NULL, defaults to the work directory.

max_concurr

Max. total concurrent connections open at any given time.

max_concurr_host

Max. total concurrent connections per host at any given time.

timeout

Total time allowed per iteration (i.e., across all URLs fetched at a given depth).

timeout_request

Per-URL request timeout.

external_site

If TRUE, the crawler will follow external links.

crawl_delay

Minimum delay (in seconds) between calls to the same host. Only applies if a delay is not specified by the host's robots.txt.

max_size

Max size of file or webpage to download and parse.

regExIn

URLs matching this regular expression will be used.

regExOut

URLs matching this regular expression will be filtered out, even if they also match regExIn (see the example following these arguments).

depth

Crawl depth for this crawl. A value of 1 crawls only the seed pages, 2 also crawls links found on the seeds, and so on.

max_depth

Whereas the 'depth' argument determines the depth of the current crawl, 'max_depth' sets a maximum overall depth so that no link deeper than this value will be selected for crawling during the generate phase.

queue_scl

(Deprecated) max_concurr * queue_scl gives the queue size.

topN

Select the 'topN' links based on score for crawling.

max_urls_per_host

Maximum number of URLs from each host when creating the fetch list at each link depth.

parser

Parsing function to use for page content.

score_func

URL Scoring Function.

min_score

Minimum score a URL must have to be selected during the generate phase.

log_file

Name of log file. If NULL, writes to stdout().

seeds_only

If TRUE, only seeds will be pulled from the linkDB.

readability_content

If TRUE, process page content using readability.

overwrite

If TRUE, existing data for a URL will be overwritten in the linkDB.
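
As an illustration of the filtering arguments, the call below keeps the crawl within example.org and skips common binary assets (the regular expressions and limits are made-up examples, not defaults):

  ## Hypothetical filters: stay on *.example.org pages, skip binary files.
  crawlR(
    seeds             = "https://www.example.org/",
    work_dir          = "~/crawlR_work",
    regExIn           = "^https?://([^/]+\\.)?example\\.org/",
    regExOut          = "\\.(pdf|jpg|png|zip)$",
    max_urls_per_host = 5,
    topN              = 100,
    depth             = 2
  )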

Details

Each phase of the process is contained within a function of the crawlR package:

  1. injectR - Inject seeds into crawlDB.

  2. generateR - Generate fetch list from crawlDB.

  3. fetchR - Fetch links in fetch list.

  4. parseR - Parse fetched pages.

  5. updateR - Update crawlDB.

These can be called individually or using the all-in-one crawlR function.
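
A single iteration could also be run phase by phase, roughly as sketched below. The arguments passed to the phase functions are assumptions made for illustration; this page does not document their actual signatures:

  library(crawlR)

  work_dir <- "~/crawlR_work"

  ## Illustrative only: argument names are assumed, not documented here.
  injectR(seeds = "https://www.example.org/", work_dir = work_dir)  # 1. inject seeds into crawlDB
  generateR(work_dir = work_dir)                                    # 2. generate fetch list
  fetchR(work_dir = work_dir)                                       # 3. fetch pages to disk
  parseR(work_dir = work_dir)                                       # 4. parse fetched pages
  updateR(work_dir = work_dir)                                      # 5. update crawlDB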

