crawlR: CrawlR - Async Web Crawler for R

View source: R/crawlR.R

crawlR R Documentation

CrawlR - Async Web Crawler for R

Description

Batch-based web crawler that uses the asynchronous features of R's curl package to crawl a list of user-supplied websites to a given depth.

Each iteration consists of injecting seeds (if given), generating a fetch list, fetching pages to disk, parsing the fetched pages, and then updating the links in the crawlDB.

After initial seeding, subsequent iterations query the crawlDB to generate a fetch list. Additional seeds can be added at any time. Re-seeding with previously given seeds will re-crawl those seeds.
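
For example, a minimal call might look like the sketch below (the seed URLs, working directory, and limits are illustrative choices, not package defaults):

  library(crawlR)

  ## Illustrative seed list and working directory (adjust to your setup).
  seeds    <- c("https://www.r-project.org/", "https://cran.r-project.org/")
  work_dir <- "~/crawlR_work"

  ## Crawl the seeds plus the links found on them (depth = 2), with
  ## modest concurrency and a polite per-host delay.
  crawlR(
    seeds       = seeds,
    work_dir    = work_dir,
    depth       = 2,
    max_concurr = 20,
    crawl_delay = 30
  )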

Usage

crawlR(
  seeds = NULL,
  work_dir = NULL,
  out_dir = NULL,
  max_concurr = 50,
  max_concurr_host = 1,
  timeout = Inf,
  timeout_request = 30,
  external_site = F,
  crawl_delay = 30,
  max_size = 1e+07,
  regExIn = NULL,
  regExOut = NULL,
  depth = 1,
  max_depth = 3,
  queue_scl = 1,
  topN = NULL,
  max_urls_per_host = 10,
  parser = crawlR:::parse_content,
  score_func = NULL,
  min_score = 0,
  log_file = NULL,
  seeds_only = F,
  readability_content = F,
  overwrite = F
)

Arguments

seeds

Seed URLs. If NULL, then work_dir must contain a linkDB. If additional seeds are provided after initial seeding, the new seed URLs will be added to the linkDB and fetched.

work_dir

(Required) Main working directory.

out_dir

Directory to store crawled and parsed HTML. If NULL, defaults to the work directory.

max_concurr

Max. total concurrent connections open at any given time.

max_concurr_host

Max. total concurrent connections per host at any given time.

timeout

Total time allowed per iteration (i.e., across all URLs fetched at a given depth).

timeout_request

Per-URL request timeout.

external_site

If TRUE, the crawler will follow external links.

crawl_delay

Minimum delay (in seconds) between calls to the same host. Only applies if a delay is not specified by the host's robots.txt.

max_size

Max size of file or webpage to download and parse.

regExIn

URLs matching this regular expression will be used.

regExOut

URLs matching this regular expression will be filtered out, even if they also match regExIn (see the example following these arguments).

depth

Crawl depth for this crawl. A value of 1 crawls only the seed pages, 2 also crawls links found on the seeds, and so on.

max_depth

Whereas the 'depth' argument determines the depth of the current crawl, 'max_depth' sets a maximum overall depth so that no link deeper than this value will be selected for crawling during the generate phase.

queue_scl

(Deprecated) max_concurr * queue_scl gives the queue size.

topN

Select the 'topN' links based on score for crawling.

max_urls_per_host

Maximum number of URLs from each host when creating the fetch list at each link depth.

parser

Parsing function to use for page content.

score_func

URL Scoring Function.

min_score

Minimum score a URL must have to be selected during the generate phase.

log_file

Name of log file. If NULL, writes to stdout().

seeds_only

If TRUE, only seeds will be pulled from the linkDB.

readability_content

If TRUE, process page content using readability.

overwrite

If TRUE, existing data for a URL will be overwritten in the linkDB.
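
As an illustration of the filtering arguments, the call below keeps the crawl within example.org and skips common binary assets (the regular expressions and limits are made-up examples, not defaults):

  ## Hypothetical filters: stay on *.example.org pages, skip binary files.
  crawlR(
    seeds             = "https://www.example.org/",
    work_dir          = "~/crawlR_work",
    regExIn           = "^https?://([^/]+\\.)?example\\.org/",
    regExOut          = "\\.(pdf|jpg|png|zip)$",
    max_urls_per_host = 5,
    topN              = 100,
    depth             = 2
  )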

Details

Each phase of the process is contained within a function of the crawlR package:

  1. injectR - Inject seeds into crawlDB.

  2. generateR - Generate fetch list from crawlDB.

  3. fetchR - Fetch links in fetch list.

  4. parseR - Parse fetched pages.

  5. updateR - Update crawlDB.

These can be called individually or using the all-in-one crawlR function.
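
A single iteration could also be run phase by phase, roughly as sketched below. The arguments passed to the phase functions are assumptions made for illustration; this page does not document their actual signatures:

  library(crawlR)

  work_dir <- "~/crawlR_work"

  ## Illustrative only: argument names are assumed, not documented here.
  injectR(seeds = "https://www.example.org/", work_dir = work_dir)  # 1. inject seeds into crawlDB
  generateR(work_dir = work_dir)                                    # 2. generate fetch list
  fetchR(work_dir = work_dir)                                       # 3. fetch pages to disk
  parseR(work_dir = work_dir)                                       # 4. parse fetched pages
  updateR(work_dir = work_dir)                                      # 5. update crawlDB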

