crawlR | R Documentation
Batch-based web crawler that uses the asynchronous features of R's curl package to crawl a list of user-supplied websites to a given depth.
Each iteration consists of injecting seeds (if given), generating a fetch list, fetching pages to disk, parsing the pages, and then updating the links in the crawlDB.
After initial seeding, subsequent iterations query the crawlDB to generate a fetch list. Additional seeds can be added at any time; re-seeding with previously given seeds will cause those seeds to be re-crawled.
crawlR(
  seeds = NULL,
  work_dir = NULL,
  out_dir = NULL,
  max_concurr = 50,
  max_concurr_host = 1,
  timeout = Inf,
  timeout_request = 30,
  external_site = F,
  crawl_delay = 30,
  max_size = 1e+07,
  regExIn = NULL,
  regExOut = NULL,
  depth = 1,
  max_depth = 3,
  queue_scl = 1,
  topN = NULL,
  max_urls_per_host = 10,
  parser = crawlR:::parse_content,
  score_func = NULL,
  min_score = 0,
  log_file = NULL,
  seeds_only = F,
  readability_content = F,
  overwrite = F
)
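A minimal first crawl might look like the following sketch; the seed URLs and directory paths are placeholders, and only arguments documented below are used.

library(crawlR)

crawlR(
  seeds    = c("https://example.com", "https://example.org"),  # placeholder seed URLs
  work_dir = "~/crawl/work",   # required working directory (holds the crawl state)
  out_dir  = "~/crawl/out",    # fetched and parsed pages are written here
  depth    = 2,                # crawl the seeds plus the links found on them
  log_file = "crawl.log"       # write progress to a file instead of stdout()
)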
seeds
    Seed URLs. If NULL, the work_dir must contain an existing linkDB. If additional seeds are provided after the initial seeding, the new seed URLs are added to the linkDB and fetched.
work_dir
    (Required) Main working directory.
out_dir
    Directory in which to store crawled and parsed HTML. If NULL, defaults to the working directory.
max_concurr
    Maximum total number of concurrent connections open at any given time.
max_concurr_host
    Maximum number of concurrent connections per host at any given time.
timeout
    Total time allowed per iteration (i.e., across all URLs at a given depth).
timeout_request
    Timeout for each individual URL request.
external_site
    If TRUE, the crawler will follow links to external sites.
crawl_delay
    Delay, in seconds, between requests to the same host. Only applies if a delay is not specified by the host's robots.txt.
max_size
    Maximum size of a file or web page to download and parse.
regExIn
    Only URLs matching this regular expression will be used (see the example after this argument list).
regExOut
    URLs matching this regular expression will be filtered out, even if they also match regExIn.
depth
    Crawl depth for this crawl. A value of 1 crawls only the seed pages, 2 also crawls the links found on the seeds, and so on.
max_depth
    Whereas 'depth' determines the depth of the current crawl, 'max_depth' sets a maximum overall depth: no link with a depth greater than this value will be selected for crawling during the generate phase.
queue_scl
    (Deprecated) max_concurr * queue_scl gives the queue size.
topN
    Select the 'topN' highest-scoring links for crawling.
max_urls_per_host
    Maximum number of URLs taken from each host when creating the fetch list at each link depth.
parser
    Parsing function applied to page content.
score_func
    URL scoring function (see the example after this argument list).
min_score
    Minimum score a URL must have to be selected during the generate phase.
log_file
    Name of the log file. If NULL, output is written to stdout().
seeds_only
    If TRUE, only seeds will be pulled from the linkDB.
readability_content
    If TRUE, process page content using readability.
overwrite
    If TRUE, existing data for a URL will be overwritten in the linkDB.
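The regular-expression filters and the scoring hooks can be combined as in the sketch below. The exact signature expected by score_func is not documented here; the sketch assumes it receives a character vector of URLs and returns one numeric score per URL, so treat score_by_path_depth as a hypothetical illustration.

# Hypothetical scorer: prefer URLs with shallow paths (assumed interface:
# character vector of URLs in, numeric vector of scores out).
score_by_path_depth <- function(urls) {
  path  <- sub("^https?://[^/]+", "", urls)         # strip scheme and host
  parts <- lengths(strsplit(path, "/", fixed = TRUE))
  1 / (1 + parts)                                   # shallower paths score higher
}

crawlR(
  seeds      = "https://example.com",   # placeholder seed
  work_dir   = "~/crawl/work",
  regExIn    = "example\\.com",         # keep only URLs on this host
  regExOut   = "\\.(pdf|zip)$",         # drop obvious binary downloads
  score_func = score_by_path_depth,
  min_score  = 0.1,                     # ignore low-scoring URLs during generate
  topN       = 200                      # crawl only the 200 best-scoring links
)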
Each phase of the process is contained in its own function of the crawlR package:
injectR - Inject seeds into the crawlDB.
generateR - Generate a fetch list from the crawlDB.
fetchR - Fetch the links in the fetch list.
parseR - Parse the fetched pages.
updateR - Update the crawlDB.
These can be called individually or via the all-in-one crawlR function.
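Because the crawl state lives in the work_dir, a crawl can be resumed or extended with further calls to crawlR: with seeds = NULL the fetch list is generated from the existing linkDB, and supplying new seeds later simply adds them (re-using previous seeds re-crawls them). Paths and URLs below are placeholders.

# Continue an existing crawl one more level deep, using the linkDB in work_dir.
crawlR(
  seeds     = NULL,
  work_dir  = "~/crawl/work",
  depth     = 1,
  max_depth = 3
)

# Add a new seed to the same crawl at a later time.
crawlR(
  seeds    = "https://example.net",
  work_dir = "~/crawl/work"
)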