crawlR | R Documentation
Batch-based web crawler that uses the asynchronous features of R's curl package to crawl a list of user-supplied websites to a given depth.
Each iteration consists of injecting seeds (if given), generating a fetch list, fetching pages to disk, parsing the fetched pages, and updating the links in the crawlDB.
After the initial seeding, subsequent iterations query the crawlDB to generate a fetch list. Additional seeds can be added at any time; re-seeding with previously given seeds will re-crawl those seeds.
crawlR(
seeds = NULL,
work_dir = NULL,
out_dir = NULL,
max_concurr = 50,
max_concurr_host = 1,
timeout = Inf,
timeout_request = 30,
external_site = F,
crawl_delay = 30,
max_size = 1e+07,
regExIn = NULL,
regExOut = NULL,
depth = 1,
max_depth = 3,
queue_scl = 1,
topN = NULL,
max_urls_per_host = 10,
parser = crawlR:::parse_content,
score_func = NULL,
min_score = 0,
log_file = NULL,
seeds_only = F,
readability_content = F,
overwrite = F
)
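As a sketch, a minimal first crawl using only the parameters shown in the usage above might look like this; the seed URLs and directory paths are placeholders to adjust for your own setup.
library(crawlR)

# Hypothetical seed list and directories.
seeds <- c("https://example.com", "https://example.org")

crawlR(
  seeds    = seeds,
  work_dir = "~/crawl/work",   # required; holds the crawlDB/linkDB
  out_dir  = "~/crawl/pages",  # fetched and parsed HTML; defaults to work_dir if NULL
  depth    = 2                 # crawl the seeds plus the links found on them
)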
seeds
Seed URLs. If NULL, the work_dir must contain a linkDB. If additional seeds are provided after the initial seeding, the new seed URLs will be added to the linkDB and fetched.
work_dir
(Required) Main working directory.
out_dir
Directory in which to store crawled and parsed HTML. If NULL, defaults to the working directory.
max_concurr
Maximum total concurrent connections open at any given time.
max_concurr_host
Maximum concurrent connections per host at any given time.
timeout
Total time allowed per iteration (i.e., for all URLs fetched at each depth).
timeout_request
Per-URL request timeout.
external_site
If TRUE, the crawler will follow external links.
crawl_delay
Delay (in seconds) between calls to the same host. Only applies if a delay is not specified by the host's robots.txt.
max_size
Maximum size of a file or webpage to download and parse.
regExIn
URLs matching this regular expression will be kept for crawling.
regExOut
URLs matching this regular expression will be filtered out, even if they also match regExIn.
depth
Crawl depth for this crawl. A value of 1 crawls only the seed pages, 2 also crawls links found on the seeds, and so on.
max_depth
Whereas 'depth' determines the depth of the current crawl, 'max_depth' sets a maximum overall depth: no link with a depth greater than this value will be selected for crawling during the generate phase.
queue_scl
(Deprecated) max_concurr * queue_scl gives the queue size.
topN
Select the 'topN' links, ranked by score, for crawling.
max_urls_per_host
Maximum number of URLs taken from each host when creating the fetch list at each link depth.
parser
Parsing function to use for page content.
score_func
URL scoring function (see the sketch following this list).
min_score
Minimum score a URL must have to be selected during the generate phase.
log_file
Name of log file. If NULL, writes to stdout().
seeds_only
If TRUE, only seeds will be pulled from the linkDB.
readability_content
If TRUE, process content using readability.
overwrite
If TRUE, existing data for a URL will be overwritten in the linkDB.
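The following sketch shows how the filtering and scoring parameters might be combined. The regular expressions and the scoring function are illustrative assumptions, not part of the package, and the exact signature crawlR expects for score_func is not documented here.
# Keep pages under example.com/news/ but skip PDFs.
# Hypothetical scoring function that prefers shorter URLs; the argument and
# return conventions are assumptions, not a documented crawlR API.
score_by_length <- function(urls) 1 / nchar(urls)

crawlR(
  seeds      = "https://example.com/news/",
  work_dir   = "~/crawl/work",
  regExIn    = "example\\.com/news/",  # only follow URLs matching this pattern
  regExOut   = "\\.pdf$",              # drop PDFs even if they match regExIn
  score_func = score_by_length,
  min_score  = 0,
  topN       = 100,                    # crawl the 100 highest-scoring links per iteration
  depth      = 2
)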
Each phase of the process is implemented by a function in the crawlR package:
injectR - Inject seeds into crawlDB.
generateR - Generate fetch list from crawlDB.
fetchR - Fetch links in fetch list.
parseR - Parse fetched pages.
updateR - Update crawlDB.
These can be called individually or using the all-in-one crawlR function.
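For example, a later run of the all-in-one function against the same work_dir can add new seeds or continue from the existing linkDB; the paths and URLs below are placeholders.
# Add a new seed to the existing crawlDB and crawl it.
# Repeating a previously given seed would re-crawl that seed.
crawlR(
  seeds    = "https://example.net",
  work_dir = "~/crawl/work",
  depth    = 1
)

# Or continue from the existing linkDB without new seeds.
crawlR(
  work_dir  = "~/crawl/work",
  depth     = 1,
  max_depth = 3
)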