
# crawlR: Web Crawler for R

## Description

A batch-based web crawler that uses the asynchronous features of R's `curl` package to crawl a user-supplied list of websites.

The basic process is:

1. Inject seeds into the linkDB.
2. Generate a fetch list from the linkDB.
3. Fetch the links.
4. Update the linkDB.
5. Repeat.

A conceptual sketch of this loop follows.
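The sketch below shows the shape of the batch loop only. `inject()`, `generate()`, `fetch_all()`, and `update_db()` are hypothetical stand-ins for crawlR's internal phases, not part of its actual API:

```r
# Conceptual sketch only -- inject(), generate(), fetch_all(), and
# update_db() are hypothetical stand-ins for crawlR's internal phases.
crawl_loop <- function(seeds, depth) {
  linkDB <- inject(seeds)                  # 1. seed the link database
  for (d in seq_len(depth)) {
    fetch_list <- generate(linkDB)         # 2. select URLs to fetch
    pages      <- fetch_all(fetch_list)    # 3. fetch asynchronously via curl
    linkDB     <- update_db(linkDB, pages) # 4. fold newly found links back in
  }                                        # 5. repeat until depth is reached
  linkDB
}
```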

## Usage

```r
crawlR(
    seeds = NULL,
    work_dir = NULL,
    out_dir = NULL,
    max_concurr = 50,
    max_concurr_host = 1,
    timeout = Inf,
    timeout_request = 30,
    external_site = FALSE,
    crawl_delay = 30,
    max_size = 10e6,
    regExIn = NULL,
    regExOut = NULL,
    depth = 1,
    max_depth = 3,
    queue_scl = 1,
    topN = NULL,
    max_urls_per_host = 10,
    parser = crawlR:::parse_content,
    score_func = NULL,
    min_score = 0.0,
    log_file = NULL,
    seeds_only = FALSE,
    crawl_int = NULL,
    readability_content = FALSE,
    overwrite = FALSE)
```

## Arguments

| Argument | Description |
|----------|-------------|
| `seeds` | Seed URLs. If `NULL`, then `work_dir` must contain a linkDB. If additional seeds are provided after initial seeding, the new seed URLs are added to the linkDB and fetched. |
| `work_dir` | (Required) Working directory to store results. |
| `out_dir` | Directory to store results. If `NULL`, defaults to the working directory. |
| `max_concurr` | Maximum total concurrent connections open at any given time. |
| `max_concurr_host` | Maximum concurrent connections per host at any given time. |
| `timeout` | Total time per iteration (i.e., for all URLs in the fetch list at each depth). |
| `timeout_request` | Per-URL timeout. |
| `external_site` | If `TRUE`, the crawler follows external links. |
| `crawl_delay` | Time (in seconds) between calls to the same host. Only applies if the delay is not specified by the host's robots.txt. |
| `max_size` | Maximum size of a file or webpage to download and parse. |
| `regExIn` | URLs matching this regular expression are kept. |
| `regExOut` | URLs matching this regular expression are filtered out, including URLs that match `regExIn`. |
| `depth` | Crawl depth for this crawl: a value of 1 only crawls the seed pages, 2 also crawls links found on the seeds, and so on. |
| `max_depth` | Whereas `depth` sets the depth of the current crawl, `max_depth` caps the overall depth: no link deeper than this value is selected for crawling during the generate phase. |
| `queue_scl` | (Deprecated) `max_concurr * queue_scl` gives the queue size. |
| `topN` | Number of top-ranked links to fetch per link-depth iteration. |
| `max_urls_per_host` | Maximum URLs from each host when creating the fetch list for each link depth. |
| `parser` | Parsing function to use. |
| `score_func` | URL scoring function. |
| `min_score` | Minimum URL score required during the generate phase. |
| `log_file` | Name of the log file. If `NULL`, writes to `stdout()`. |
| `seeds_only` | If `TRUE`, only seeds are pulled from the linkDB. |
| `readability_content` | Process content using the readability Python module. |
| `overwrite` | If `TRUE`, data for a URL is overwritten in the crawlDB. |
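For example, `score_func` together with `min_score` can bias the generate phase toward URLs you care about. Below is a minimal sketch, assuming the function receives a character vector of URLs and returns one numeric score per URL; verify the exact interface against the package source before relying on it:

```r
# Hypothetical scoring function: prefer article-like URLs.
# Assumes score_func is called with a character vector of URLs and must
# return one numeric score per URL -- check crawlR's source for the
# actual contract.
my_score <- function(urls) {
  score <- rep(0.5, length(urls))
  score[grepl("/news/|/article/", urls)] <- 1.0    # boost article paths
  score[grepl("\\.(jpg|png|gif)$", urls)] <- 0.0   # skip image links
  score
}

# Then pass it in, dropping anything below min_score:
# crawlR(..., score_func = my_score, min_score = 0.5)
```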

## Details

After each iteration of crawling, the crawled pages are read from disk, parsed, and written back to disk.
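A custom `parser` plugs into this parse step. Below is a minimal sketch using the `xml2` package, assuming the parser receives a page's HTML and returns the fields to persist; the actual contract is defined by `crawlR:::parse_content`, so inspect that function before writing your own:

```r
# Hypothetical custom parser -- the real interface is whatever
# crawlR:::parse_content implements; mirror its inputs and outputs.
library(xml2)

my_parser <- function(html) {
  doc <- read_html(html)
  list(
    title = xml_text(xml_find_first(doc, "//title")),
    text  = xml_text(xml_find_all(doc, "//p"))
  )
}

# crawlR(..., parser = my_parser)
```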

## Examples

```r
# Install the package
devtools::install_github("barob1n/crawlR")

# Create a seed list
seeds <- c("https://www.cnn.com", "https://www.npr.org")

# Create the crawlDB, inject seeds, and crawl.
crawlR(
  seeds = seeds, work_dir = "~/crawl", out_dir = "~/crawl/news/",
  max_concurr = 50, max_concurr_host = 5, timeout = Inf,
  external_site = FALSE, crawl_delay = 1, max_size = 4e6,
  regExOut = NULL, regExIn = NULL, depth = 1, queue_scl = 1,
  topN = 10, max_urls_per_host = 10, parser = crawlR::parse_content
)

# Crawl again, this time using filters.
filter_in  <- NULL
filter_out <- "sports|weather"

crawlR(
  seeds = NULL,                   # no seeds - will query crawlDB
  work_dir = "~/crawl/", out_dir = "~/crawl/news/",
  max_concurr = 50, max_concurr_host = 5, timeout = Inf,
  external_site = FALSE, crawl_delay = 1, max_size = 4e6,
  regExOut = filter_out,          # filter out URLs matching these
  regExIn = filter_in,            # URLs must match these
  depth = 1, queue_scl = 1, topN = 10, max_urls_per_host = 10,
  parser = crawlR::parse_content
)

# Run a third time, providing some new/additional seeds.
new_seeds <- c("https://ge.com", "https://www.ford.com")

crawlR(
  seeds = new_seeds,              # seeds will be added to crawlDB
  work_dir = "~/crawl/", out_dir = "~/crawl/auto/",
  max_concurr = 50, max_concurr_host = 5, timeout = Inf,
  external_site = FALSE, crawl_delay = 1, max_size = 4e6,
  regExOut = filter_out, regExIn = filter_in, depth = 1,
  queue_scl = 1, topN = 10, max_urls_per_host = 10,
  parser = crawlR::parse_content
)
```


