sparkler.crawl: Launch a crawl


View source: R/sparkler.R

Description

Launch a crawl

Usage

sparkler.crawl(vm, url, topUrls, topGroups, maxIter, debug = FALSE,
  mode = "default")

Arguments

vm

The Instance object

url

URL of the website to crawl.

topUrls

Number of URLs to fetch from each website (Sparkler's "top N").

topGroups

Number of hosts to fetch in parallel.

maxIter

Number of iterations to run.

debug

If TRUE, debug messages are printed.

mode

Delay between two fetch requests to the same host: "default" (1000 ms), "fast" (500 ms), or "turbo" (100 ms).
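The mode values listed above can be read as a simple lookup table. The following is a minimal sketch of that mapping; `mode_delay_ms` is a hypothetical helper written for illustration and is not part of RsparkleR.

```r
# Hypothetical helper (not part of RsparkleR): returns the delay in
# milliseconds that the documentation associates with each mode name.
mode_delay_ms <- function(mode = "default") {
  delays <- c(default = 1000, fast = 500, turbo = 100)
  if (!mode %in% names(delays)) {
    stop("Unknown mode: ", mode)
  }
  delays[[mode]]
}

mode_delay_ms("fast")
```

A lower delay speeds up the crawl but raises the chance that the target site throttles or blocks the requests.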

Details

The function first checks whether the Sparkler Docker container exists and is running:

- If it does not exist, the container is created with the "docker run" command.
- If it exists, it is restarted.

Next, "sparkler crawl" is used to inject the URL parameters into Sparkler and launch the crawl.
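The container check can be sketched as a small decision function. This is only an illustration of the logic described here: the function name, the container name "sparkler", and the idea of passing in a vector of existing container names are all assumptions, and the package's actual implementation (which works on the remote instance) may differ.

```r
# Hypothetical sketch (not the package's code): decide which Docker
# action to take, given the names of containers that already exist.
# The container name "sparkler" is an assumption for illustration.
sparkler_container_action <- function(existing_names, name = "sparkler") {
  if (name %in% existing_names) {
    "restart"  # container already exists: restart it
  } else {
    "run"      # no container yet: create it with "docker run"
  }
}

sparkler_container_action(c("redis", "sparkler"))
sparkler_container_action(character(0))
```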

Important: Sparkler deliberately slows down the crawl to avoid being blocked by the websites it fetches. topGroups is the number of hosts to fetch in parallel; topUrls (Sparkler's "top N") is the number of URLs in each website. By default, Sparkler tries for 256 groups and 1000 URLs in each group.
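To get a feel for how these parameters combine, the sketch below computes an upper bound on the number of URLs a crawl can fetch. It assumes that each iteration selects at most topGroups hosts and at most topUrls URLs per host; `crawl_budget` is a hypothetical helper, not part of RsparkleR.

```r
# Hypothetical helper (not part of RsparkleR): upper bound on fetched
# URLs, assuming each iteration selects at most topGroups hosts and at
# most topUrls URLs per host.
crawl_budget <- function(topUrls, topGroups, maxIter) {
  per_iteration <- topUrls * topGroups
  list(per_iteration = per_iteration, total = per_iteration * maxIter)
}

budget <- crawl_budget(topUrls = 1000, topGroups = 2, maxIter = 100)
budget$per_iteration
budget$total
```

With these values the bound is 2000 URLs per iteration and 200000 URLs overall; a real crawl usually fetches fewer, since it stops finding new URLs once the site is exhausted.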

Value

The crawl ID.

Examples

## Not run: 

library(RsparkleR)

ovh <- import_ovh()
client <- load_client(ovh, endpoint, application_key, application_secret, consumer_key)

sshPubKeyPath  <- 'C:/Users/vterrasi/.ssh/id_rsa.pub'
sshPrivKeyPath <- 'C:/Users/vterrasi/.ssh/id_rsa'

vm <- sparkler.create(client, regionVM="UK1", typeVM="s1-4", sshPubKeyPath, sshPrivKeyPath)

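# debug must be defined before it is passed to sparkler.start()
debug <- FALSE
sparkler.start(vm, debug)

# Replace the placeholders with your own website
url <- "https://www.YOUR WEBSITE.com"
pattern <- "www.YOUR WEBSITE.com"  # not used by sparkler.crawl()

topN <- 1000     # passed as topUrls: number of URLs in each website
maxIter <- 100   # number of iterations to run
topGroups <- 2   # number of hosts to fetch in parallel

crawlid <- sparkler.crawl(vm, url, topN, topGroups, maxIter, debug = debug, mode = "fast")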


## End(Not run)

voltek62/RsparkleR documentation built on May 19, 2019, 1:48 a.m.