RsparkleR
provides an R interface for launching virtual machines and deploying Sparkler as painlessly as possible, with a few lines of code from your local R session.
Sparkler (a contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in the distributed computing and information retrieval domains by combining various Apache projects such as Spark, Kafka, Lucene/Solr, Tika, and Felix.
See the full documentation on the Sparkler website.
Detailed instructions: https://data-seo.com/2017/12/17/advanced-r-programming-seo-crawler
Configure an OVH Cloud Project with billing: https://api.ovh.com/createToken/index.cgi?GET=/&POST=/&PUT=/&DELETE=/
Create your SSH keys and note their paths: sshPubKeyPath, sshPrivKeyPath
Set your regionVM (SBG3, BHS3, WAW1, UK1, DE1, GRA3)
SBG3 Datacenter is in France
GRA3 Datacenter is in France
Set your typeVM (s1-2, s1-4, ...) and SSH key. For the range of cloud servers, see: https://www.ovh.co.uk/public-cloud/instances/prices/
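Before creating the VM, it can help to check these settings locally. The sketch below is a hypothetical helper and not part of RsparkleR: the name checkVmSettings is illustrative, the region list is copied from the step above, and the key check only confirms that both files exist on disk.

```r
# Hypothetical pre-flight check; not part of RsparkleR.
# Regions copied from the list in the setup steps above.
knownRegions <- c("SBG3", "BHS3", "WAW1", "UK1", "DE1", "GRA3")

checkVmSettings <- function(regionVM, sshPubKeyPath, sshPrivKeyPath) {
  # Reject any region that is not in the documented list.
  if (!(regionVM %in% knownRegions)) {
    stop("Unknown regionVM '", regionVM, "'; expected one of: ",
         paste(knownRegions, collapse = ", "))
  }
  # Both key files must exist on the local machine.
  keyPaths <- c(sshPubKeyPath, sshPrivKeyPath)
  missingKeys <- keyPaths[!file.exists(keyPaths)]
  if (length(missingKeys) > 0) {
    stop("SSH key file(s) not found: ", paste(missingKeys, collapse = ", "))
  }
  invisible(TRUE)
}
```

Calling checkVmSettings('UK1', sshPubKeyPath, sshPrivKeyPath) stops with an error message if a setting is wrong, so a typo is caught before any VM is requested.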
Run:
library(RsparkleR)
ovh <- importOvh()
client <- loadClient(ovh,endpoint,application_key,application_secret,consumer_key)
vm <- createSparkler(client,regionVM='UK1',typeVM='s1-4',sshPubKeyPath,sshPrivKeyPath)
startSparkler(vm, prod=TRUE, debug=TRUE)
# Be patient the first time.
crawlid <- startCrawl(vm, url="https://data-seo.com", topUrls=100, topGroups=5, maxIter=2, debug=TRUE)
crawlDF <- readSolr(vm, pattern, crawlid, topUrls=100, extracted=TRUE)
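The loadClient call above uses endpoint, application_key, application_secret and consumer_key without defining them. One way to supply them without hard-coding secrets in a script is to read them from environment variables. This is a minimal sketch: the helper name readOvhCredentials and the OVH_* variable names are assumptions made for illustration, not something RsparkleR defines.

```r
# Hypothetical helper; the OVH_* variable names are illustrative assumptions.
readOvhCredentials <- function() {
  vars <- c(
    endpoint           = "OVH_ENDPOINT",
    application_key    = "OVH_APPLICATION_KEY",
    application_secret = "OVH_APPLICATION_SECRET",
    consumer_key       = "OVH_CONSUMER_KEY"
  )
  # Unset variables come back as empty strings.
  values <- Sys.getenv(vars, names = FALSE)
  names(values) <- names(vars)
  missingVars <- vars[!nzchar(values)]
  if (length(missingVars) > 0) {
    stop("Missing environment variable(s): ",
         paste(missingVars, collapse = ", "))
  }
  # Return a named list so fields can be accessed with `$`.
  as.list(values)
}

# Usage, following the argument order shown in the workflow above:
# creds  <- readOvhCredentials()
# client <- loadClient(ovh, creds$endpoint, creds$application_key,
#                      creds$application_secret, creds$consumer_key)
```

The variables could be set in a local .Renviron file or exported in the shell before starting R, which keeps the keys out of version control.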
harbor
will be published to CRAN, and will then become a dependency of this package.
GitHub version:
library(devtools)
install_github("voltek62/RsparkleR")
CRAN version:
Waiting...