README.md

RsparkleR

RsparkleR provides an R interface for launching virtual machines and deploying Sparkler as painlessly as possible, with a few lines from your local R session.

Sparkler (contraction of Spark-Crawler) is a new web crawler that makes use of recent advancements in distributed computing and information retrieval domains by conglomerating various Apache projects like Spark, Kafka, Lucene/Solr, Tika, and Felix.

See all documentation on the Sparkler website.

Creating a Sparkler Cluster

Detailed instructions: https://data-seo.com/2017/12/17/advanced-r-programming-seo-crawler

  1. Configure an OVH Cloud Project with billing and create your API credentials: https://api.ovh.com/createToken/index.cgi?GET=/&POST=/&PUT=/&DELETE=/
  2. Create your SSH keys and note their paths: sshPubKeyPath, sshPrivKeyPath
  3. Choose your regionVM (SBG3, BHS3, WAW1, UK1, DE1, GRA3):
     - SBG3: datacenter in France
     - BHS3: datacenter in Canada
     - WAW1: datacenter in Poland
     - UK1: datacenter in the UK
     - DE1: datacenter in Germany
     - GRA3: datacenter in France
  4. Choose your typeVM (s1-2, s1-4, ...) and SSH key. For the range of cloud servers, see https://www.ovh.co.uk/public-cloud/instances/prices/
  5. Run library(RsparkleR)
  6. Run ovh <- importOvh()
  7. Run client <- loadClient(ovh, endpoint, application_key, application_secret, consumer_key)
  8. Run vm <- createSparkler(client, regionVM='UK1', typeVM='s1-4', sshPubKeyPath, sshPrivKeyPath)
  9. Wait for the installation to finish. When your instance is ready, you get a vm object with its IP address, and port 22 is open.
  10. Deploy the Sparkler Docker container: run startSparkler(vm, prod=TRUE, debug=TRUE). Be patient the first time.
  11. Launch a crawl: crawlid <- startCrawl(vm, url="https://data-seo.com", topUrls=100, topGroups=5, maxIter=2, debug=TRUE)
  12. Get the results from Solr: crawlDF <- readSolr(vm, pattern, crawlid, topUrls=100, extracted=TRUE)
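
The steps above can be collected into a single R script. This is only a minimal sketch, not the package's documented workflow: the names ovh_regions, check_region and run_sparkler_crawl are hypothetical helpers introduced here for illustration, and every argument value (endpoint, keys, SSH key paths, pattern) is a placeholder you must supply yourself. The RsparkleR calls and their arguments are copied from the steps above. Nothing is sent to OVH until run_sparkler_crawl() is actually called.

```r
# Region codes and countries, taken from the region list above.
ovh_regions <- c(
  SBG3 = "France", BHS3 = "Canada", WAW1 = "Poland",
  UK1  = "UK",     DE1  = "Germany", GRA3 = "France"
)

# Hypothetical helper: fail early if regionVM is not one of the listed codes.
check_region <- function(regionVM) {
  if (!regionVM %in% names(ovh_regions)) {
    stop("Unknown regionVM '", regionVM, "'. Expected one of: ",
         paste(names(ovh_regions), collapse = ", "))
  }
  invisible(regionVM)
}

# Hypothetical wrapper that chains the calls listed in the steps above.
# All arguments are assumptions: pass your own credentials, key paths
# and Solr pattern.
run_sparkler_crawl <- function(endpoint, application_key, application_secret,
                               consumer_key, sshPubKeyPath, sshPrivKeyPath,
                               pattern, regionVM = "UK1", typeVM = "s1-4") {
  check_region(regionVM)
  library(RsparkleR)

  # Authenticate against the OVH API.
  ovh <- importOvh()
  client <- loadClient(ovh, endpoint, application_key,
                       application_secret, consumer_key)

  # Create the VM, then deploy Sparkler on it (slow on first run).
  vm <- createSparkler(client, regionVM = regionVM, typeVM = typeVM,
                       sshPubKeyPath, sshPrivKeyPath)
  startSparkler(vm, prod = TRUE, debug = TRUE)

  # Crawl, then return whatever readSolr() returns.
  crawlid <- startCrawl(vm, url = "https://data-seo.com", topUrls = 100,
                        topGroups = 5, maxIter = 2, debug = TRUE)
  readSolr(vm, pattern, crawlid, topUrls = 100, extracted = TRUE)
}
```

Calling check_region("XX9") stops with an error before any VM is created, which avoids paying for an instance in the wrong place. The value returned by run_sparkler_crawl() corresponds to crawlDF in the last step above.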

Thanks to

Install

GitHub

library(devtools)
install_github("voltek62/RsparkleR")

CRAN version:

Not yet available.



voltek62/RsparkleR documentation built on May 19, 2019, 1:48 a.m.