sparkworker provides support for executing arbitrary distributed R code. As with any other sparklyr extension, load sparkworker and sparklyr, then connect to Apache Spark:

library(sparkworker)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.1.0")
iris_tbl <- sdf_copy_to(sc, iris)
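
As a quick sanity check, assuming dplyr is available, src_tbls() lists the tables registered with the connection and should include the copied table:

library(dplyr)
src_tbls(sc)  # should list "iris"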

To execute arbitrary functions, use spark_apply as follows:

spark_apply(iris_tbl, function(row) {
  # shift Petal_Width by gamma-distributed noise
  row$Petal_Width <- row$Petal_Width + rgamma(1, 2)
  row
})
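
spark_apply returns another Spark data frame, so the transformed rows can be brought back into R with dplyr's collect(). A minimal sketch, assuming dplyr is attached:

library(dplyr)

iris_tbl %>%
  spark_apply(function(row) {
    # same closure as above
    row$Petal_Width <- row$Petal_Width + rgamma(1, 2)
    row
  }) %>%
  collect()

Note that copying iris to Spark sanitizes its column names, which is why the closure refers to Petal_Width rather than iris's original Petal.Width.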

We can estimate π with a Monte Carlo simulation using dplyr and spark_apply as follows:

library(dplyr)

sdf_len(sc, 10000) %>%
  spark_apply(function(row) sum(runif(2, min = -1, max = 1) ^ 2) < 1) %>%
  filter(id) %>% count() %>% collect() * 4 / 10000
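
This works because each draw picks a point uniformly from the square [-1, 1] x [-1, 1], and such a point lands inside the unit circle with probability π/4, so multiplying the observed fraction of hits by 4 approximates π. A purely local sketch of the same estimate, with no Spark involved:

n <- 10000
hits <- replicate(n, sum(runif(2, min = -1, max = 1) ^ 2) < 1)
sum(hits) * 4 / n  # approximately 3.14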

Notice that spark_log shows sparklyr performing the following operations:

  1. The Gateway receives a request to execute a custom RDD of type WorkerRDD.
  2. The WorkerRDD is evaluated on the worker node, which initializes a new sparklyr backend, tracked as Worker in the logs.
  3. The backend launches an RScript process that connects back to the backend, retrieves the data, applies the closure, and updates the result.

spark_log(sc, filter = "sparklyr:", n = 30)

Finally, we disconnect:

spark_disconnect(sc)

