sparkworker provides support for executing arbitrary distributed R code. As with any other sparklyr extension, load sparkworker and sparklyr, then connect to Apache Spark:

library(sparkworker)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.1.0")
iris_tbl <- sdf_copy_to(sc, iris)
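
As a quick sanity check, assuming dplyr is available, src_tbls() lists the tables registered with the connection and should include the copied table:

library(dplyr)
src_tbls(sc)  # should list "iris"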

To execute arbitrary functions, use spark_apply as follows:

spark_apply(iris_tbl, function(row) {
  # shift Petal_Width by gamma-distributed noise
  row$Petal_Width <- row$Petal_Width + rgamma(1, 2)
  row
})
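
spark_apply returns another Spark data frame, so the transformed rows can be brought back into R with dplyr's collect(). A minimal sketch, assuming dplyr is attached:

library(dplyr)

iris_tbl %>%
  spark_apply(function(row) {
    # same closure as above
    row$Petal_Width <- row$Petal_Width + rgamma(1, 2)
    row
  }) %>%
  collect()

Note that copying iris to Spark sanitizes its column names, which is why the closure refers to Petal_Width rather than iris's original Petal.Width.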

We can estimate π with a Monte Carlo simulation using dplyr and spark_apply as follows:

library(dplyr)

sdf_len(sc, 10000) %>%
  spark_apply(function(row) sum(runif(2, min = -1, max = 1) ^ 2) < 1) %>%
  filter(id) %>% count() %>% collect() * 4 / 10000
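
This works because each draw picks a point uniformly from the square [-1, 1] x [-1, 1], and such a point lands inside the unit circle with probability π/4, so multiplying the observed fraction of hits by 4 approximates π. A purely local sketch of the same estimate, with no Spark involved:

n <- 10000
hits <- replicate(n, sum(runif(2, min = -1, max = 1) ^ 2) < 1)
sum(hits) * 4 / n  # approximately 3.14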

Notice that spark_log shows sparklyr performing the following operations:

  1. The Gateway receives a request to execute a custom RDD of type WorkerRDD.
  2. The WorkerRDD is evaluated on the worker node, which initializes a new sparklyr backend, tracked as Worker in the logs.
  3. The backend launches an RScript process that connects back to the backend, retrieves the data, applies the closure, and updates the result.

spark_log(sc, filter = "sparklyr:", n = 30)

Finally, we disconnect:

spark_disconnect(sc)

