`sparkworker` provides support to execute arbitrary distributed R code. As with any other `sparklyr` extension, load `sparkworker` and `sparklyr`, then connect to Apache Spark:
```r
library(sparkworker)
library(sparklyr)

sc <- spark_connect(master = "local", version = "2.1.0")
iris_tbl <- sdf_copy_to(sc, iris)
```
To execute arbitrary functions, use `spark_apply` as follows:
```r
spark_apply(iris_tbl, function(row) {
  row$Petal_Width <- row$Petal_Width + rgamma(1, 2)
  row
})
```
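Since `spark_apply` returns a Spark data frame, the result can be brought back into the local R session like any other `sparklyr` table. A minimal sketch, assuming the connection and `iris_tbl` from above:

```r
library(dplyr)

# Bring the transformed rows back into the local R session for inspection;
# spark_apply returns a Spark data frame, so the usual dplyr verbs apply.
transformed <- spark_apply(iris_tbl, function(row) {
  row$Petal_Width <- row$Petal_Width + rgamma(1, 2)
  row
})
transformed %>% collect() %>% head()
```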
We can calculate π using `dplyr` and `spark_apply` as follows: each of the 10,000 rows samples a random point in the square [-1, 1] × [-1, 1] and tests whether it falls inside the unit circle. Since this happens with probability π/4, four times the observed fraction estimates π:
```r
library(dplyr)

sdf_len(sc, 10000) %>%
  spark_apply(function() sum(runif(2, min = -1, max = 1) ^ 2) < 1) %>%
  filter(id) %>%
  count() %>%
  collect() * 4 / 10000
```
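For reference, the same estimator can be run in plain R without Spark. This is a minimal sketch using only base R, with the sample size chosen to match the example above:

```r
# Monte Carlo estimate of pi, mirroring the Spark example above: sample 10,000
# points uniformly in the square [-1, 1] x [-1, 1] and count the fraction that
# lands inside the unit circle, which approximates pi / 4.
n <- 10000
inside <- replicate(n, sum(runif(2, min = -1, max = 1) ^ 2) < 1)
sum(inside) * 4 / n
```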
Notice that `spark_log` shows `sparklyr` performing the following operations:

- The `Gateway` receives a request to execute a custom RDD of type `WorkerRDD`.
- The `WorkerRDD` is evaluated on the worker node, which initializes a new `sparklyr` backend tracked as `Worker` in the logs.
- An `RScript` process connects back to the backend, retrieves the data, applies the closure, and updates the result.

```r
spark_log(sc, filter = "sparklyr:", n = 30)
```
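To focus on the worker backend entries specifically, the same `filter` argument can be pointed at the `Worker` tag; this sketch assumes that tag appears verbatim in the relevant log lines, as described above:

```r
# Narrow the log output to entries written by the worker backend; assumes the
# "Worker" tag described above appears verbatim in each relevant log line.
spark_log(sc, filter = "Worker", n = 10)
```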
Finally, we disconnect:
```r
spark_disconnect(sc)
```