This is a proof-of-concept extension package for sparklyr that demonstrates how to create an R front end for a Spark package (in this case, Sparkling Water from H2O).

This package implements only the most basic functionality (creating an H2OContext, showing the H2O Flow interface, and converting a Spark DataFrame to an H2O Frame). Note that the package won't be developed further since it's just a demonstration.

Connecting to Spark

First we connect to Spark. The call to library(sparklingwater) will make the H2O functions available on the R search path and will also ensure that the dependencies required by the Sparkling Water package are included when we connect to Spark.

library(sparklyr)
library(sparklingwater)
sc <- spark_connect(master = "local")
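
How can loading a package affect a later connection? sparklyr extension packages conventionally define a spark_dependencies() function and call register_extension() when they are loaded. The following is a minimal sketch of that convention, not the actual source of sparklingwater. The Maven coordinates are a placeholder assumption and would need to be replaced with the Sparkling Water artifact matching your Spark and Scala versions.

```r
# Sketch of the sparklyr extension convention. These two functions would
# live inside the extension package's own R/ directory.

# sparklyr calls this when a connection is opened to find out which Spark
# packages the extension needs.
spark_dependencies <- function(spark_version, scala_version, ...) {
  sparklyr::spark_dependency(
    packages = c(
      # PLACEHOLDER coordinates (assumption): substitute the real
      # Sparkling Water artifact and version.
      sprintf("com.example:placeholder-spark-package_%s:0.0.1", scala_version)
    )
  )
}

# Register the extension when the package is loaded, so that later
# spark_connect() calls pick up the dependencies declared above.
.onLoad <- function(libname, pkgname) {
  sparklyr::register_extension(pkgname)
}
```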

H2O Context and Flow

The call to library(sparklingwater) automatically registered the Sparkling Water extension, which in turn specified that the Sparkling Water Spark package should be made available for Spark connections. Let's inspect the H2OContext for our Spark connection:

h2o_context(sc)

We can also view the H2O Flow web UI:

h2o_flow(sc)
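
As a rough illustration of what a wrapper like h2o_context() might do under the hood, the sketch below uses sparklyr's invoke_static() to call a static method on the JVM. It assumes the Sparkling Water Scala API exposes org.apache.spark.h2o.H2OContext.getOrCreate() taking a SparkContext, which may differ between Sparkling Water releases. The name get_h2o_context is hypothetical and is not a function of this package.

```r
library(sparklyr)

# Hypothetical helper (not part of sparklingwater): obtain an H2OContext
# from the JVM behind a sparklyr connection.
get_h2o_context <- function(sc) {
  invoke_static(
    sc,
    # Assumed class and method names; check them against the Sparkling
    # Water version you have installed.
    "org.apache.spark.h2o.H2OContext",
    "getOrCreate",
    spark_context(sc)  # the underlying SparkContext as a JVM object reference
  )
}
```

Calling get_h2o_context(sc) on an open connection would return a reference to a JVM object rather than an R value.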

H2O with Spark DataFrames

Let's copy the mtcars dataset to Spark so we can access it from Sparkling Water:

library(dplyr)
mtcars_tbl <- copy_to(sc, mtcars, overwrite = TRUE)
mtcars_tbl

The use case we'd like to enable is calling the H2O algorithms and feature transformers directly on Spark DataFrames that we've manipulated with dplyr. The Sparkling Water package does support this. Here, though, we'll just convert the Spark DataFrame into an H2O Frame to show that the conversion works:

mtcars_hf <- h2o_frame(mtcars_tbl)
mtcars_hf
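
The dplyr use case mentioned above could look like the following sketch. It assumes that h2o_frame() accepts the result of a dplyr pipeline in the same way it accepts mtcars_tbl, and that sc and mtcars_tbl still exist as created earlier. The filter threshold and column choice are arbitrary.

```r
library(sparklyr)
library(sparklingwater)
library(dplyr)

# Assumes `mtcars_tbl` is the Spark DataFrame created with copy_to() earlier.
# Filter and select on the Spark side with dplyr, then hand the result
# to Sparkling Water as an H2O Frame.
efficient_hf <- mtcars_tbl %>%
  filter(mpg > 20) %>%        # keep the more fuel-efficient cars
  select(mpg, cyl, wt) %>%    # keep only the columns of interest
  h2o_frame()

efficient_hf
```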

Now we disconnect from Spark. This also stops the H2OContext, since it is owned by the Spark shell process that backs our Spark connection:

spark_disconnect(sc)
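
Because the H2OContext lives and dies with the connection, it can be convenient to tie both to a single scope. The sketch below is one possible way to do that with on.exit(). The helper with_local_spark is hypothetical and not part of sparklyr or sparklingwater.

```r
library(sparklyr)
library(sparklingwater)

# Hypothetical helper: run `f` against a fresh local connection and always
# disconnect afterwards, even if `f` raises an error.
with_local_spark <- function(f) {
  sc <- spark_connect(master = "local")
  on.exit(spark_disconnect(sc), add = TRUE)
  f(sc)
}

# Example: inspect the H2OContext; the connection is closed on return.
with_local_spark(function(sc) print(h2o_context(sc)))
```

Any JVM references returned from f would no longer be usable after the helper returns, since the backing process has been shut down.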


jjallaire/sparklingwater documentation built on May 19, 2019, 11:38 a.m.