spark_apply: Apply an R UDF in Spark
In danzafar/tidyspark: A Tidy Interface to Spark

Description Usage Arguments Details Value Examples

Applies an R User-Defined Function (UDF) to each partition of a spark_tbl.

1	spark_apply(.data, .f, schema)

`.data`	a `spark_tbl`
`.f`	a function or formula to be applied to each partition of the `spark_tbl`. Can be an anonymous function e.g. `~ head(. 10)` `.f` should have only one parameter, to which a R data.frame corresponds to each partition will be passed. The output of func should be an R data.frame.
`schema`	The schema of the resulting SparkDataFrame after the function is applied. It must match the output of func. Since Spark 2.3, the DDL-formatted string is also supported for the schema.

spark_apply is a re-implementation of SparkR::dapply. Importantly, spark_apply (and SparkR::dapply) will scan the function being passed and automatically broadcast any values from the .GlobalEnv that are being referenced. Functions from dplyr are always availiable by default.

a spark_tbl

## Not run: 
iris_tbl <- spark_tbl(iris)

# note, my_var will be broadcasted if we include it in the function
my_var <- 1

iris_tbl %>%
  spark_apply(function(.df) head(.df, my_var),
            schema(iris_tbl)) %>%
  collect

# but if you want to use a library (other than dplyr), you need to load it
# in the UDF
iris_tbl %>%
  spark_apply(function(.df) {
    require(purrr)
    .df %>%
      map_df(first)
  }, schema(iris_tbl)) %>%
  collect

# filter and add a column:
df <- spark_tbl(
  data.frame(a = c(1L, 2L, 3L),
             b = c(1, 2, 3),
             c = c("1","2","3"))
)

schema <- StructType(StructField("a", "integer"),
                     StructField("b", "double"),
                     StructField("c", "string"),
                     StructField("add", "integer"))

df %>%
  spark_apply(function(x) {
    x %>%
      filter(a > 1) %>%
      mutate(add = a + 1L)
  },
  schema) %>%
  collect

# The schema also can be specified in a DDL-formatted string.
schema <- "a INT, d DOUBLE, c STRING, add INT"
df %>%
  spark_apply(function(x) {
    x %>%
      filter(a > 1) %>%
      mutate(add = a + 1L)
  },
  schema) %>%
  collect

## End(Not run)