spark_apply {sparklyr}    R Documentation
Applies an R function to a Spark object (typically, a Spark DataFrame).
spark_apply(
  x,
  f,
  columns = NULL,
  memory = TRUE,
  group_by = NULL,
  packages = NULL,
  context = NULL,
  name = NULL,
  barrier = NULL,
  fetch_result_as_sdf = TRUE,
  partition_index_param = "",
  arrow_max_records_per_batch = NULL,
  auto_deps = FALSE,
  ...
)
x
An object (usually a spark_tbl) coercible to a Spark DataFrame.
f
A function that transforms a data frame partition into a data frame. The function f has signature f(df, context, group1, group2, ...), where df is a data frame with the data to be processed, context is the optional object passed as the context parameter, and group1 to groupN contain the values of the group_by columns. When group_by is not specified, f takes only one argument.
Can also be an rlang anonymous function. For example, ~ .x + 1 defines an expression that adds one to the given .x data frame.
columns
A vector of column names, or a named vector of column types, for the transformed object. When not specified, a sample of 10 rows is taken to infer the output columns automatically; to avoid this performance penalty, specify the column types explicitly (see the sketch following this arguments list). The sample size is configurable via the sparklyr.apply.schema.infer configuration option.
memory
Boolean; should the table be cached into memory?
group_by
Column name used to group by data frame partitions.
packages
Boolean to distribute .libPaths() packages to each node, a list of packages to distribute, or a package bundle created with spark_apply_bundle().
Defaults to TRUE, or to the sparklyr.apply.packages value set in spark_config().
For clusters using Yarn cluster mode, packages can point to a package bundle created using spark_apply_bundle() and made available as a Spark file using config$sparklyr.shell.files.
For offline clusters where available.packages() is not available, manually download the packages database from https://cran.r-project.org/web/packages/packages.rds and set Sys.setenv(sparklyr.apply.packagesdb = "<path-to-rds>"); otherwise, all packages will be used by default.
For clusters where the required R packages are already installed on every worker node, the spark.r.libpaths config entry can be set in spark_config() to the local packages library.
context
Optional object to be serialized and passed back to f().
name
Optional table name while registering the resulting data frame.
barrier
Optional logical; set to TRUE to run under the scheduler's Barrier Execution Mode.
fetch_result_as_sdf
Whether to return the transformed results in a Spark DataFrame (defaults to TRUE). When set to FALSE, results are returned as a list of R objects instead.
NOTE: fetch_result_as_sdf must be set to FALSE when the transformation function returns R objects that cannot be stored in a Spark DataFrame (e.g., complex numbers).
partition_index_param
Optional; if non-empty, then f also receives the index of the partition being processed as a named argument with this name, in addition to all other argument(s) it receives.
arrow_max_records_per_batch
Maximum size of each Arrow record batch; ignored if Arrow serialization is not enabled.
auto_deps
[Experimental] Whether to infer all required R packages by examining the closure f() (defaults to FALSE).
...
Optional arguments; currently unused.
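To make the roles of context, columns, and group_by concrete, here is a minimal sketch; a local connection is assumed for illustration, and iris_tbl is a hypothetical table created here via sdf_copy_to():

library(sparklyr)
sc <- spark_connect(master = "local")

# `context` is serialized and passed back to f() as its second argument
sdf_len(sc, 4) %>%
  spark_apply(function(df, context) df * context, context = 100)

# declaring `columns` up front skips the 10-row schema-inference sample
sdf_len(sc, 10) %>%
  spark_apply(function(df) df * 10, columns = c(id = "double"))

# with `group_by`, f() is invoked once per group of rows
iris_tbl <- sdf_copy_to(sc, iris, overwrite = TRUE)
iris_tbl %>%
  spark_apply(function(df) nrow(df), group_by = "Species")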
spark_config() settings can be specified to change the workers' environment. For instance, to set additional environment variables for each worker node, use the sparklyr.apply.env.* config; to launch workers without --vanilla, set sparklyr.apply.options.vanilla to FALSE; to run a custom script before launching Rscript, use sparklyr.apply.options.rscript.before.
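As a hedged sketch of how those settings could be supplied (MY_ENV_VAR and /path/to/setup.sh are illustrative placeholders, not defaults):

library(sparklyr)

config <- spark_config()

# additional environment variable for every worker (sparklyr.apply.env.* prefix)
config$sparklyr.apply.env.MY_ENV_VAR <- "some-value"

# launch worker R sessions without the --vanilla flag
config$sparklyr.apply.options.vanilla <- FALSE

# run a custom script before each Rscript worker is launched
config$sparklyr.apply.options.rscript.before <- "/path/to/setup.sh"

sc <- spark_connect(master = "local", config = config)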
## Not run:
library(sparklyr)
sc <- spark_connect(master = "local[3]")

# create a Spark data frame with 10 elements, then multiply each by 10 in R
sdf_len(sc, 10) %>% spark_apply(function(df) df * 10)

# using barrier mode
sdf_len(sc, 3, repartition = 3) %>%
  spark_apply(nrow, barrier = TRUE, columns = c(id = "integer")) %>%
  collect()

## End(Not run)
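In the same vein, a hedged sketch of fetch_result_as_sdf and partition_index_param, following the pattern of the examples above:

## Not run:
# return raw R objects (e.g., complex numbers) instead of a Spark DataFrame
sdf_len(sc, 3, repartition = 3) %>%
  spark_apply(
    function(df) complex(real = df$id, imaginary = 1),
    fetch_result_as_sdf = FALSE
  )

# receive the partition index as a named argument called "partition"
sdf_len(sc, 3, repartition = 3) %>%
  spark_apply(
    function(df, partition) cbind(df, partition),
    partition_index_param = "partition"
  ) %>%
  collect()
## End(Not run)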