In rstudio-conf-2020/big-data: Content and setup files for the Big Data with R class

eval_pipe <- FALSE
if(Sys.getenv("GLOBAL_EVAL") != "") eval_pipe <- Sys.getenv("GLOBAL_EVAL")

Spark Pipelines

library(sparklyr)
library(dplyr)

Build an Estimator (plan)

Create a simple estimator that transforms data and fits a model

Use the spark_lineitems variable to create a new aggregation by order_id. Summarize the total sales and number of items r spark_lineitems %>% mutate(price = as.double(price)) %>% group_by(order_id) %>% summarise(total_sales = sum(price, na.rm = TRUE), no_items = n())
Assign the code to a new variable called orders r orders <- spark_lineitems %>% mutate(price = as.double(price)) %>% group_by(order_id) %>% summarise(total_sales = sum(price, na.rm = TRUE), no_items = n())
Start a new code chunk, with calling ml_pipeline(sc) r ml_pipeline(sc)
Pipe the ml_pipeline() code into a ft_dplyr_transfomer() call. Use the orders variable for its argument r ml_pipeline(sc) %>% ft_dplyr_transformer(orders)
Add an ft_binarizer() step that determines if the total sale is above $50. Name the new variable above_50 r ml_pipeline(sc) %>% ft_dplyr_transformer(orders) %>% ft_binarizer("total_sales", "above_50", 50)
Using the ft_r_formula, add a step that sets the model's formula to: above_50 ~ no_items r ml_pipeline(sc) %>% ft_dplyr_transformer(orders) %>% ft_binarizer("total_sales", "above_50", 50) %>% ft_r_formula(above_50 ~ no_items)
Finalize the pipeline by adding a ml_logistic_regression() step, no arguments are needed r ml_pipeline(sc) %>% ft_dplyr_transformer(orders) %>% ft_binarizer("total_sales", "above_50", 50) %>% ft_r_formula(above_50 ~ no_items) %>% ml_logistic_regression()
Assign the code to a new variable called orders_plan r orders_plan <- ml_pipeline(sc) %>% ft_dplyr_transformer(orders) %>% ft_binarizer("total_sales", "above_50", 50) %>% ft_r_formula(above_50 ~ no_items) %>% ml_logistic_regression()
Call orders_plan to confirm that all of the steps are present r orders_plan

Build a Transformer (fit)

Execute the planned changes to obtain a new model

Use ml_fit() to execute the changes in order_plan using the spark_lineitems data. Assign to a new variable called orders_fit r orders_fit <- ml_fit(orders_plan, spark_lineitems)
Call orders_fit to see the print-out of the newly fitted model r orders_fit

Predictions using Spark Pipelines

Overview of how to use a fitted pipeline to run predictions

Use ml_transform() in order to use the orders_fit model to run predictions over spark_lineitems r orders_preds <- ml_transform(orders_fit, spark_lineitems)
With count(), compare the results from above_50 against the predictions, the variable created by ml_transform() is called prediction
r orders_preds %>% count(above_50, prediction)

Save the pipeline objects

Overview of how to save the Estimator and the Transformer

Use ml_save() to save order_plan in a new folder called "saved_model" r ml_save(orders_plan, "saved_model", overwrite = TRUE)
Navigate to the "saved_model" folder to inspect its contents
Use ml_save() to save orders_fit in a new folder called "saved_pipeline" r ml_save(orders_fit, "saved_pipeline", overwrite = TRUE)
Navigate to the "saved_pipeline" folder to inspect its contents

rstudio-conf-2020/big-data documentation built on Feb. 4, 2020, 5:24 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

rstudio-conf-2020/big-data
Content and setup files for the Big Data with R class

In rstudio-conf-2020/big-data: Content and setup files for the Big Data with R class

Spark Pipelines

Build an Estimator (plan)

Build a Transformer (fit)

Predictions using Spark Pipelines

Save the pipeline objects

R Package Documentation

Browse R Packages

We want your feedback!

rstudio-conf-2020/big-data Content and setup files for the Big Data with R class

In rstudio-conf-2020/big-data: Content and setup files for the Big Data with R class

Spark Pipelines

Build an Estimator (plan)

Build a Transformer (fit)

Predictions using Spark Pipelines

Save the pipeline objects

R Package Documentation

Browse R Packages

We want your feedback!

rstudio-conf-2020/big-data
Content and setup files for the Big Data with R class