In rstudio-conf-2020/big-data: Content and setup files for the Big Data with R class

eval_pipe <- FALSE
if(Sys.getenv("GLOBAL_EVAL") != "") eval_pipe <- Sys.getenv("GLOBAL_EVAL")

Spark Pipelines

library(sparklyr)
library(dplyr)

Create a simple estimator that transforms data and fits a model

Use the spark_lineitems variable to create a new aggregation by order_id. Summarize the total sales and the number of items.
```r

```
Assign the code to a new variable called orders: `orders <-`
Start a new code chunk by calling ml_pipeline(sc): `ml_pipeline(sc)`
Pipe the ml_pipeline() code into an ft_dplyr_transformer() call. Use the orders variable as its argument.
```r
ml_pipeline(sc) %>%

```
Add an ft_binarizer() step that determines whether the total sale is above $50. Name the new variable above_50: `ml_pipeline(sc) %>%`
Using ft_r_formula(), add a step that sets the model's formula to above_50 ~ no_items: `ml_pipeline(sc) %>%`
Finalize the pipeline by adding an ml_logistic_regression() step; no arguments are needed: `ml_pipeline(sc) %>%`
Assign the code to a new variable called orders_plan: `orders_plan <- ml_pipeline(sc) %>%`
Call orders_plan to confirm that all of the steps are present: `orders_plan`
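
The steps above can be sketched end to end as follows. This is a minimal sketch under stated assumptions, not the class solution: it assumes a local Spark installation, and it builds a small stand-in for spark_lineitems whose price column is hypothetical, since the real table's columns are not shown in this section.

```r
library(sparklyr)
library(dplyr)

# Assumption: a local Spark installation; the class provides its own connection.
sc <- spark_connect(master = "local")

# Stand-in for the class's spark_lineitems table; the price column is hypothetical.
spark_lineitems <- copy_to(
  sc,
  data.frame(
    order_id = c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 7, 7, 8),
    price    = c(30, 25, 10, 12, 40, 35, 8, 20, 20, 20, 15, 10, 10, 55)
  ),
  "lineitems",
  overwrite = TRUE
)

# Lazy aggregation: one row per order with its total sales and item count.
orders <- spark_lineitems %>%
  group_by(order_id) %>%
  summarise(
    total_sales = sum(price, na.rm = TRUE),
    no_items    = n()
  )

# The pipeline only records the steps; nothing is fitted at this point.
orders_plan <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(orders) %>%
  ft_binarizer(input_col = "total_sales", output_col = "above_50", threshold = 50) %>%
  ft_r_formula(above_50 ~ no_items) %>%
  ml_logistic_regression()

orders_plan
```

Printing orders_plan should list the four stages in the order they were added, which is a quick way to check that no step was dropped.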

Execute the planned changes to obtain a new model

Use ml_fit() to execute the changes in orders_plan using the spark_lineitems data. Assign the result to a new variable called orders_fit: `orders_fit <-`
Call orders_fit to see the printout of the newly fitted model: `orders_fit`

Overview of how to use a fitted pipeline to run predictions

Use ml_transform() to run predictions over spark_lineitems with the orders_fit model: `orders_preds <- ml_transform(orders_fit, spark_lineitems)`
With count(), compare the values of above_50 against the predictions. The variable created by ml_transform() is called prediction.
```r

```
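
One way the fit and prediction steps might look is sketched below. It assumes a local Spark installation and a toy stand-in for spark_lineitems with a hypothetical price column, so the counts it produces say nothing about the class data.

```r
library(sparklyr)
library(dplyr)

# Assumptions: local Spark, and a toy spark_lineitems with a hypothetical price column.
sc <- spark_connect(master = "local")
spark_lineitems <- copy_to(
  sc,
  data.frame(
    order_id = c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 7, 7, 8),
    price    = c(30, 25, 10, 12, 40, 35, 8, 20, 20, 20, 15, 10, 10, 55)
  ),
  "lineitems",
  overwrite = TRUE
)

orders <- spark_lineitems %>%
  group_by(order_id) %>%
  summarise(total_sales = sum(price, na.rm = TRUE), no_items = n())

orders_plan <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(orders) %>%
  ft_binarizer(input_col = "total_sales", output_col = "above_50", threshold = 50) %>%
  ft_r_formula(above_50 ~ no_items) %>%
  ml_logistic_regression()

# ml_fit() runs every stage against the data and returns a fitted pipeline model.
orders_fit <- ml_fit(orders_plan, spark_lineitems)

# ml_transform() applies the fitted stages and appends a prediction column.
orders_preds <- ml_transform(orders_fit, spark_lineitems)

# Cross-tabulate the actual label against the predicted one.
orders_preds %>%
  count(above_50, prediction)
```

Rows where above_50 and prediction differ are the misclassified orders, so the result reads like a small confusion matrix.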

Overview of how to save the Estimator and the Transformer

Use ml_save() to save orders_plan in a new folder called "saved_model".
```r

```
Navigate to the "saved_model" folder to inspect its contents
Use ml_save() to save orders_fit in a new folder called "saved_pipeline".
```r

```
Navigate to the "saved_pipeline" folder to inspect its contents
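
The two save steps could be sketched as below, again assuming a local Spark installation and a toy stand-in for spark_lineitems with a hypothetical price column. The overwrite = TRUE argument is an addition for convenience when re-running the code; the exercise itself does not ask for it.

```r
library(sparklyr)
library(dplyr)

# Assumptions: local Spark, and a toy spark_lineitems with a hypothetical price column.
sc <- spark_connect(master = "local")
spark_lineitems <- copy_to(
  sc,
  data.frame(
    order_id = c(1, 1, 1, 2, 3, 3, 4, 5, 5, 5, 6, 7, 7, 8),
    price    = c(30, 25, 10, 12, 40, 35, 8, 20, 20, 20, 15, 10, 10, 55)
  ),
  "lineitems",
  overwrite = TRUE
)

orders <- spark_lineitems %>%
  group_by(order_id) %>%
  summarise(total_sales = sum(price, na.rm = TRUE), no_items = n())

orders_plan <- ml_pipeline(sc) %>%
  ft_dplyr_transformer(orders) %>%
  ft_binarizer(input_col = "total_sales", output_col = "above_50", threshold = 50) %>%
  ft_r_formula(above_50 ~ no_items) %>%
  ml_logistic_regression()

orders_fit <- ml_fit(orders_plan, spark_lineitems)

# Save the unfitted pipeline (the Estimator) and the fitted one (the Transformer).
ml_save(orders_plan, "saved_model", overwrite = TRUE)
ml_save(orders_fit, "saved_pipeline", overwrite = TRUE)

# Inspect what Spark wrote into each folder.
list.files("saved_model", recursive = TRUE)
list.files("saved_pipeline", recursive = TRUE)
```

Either folder can later be read back with ml_load(sc, path), for example ml_load(sc, "saved_pipeline") to reuse the fitted model in a new session.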

rstudio-conf-2020/big-data documentation built on Feb. 4, 2020, 5:24 p.m.
