```r
eval_caching <- FALSE
if (Sys.getenv("GLOBAL_EVAL") != "") eval_caching <- as.logical(Sys.getenv("GLOBAL_EVAL"))
```

# Spark data caching

```r
library(sparklyr)
library(dplyr)
library(readr)
library(purrr)    
```

## Map data

Review the mechanics of how Spark uses files as a data source.

  1. Examine the contents of the /usr/share/class/files folder (see the sketch after this list for a console equivalent)

  2. Load the sparklyr library

     ```r
     library(sparklyr)
     ```

  3. Use spark_connect() to create a new local Spark session

     ```r
     sc <- spark_connect(master = "local")
     ```

  4. Load the readr and purrr libraries

     ```r
     library(readr)
     library(purrr)
     ```

  5. Read the top 5 rows of the transactions_1 CSV file

     ```r
     top_rows <- read_csv("/usr/share/class/files/transactions_1.csv", n_max = 5)
     ```

  6. Create a named list based on the column names, assigning "character" as each element's value. Name the variable file_columns

     ```r
     file_columns <- top_rows %>%
       rename_all(tolower) %>%
       map(function(x) "character")
     ```

  7. Preview the contents of the file_columns variable

     ```r
     head(file_columns)
     ```

  8. Use spark_read_csv() to "map" the file's structure and location to the Spark context. Assign it to the spark_lineitems variable

     ```r
     spark_lineitems <- spark_read_csv(
       sc,
       name = "orders",
       path = "/usr/share/class/files",
       memory = FALSE,
       columns = file_columns,
       infer_schema = FALSE
     )
     ```

  9. In the Connections pane, click the table icon next to the orders table (the sketch after this list shows a console equivalent)

  10. Verify that the new variable pointer works by using tally()

      ```r
      spark_lineitems %>% tally()
      ```
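Steps 1 and 9 are easier to follow with a console equivalent. A minimal sketch, assuming the class files live under /usr/share/class/files as above and that sc is the connection created in step 3:

```r
# Step 1: list the CSV files that spark_read_csv() will map
list.files("/usr/share/class/files", pattern = "\\.csv$")

# Step 9: the registered table ("orders") also shows up here,
# mirroring what the Connections pane displays
DBI::dbListTables(sc)
```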

## Caching data

Learn how to cache a subset of the data in Spark

  1. Create a subset of the orders table object. Summarize by date, creating the total sale price and the number of items sold

     ```r
     daily_orders <- spark_lineitems %>%
       mutate(price = as.double(price)) %>%
       group_by(date) %>%
       summarise(
         total_sales = sum(price, na.rm = TRUE),
         no_items = n()
       )
     ```

  2. Use compute() to extract the data into Spark memory

     ```r
     cached_orders <- compute(daily_orders, "daily")
     ```

  3. Confirm that the new variable pointer works

     ```r
     head(cached_orders)
     ```

  4. Go to the Spark UI (see the sketch after this list to open it from R)

  5. Click the Storage button

  6. Notice that "daily" is now cached in Spark memory
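If the Spark UI is not already open, sparklyr can launch it from the session itself. A minimal sketch, assuming sc is still the connection created in step 3 of the previous section:

```r
# Opens the Spark web UI in a browser; the Storage tab lists
# in-memory tables such as the "daily" table created by compute()
spark_web(sc)
```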


