```r
eval_caching <- FALSE
if (Sys.getenv("GLOBAL_EVAL") != "") eval_caching <- as.logical(Sys.getenv("GLOBAL_EVAL"))
```

# Spark data caching

```r
library(sparklyr)
library(dplyr)
library(readr)
library(purrr)    
```

## Map data

Review the mechanics of how Spark uses files as a data source.

  1. Examine the contents of the /usr/share/class/files folder (see the sketch after this list for a console equivalent)

  2. Load the sparklyr library

     ```r
     library(sparklyr)
     ```

  3. Use spark_connect() to create a new local Spark session

     ```r
     sc <- spark_connect(master = "local")
     ```

  4. Load the readr and purrr libraries

     ```r
     library(readr)
     library(purrr)
     ```

  5. Read the top 5 rows of the transactions_1 CSV file

     ```r
     top_rows <- read_csv("/usr/share/class/files/transactions_1.csv", n_max = 5)
     ```

  6. Create a named list based on the column names, assigning "character" as each element's value. Name the variable file_columns

     ```r
     file_columns <- top_rows %>%
       rename_all(tolower) %>%
       map(function(x) "character")
     ```

  7. Preview the contents of the file_columns variable

     ```r
     head(file_columns)
     ```

  8. Use spark_read_csv() to "map" the file's structure and location to the Spark context. Assign it to the spark_lineitems variable

     ```r
     spark_lineitems <- spark_read_csv(
       sc,
       name = "orders",
       path = "/usr/share/class/files",
       memory = FALSE,
       columns = file_columns,
       infer_schema = FALSE
     )
     ```

  9. In the Connections pane, click the table icon next to the orders table (the sketch after this list shows a console equivalent)

  10. Verify that the new variable pointer works by using tally()

      ```r
      spark_lineitems %>% tally()
      ```
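Steps 1 and 9 are easier to follow with a console equivalent. A minimal sketch, assuming the class files live under /usr/share/class/files as above and that sc is the connection created in step 3:

```r
# Step 1: list the CSV files that spark_read_csv() will map
list.files("/usr/share/class/files", pattern = "\\.csv$")

# Step 9: the registered table ("orders") also shows up here,
# mirroring what the Connections pane displays
DBI::dbListTables(sc)
```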

## Caching data

Learn how to cache a subset of the data in Spark

  1. Create a subset of the orders table object. Summarize by date, creating the total sale price and the number of items sold

     ```r
     daily_orders <- spark_lineitems %>%
       mutate(price = as.double(price)) %>%
       group_by(date) %>%
       summarise(
         total_sales = sum(price, na.rm = TRUE),
         no_items = n()
       )
     ```

  2. Use compute() to extract the data into Spark memory

     ```r
     cached_orders <- compute(daily_orders, "daily")
     ```

  3. Confirm that the new variable pointer works

     ```r
     head(cached_orders)
     ```

  4. Go to the Spark UI (see the sketch after this list to open it from R)

  5. Click the Storage button

  6. Notice that "daily" is now cached in Spark memory
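If the Spark UI is not already open, sparklyr can launch it from the session itself. A minimal sketch, assuming sc is still the connection created in step 3 of the previous section:

```r
# Opens the Spark web UI in a browser; the Storage tab lists
# in-memory tables such as the "daily" table created by compute()
spark_web(sc)
```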


