eval_dtplyr <- FALSE
if(Sys.getenv("GLOBAL_EVAL") != "") eval_dtplyr <- Sys.getenv("GLOBAL_EVAL")
library(data.table)
library(dtplyr)
library(dplyr)
library(lobstr)
library(fs)
library(purrr)

Introduction to dtplyr

dtplyr basics

Load data into R via data.table, and then wrap it with dtplyr

  1. Load the data.table, dplyr, dtplyr, purrr and fs libraries

    ```r
    library(data.table)
    library(dplyr)
    library(dtplyr)
    library(purrr)
    library(fs)
    ```

  2. Read the transactions data from the CSV files in the /usr/share/class/files folder. Use the fread() function on each file, and rbindlist() to combine the results into a variable called transactions

    ```r
    transactions <- dir_ls("/usr/share/class/files", glob = "*.csv") %>%
      map(fread) %>%
      rbindlist()
    ```

  3. Preview the data using glimpse()

    ```r

    ```

  4. Use lazy_dt() to "wrap" the transactions variable into a new variable called dt_transactions

    ```r

    ```

  5. View the structure of dt_transactions with glimpse()

    ```r

    ```
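The wrapping step above can be sketched on a small in-memory table. This is a minimal sketch, assuming a toy data.table in place of the CSV files: the rows are invented for illustration, and only the column names (date_month, date_day, price) are borrowed from the exercises.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Toy stand-in for the workshop data (assumption: values are made up)
transactions <- data.table(
  date_month = c(1L, 1L, 2L, 2L),
  date_day   = c(1L, 2L, 1L, 2L),
  price      = c(10, 20, 30, 40)
)

# Wrap the data.table; no computation happens at this point
dt_transactions <- lazy_dt(transactions)

# The wrapper is a dtplyr step object, not a data.table
class(dt_transactions)

# Compare the structure of the raw table and the wrapper
glimpse(transactions)
glimpse(dt_transactions)
```

The wrapper only records where the data lives and which operations to apply later, which is why its class differs from that of transactions.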

Object sizes

Confirm that dtplyr is not making copies of the original data.table

  1. Load the lobstr library

    ```r
    library(lobstr)
    ```

  2. Use obj_size() to obtain the size of transactions in memory

    ```r

    ```

  3. Use obj_size() to obtain the size of dt_transactions in memory

    ```r

    ```

  4. Use obj_size() to obtain the combined size of dt_transactions and transactions in memory

    ```r

    ```
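The size comparison above can be sketched as follows. This is an illustrative sketch that assumes a synthetic table (the id column and random prices are invented) rather than the workshop data. The idea is that obj_size() counts shared memory only once when given several objects, so a combined size well below the sum of the individual sizes suggests that no copy was made.

```r
library(data.table)
library(dtplyr)
library(lobstr)

# Synthetic table (assumption: contents are random, for sizing only)
set.seed(1)
transactions <- data.table(id = 1:100000, price = runif(100000))
dt_transactions <- lazy_dt(transactions)

size_dt   <- obj_size(transactions)
size_lazy <- obj_size(dt_transactions)

# Passing both objects counts any shared memory a single time
size_both <- obj_size(transactions, dt_transactions)

size_dt
size_lazy
size_both

# TRUE when the two objects share memory instead of duplicating it
as.numeric(size_both) < as.numeric(size_dt) + as.numeric(size_lazy)
```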

How dtplyr works

An under-the-hood view of how dtplyr operates on data.table objects

  1. Use dplyr verbs on top of dt_transactions to obtain the total sales by month

    ```r
    dt_transactions %>%
      group_by(date_month) %>%
      summarise(total_sales = sum(price))
    ```

  2. Assign the result of the code above to a variable called by_month

    ```r

    ```

  3. Use show_query() to see the data.table code that by_month actually runs

    ```r

    ```

  4. Use glimpse() to see that by_month does not modify the data; it only adds steps that data.table will execute later

    ```r

    ```

  5. Create a new column using mutate()

    ```r
    dt_transactions %>%
      mutate(new_field = price / 2)
    ```

  6. Use show_query() to see the copy() command being used

    ```r

    ```

  7. Confirm that the new column did not persist in dt_transactions

    ```r

    ```

  8. Use lazy_dt() with the immutable argument set to FALSE to avoid the copy

    ```r
    m_transactions <- lazy_dt(copy(transactions), immutable = FALSE)
    ```

    ```r
    m_transactions
    ```

  9. Create a new_field column in m_transactions using mutate()

    ```r
    m_transactions %>%
      mutate(new_field = price / 2)
    ```

  10. Use show_query() to see that copy() is no longer being used

    ```r

    ```

  11. Inspect m_transactions to see that new_field has persisted

    ```r

    ```
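The difference between the default wrapper and one created with immutable = FALSE can be sketched on a toy table. This is a minimal sketch under stated assumptions: the rows are invented, and mutable_source is a hypothetical helper name, kept only so the modified table can be inspected directly afterwards.

```r
library(data.table)
library(dplyr)
library(dtplyr)

# Toy stand-in for the workshop data (assumption: values are made up)
transactions <- data.table(
  date_month = c(1L, 1L, 2L),
  price      = c(10, 20, 30)
)
dt_transactions <- lazy_dt(transactions)

# Building the pipeline only records steps; show_query() prints the
# data.table code that would run
by_month <- dt_transactions %>%
  group_by(date_month) %>%
  summarise(total_sales = sum(price))
show_query(by_month)

# Default wrapper: the generated code works on a copy of the source
half_price <- dt_transactions %>% mutate(new_field = price / 2)
show_query(half_price)
invisible(as.data.table(half_price))   # force execution
"new_field" %in% names(transactions)   # the source keeps its columns

# immutable = FALSE: the generated code modifies the wrapped table in place.
# mutable_source (hypothetical name) holds a copy so transactions stays intact.
mutable_source <- copy(transactions)
m_transactions <- lazy_dt(mutable_source, immutable = FALSE)
m_half_price <- m_transactions %>% mutate(new_field = price / 2)
show_query(m_half_price)
invisible(as.data.table(m_half_price)) # force execution
"new_field" %in% names(mutable_source) # the new column was added by reference
```

Wrapping a copy() of the source, as the exercise does, keeps the in-place behaviour away from the original transactions table.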

Working with dtplyr

Learn data conversion and basic visualization techniques

  1. Use as_tibble() to convert the results of by_month into a tibble

    ```r
    by_month %>%
      as_tibble()
    ```

  2. Load the ggplot2 library

    ```r
    library(ggplot2)
    ```

  3. Use as_tibble() to convert before creating a line plot

    ```r
    by_month %>%

      ggplot() +
      geom_line(aes(date_month, total_sales))
    ```
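The conversion step matters because ggplot() expects a data frame rather than a lazy dtplyr object. A minimal sketch, assuming a toy table with invented values, shows the collect-then-plot pattern; monthly_sales and sales_plot are illustrative names, not part of the exercises.

```r
library(data.table)
library(dplyr)
library(dtplyr)
library(ggplot2)

# Toy stand-in for the workshop data (assumption: values are made up)
transactions <- data.table(
  date_month = c(1L, 1L, 2L, 2L, 3L),
  price      = c(10, 20, 30, 40, 50)
)

by_month <- lazy_dt(transactions) %>%
  group_by(date_month) %>%
  summarise(total_sales = sum(price))

# as_tibble() runs the data.table code and returns an ordinary tibble
monthly_sales <- by_month %>% as_tibble()

# The collected tibble can be handed to ggplot() directly
sales_plot <- ggplot(monthly_sales) +
  geom_line(aes(date_month, total_sales))
```

Printing sales_plot draws the chart; the aggregation itself ran in data.table, and only the three summary rows reached ggplot2.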

Pivot data

Review a simple way to aggregate data faster, and then pivot it as a tibble

  1. Load the tidyr library

    ```r
    library(tidyr)
    ```

  2. Group dt_transactions by date_month and date_day, then aggregate price into total_sales

    ```r
    dt_transactions %>%
      group_by(date_month, date_day) %>%
      summarise(total_sales = sum(price))
    ```

  3. Copy the aggregation code above, collect it into a tibble, and then use pivot_wider() to make the date_day values the column headers.

    ```r
    dt_transactions %>%
      group_by(date_month, date_day) %>%
      summarise(total_sales = sum(price)) %>%

      pivot_wider(names_from = date_day, values_from = total_sales)
    ```
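The aggregate-then-pivot pattern can be sketched on a toy table. This is an illustrative sketch with invented rows; daily_sales and wide_sales are hypothetical names used only here.

```r
library(data.table)
library(dplyr)
library(dtplyr)
library(tidyr)

# Toy stand-in for the workshop data (assumption: values are made up)
transactions <- data.table(
  date_month = c(1L, 1L, 2L, 2L),
  date_day   = c(1L, 2L, 1L, 2L),
  price      = c(10, 20, 30, 40)
)

# Aggregate with data.table through dtplyr, then collect into a tibble
daily_sales <- lazy_dt(transactions) %>%
  group_by(date_month, date_day) %>%
  summarise(total_sales = sum(price)) %>%
  as_tibble()

# Pivot the small, already aggregated result: one column per date_day value
wide_sales <- daily_sales %>%
  pivot_wider(names_from = date_day, values_from = total_sales)

wide_sales
```

Collecting after the aggregation means that only the summarised rows are converted to a tibble, so the pivot operates on a much smaller table than the raw data.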



rstudio-conf-2020/big-data documentation built on Feb. 4, 2020, 5:24 p.m.