
Introduction to dtplyr

dtplyr basics

Load data into R via data.table, and then wrap it with dtplyr

  1. Load the data.table, purrr and fs libraries

    library(data.table)
    library(purrr)
    library(fs)

  2. Read the transactions.csv file from the {{files}} folder. The code lists every CSV file in that folder, reads each one with fread(), and combines the results with rbindlist() into a variable called transactions

    transactions <- dir_ls("{{files}}", glob = "*.csv") %>%
      map(fread) %>%
      rbindlist()

  3. Preview the data using str()

    str(transactions)

  4. Load the dplyr and dtplyr libraries

    library(dplyr)
    library(dtplyr)

  5. Use lazy_dt() to "wrap" the transactions variable in a new variable called dt_transactions

    dt_transactions <- lazy_dt(transactions)

  6. View the dt_transactions variable's structure with str()

    str(dt_transactions)
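The loading pattern from step 2 can be tried without the course files. The sketch below is only an illustration: it writes two small, made-up CSV files to a temporary folder, then reads and stacks them the same way. The folder name, file names, column values and the demo_transactions variable are all hypothetical and are not the real transactions data.

```r
library(data.table)
library(purrr)
library(fs)

# Hypothetical stand-in for the {{files}} folder: a temporary directory
demo_dir <- path(tempdir(), "dtplyr_demo")
dir_create(demo_dir)

# Two made-up CSV files that share the same columns
fwrite(data.table(date_month = 1L, price = c(10, 20)), path(demo_dir, "part1.csv"))
fwrite(data.table(date_month = 2L, price = c(30, 40, 50)), path(demo_dir, "part2.csv"))

# List every CSV file, read each one with fread(), then stack the results
demo_transactions <- dir_ls(demo_dir, glob = "*.csv") %>%
  map(fread) %>%
  rbindlist()

str(demo_transactions)
```

Because map() returns one data.table per file, rbindlist() is what turns the list into a single table.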

Object sizes

Confirm that dtplyr is not making copies of the original data.table

  1. Load the lobstr library

    library(lobstr)

  2. Use obj_size() to obtain the size of transactions in memory

    obj_size(transactions)

  3. Use obj_size() to obtain the size of dt_transactions in memory

    obj_size(dt_transactions)

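  4. Use obj_size() to obtain the combined size in memory of transactions and dt_transactions. If dtplyr made no copy, the combined size is close to the size of transactions alone, not the sum of the two

    obj_size(transactions, dt_transactions)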
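The comparison in the steps above can be reproduced on generated data. This is a minimal sketch, assuming a made-up table called toy (its columns and size are hypothetical); it relies on obj_size() counting memory shared between its arguments only once.

```r
library(data.table)
library(dtplyr)
library(lobstr)

# Hypothetical toy data standing in for transactions
toy <- data.table(id = seq_len(1e5), price = runif(1e5))
dt_toy <- lazy_dt(toy)

size_data <- obj_size(toy)          # the data.table on its own
size_lazy <- obj_size(dt_toy)       # the wrapper, which points at the same data
size_both <- obj_size(toy, dt_toy)  # shared memory is counted only once

size_data
size_lazy
size_both
```

If the wrapper held its own copy of the data, size_both would be roughly the sum of the other two values instead of being close to each of them.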

How dtplyr works

An under-the-hood view of how dtplyr operates on data.table objects

  1. Use dplyr verbs on top of dt_transactions to obtain the total sales by month

    dt_transactions %>%
      group_by(date_month) %>%
      summarise(total_sales = sum(price))

  2. Save the above code in a variable called by_month

    by_month <- dt_transactions %>%
      group_by(date_month) %>%
      summarise(total_sales = sum(price))

  3. Use show_query() to see the data.table code that by_month actually runs

    show_query(by_month)

  4. Use str() to see that by_month does not modify the data; it only records steps that data.table will run later

    str(by_month)
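To make the lazy behaviour concrete, the sketch below repeats the aggregation on a small made-up table and compares the result with the same aggregation written by hand in data.table. The toy table, its values and the result_* names are hypothetical; the hand-written query is one plausible equivalent, not necessarily the exact call that show_query() prints.

```r
library(data.table)
library(dtplyr)
library(dplyr)

# Hypothetical toy data with the two columns used in the steps above
toy <- data.table(
  date_month = c(1L, 1L, 2L, 2L),
  price      = c(10, 20, 30, 40)
)

by_month_toy <- lazy_dt(toy) %>%
  group_by(date_month) %>%
  summarise(total_sales = sum(price))

# Nothing has been computed yet; this shows the data.table call that would run
show_query(by_month_toy)

# Converting to a data.table forces the computation
result_dtplyr <- as.data.table(by_month_toy)

# The same aggregation written directly in data.table
result_manual <- toy[, .(total_sales = sum(price)), keyby = date_month]
```

Both results should hold the same totals, which is the point of dtplyr: dplyr syntax on the outside, data.table execution on the inside.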

Working with dtplyr

Learn data conversion and basic visualization techniques

  1. Use as_tibble() to convert the results of by_month into a tibble

    by_month %>%
      as_tibble()

  2. Load the ggplot2 library

    library(ggplot2)

  3. Use as_tibble() to convert before creating a line plot

    by_month %>%
      as_tibble() %>%
      ggplot() +
      geom_line(aes(date_month, total_sales))

Pivot data

Review a simple way to aggregate data faster, and then pivot it as a tibble

  1. Load the tidyr library

    library(tidyr)

  2. Group dt_transactions by date_month and date_day, then aggregate price into total_sales

    dt_transactions %>%
      group_by(date_month, date_day) %>%
      summarise(total_sales = sum(price))

  3. Copy the aggregation code above, collect the result into a tibble, and then use pivot_wider() to turn the date_day values into column headers

    dt_transactions %>%
      group_by(date_month, date_day) %>%
      summarise(total_sales = sum(price)) %>%
      as_tibble() %>%
      pivot_wider(names_from = date_day, values_from = total_sales)
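The aggregate-then-pivot pattern is easier to follow on a table small enough to check by hand. The sketch below assumes a made-up toy table with two months and two days; all values and the wide_toy name are hypothetical.

```r
library(data.table)
library(dtplyr)
library(dplyr)
library(tidyr)

# Hypothetical toy data: two months, two days, six sales
toy <- data.table(
  date_month = c(1L, 1L, 1L, 2L, 2L, 2L),
  date_day   = c(1L, 2L, 2L, 1L, 1L, 2L),
  price      = c(5, 10, 15, 20, 25, 30)
)

wide_toy <- lazy_dt(toy) %>%
  group_by(date_month, date_day) %>%
  summarise(total_sales = sum(price)) %>%  # aggregated by data.table
  as_tibble() %>%                          # small result leaves data.table here
  pivot_wider(names_from = date_day, values_from = total_sales)

wide_toy
```

The heavy step (the aggregation) runs in data.table on the full data, while pivot_wider() only sees the already small summary, with one row per month and one column per day.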

The mutate() verb

See how dtplyr creates a copy of the original data.table object so that mutate() works the same way it does in dplyr, that is, without modifying the input

  1. Use mutate() and show_query() to see the copy() command being used

    dt_transactions %>%
      mutate(new_field = price / 2) %>%
      show_query()

  2. Use lazy_dt() with the immutable argument set to FALSE to avoid the copy

    lazy_dt(transactions, immutable = FALSE) %>%
      mutate(new_field = price / 2) %>%
      show_query()
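Avoiding the copy has a side effect worth seeing once: the original data.table itself is changed when the query runs. The sketch below is an illustration on a made-up three-row table (toy and the two has_field_* flags are hypothetical names); it checks whether new_field appears in the original table after each variant is collected.

```r
library(data.table)
library(dtplyr)
library(dplyr)

# Hypothetical toy data with a single price column
toy <- data.table(price = c(10, 20, 30))

# Default (immutable = TRUE): the query works on a copy
lazy_dt(toy) %>%
  mutate(new_field = price / 2) %>%
  as_tibble()
has_field_after_copy <- "new_field" %in% names(toy)

# immutable = FALSE: the query is allowed to modify toy by reference
lazy_dt(toy, immutable = FALSE) %>%
  mutate(new_field = price / 2) %>%
  as_tibble()
has_field_after_inplace <- "new_field" %in% names(toy)

has_field_after_copy
has_field_after_inplace
```

Skipping the copy saves memory and time on large tables, so it is a reasonable choice when the original table is not needed in its unmodified form afterwards.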



edgararuiz/bigdataclass documentation built on Jan. 3, 2020, 6:46 p.m.