eval_vroom <- FALSE
if(Sys.getenv("GLOBAL_EVAL") != "") eval_vroom <- Sys.getenv("GLOBAL_EVAL")
library(vroom)
library(fs)
library(purrr)
library(dplyr)

Introduction to vroom

vroom basics

Load data into R using vroom

  1. Load the vroom() library r library(vroom)

  2. Use the vroom() function to read the transactions_1.csv file from the /usr/share/class/files folder r vroom("/usr/share/class/files/transactions_1.csv")

  3. Use the id argument to add the file name to the data frame. Use file_name as the argument's value r vroom("/usr/share/class/files/transactions_1.csv", id = "file_name")

  4. Load the prior command into a variable called vr_transactions ```r vr_transactions <- vroom("/usr/share/class/files/transactions_1.csv", id = "file_name")

    vr_transactions ```

  5. Load the file spec into a variable called vr_spec, using the spec() command ```r vr_spec <- spec(vr_transactions)

    vr_spec ```

Load multiple files

  1. Load the fs and dplyr libraries r library(fs) library(dplyr)

  2. List files in the /usr/share/class/files folder using the dir_ls() function r dir_ls("/usr/share/class/files")

  3. In the dir_ls() function, use the glob argument to pass a wildcard to list CSV files only. Load to a variable named files r files <- dir_ls("/usr/share/class/files", glob = "*.csv")

  4. Pass the files variable to vroom. Set the n_max argument to 1,000 to limit the data load for now r vroom(files, n_max = 1000)

  5. Add a col_types argument with vr_specs as its value r vroom(files, n_max = 1000, col_types = vr_spec)

  6. Use the col_select argument to pass a list object containing the following variables: order_id, date, customer_name, and price r vroom(files, n_max = 1000, col_types = vr_spec, col_select = list(order_id, date, customer_name, price) )

Load and modify multiple files

For files that are too large to have in memory, keep a summarization

  1. Use a for() loop to print the content of each vector inside files r for(i in seq_along(files)) { print(files[i]) }

  2. Switch the print() command with the vroom command, using the same arguments, except the file name. Use the files variable. Load the results into a variable called transactions. ```r for(i in seq_along(files)) { transactions <- vroom(files[i], n_max = 1000, col_types = vr_spec, col_select = list(order_id, date, customer_name, price))

    } ```

  3. Group transactions by order_id and get the total of price and the number of records. Name them total_sales and no_items respectively. Name the new variable orders r for(i in seq_along(files)) { transactions <- vroom(files[i], n_max = 1000, col_types = vr_spec, col_select = list(order_id, date, customer_name, price)) orders <- transactions %>% group_by(order_id) %>% summarise(total_sales = sum(price), no_items = n()) }

  4. Define the orders variable as NULL prior to the for loop and add a bind_rows() step to orders to preserve each summarized view. r orders <- NULL for(i in seq_along(files)) { transactions <- vroom(files[i], n_max = 1000, col_types = vr_spec, col_select = list(order_id, date, customer_name, price)) orders <- transactions %>% group_by(order_id) %>% summarise(total_sales = sum(price), no_items = n()) %>% bind_rows(orders) }

  5. Remove the transactions variable at the end of each cycle
    r orders <- NULL for(i in seq_along(files)) { transactions <- vroom(files[i], n_max = 1000, col_types = vr_spec, col_select = list(order_id, date, customer_name, price)) orders <- transactions %>% group_by(order_id) %>% summarise(total_sales = sum(price), no_items = n()) %>% bind_rows(orders) rm(transactions) }

  6. Preview the orders variable r orders



rstudio-conf-2020/big-data documentation built on Feb. 4, 2020, 5:24 p.m.