eval_adv <- FALSE
if(Sys.getenv("GLOBAL_EVAL") != "") eval_adv <- Sys.getenv("GLOBAL_EVAL")
library(connections)
library(dplyr)
library(rlang)
library(config)
library(dbplyr)
library(purrr)

Advanced Operations

Simple wrapper function

  1. Load the connections and dplyr libraries r library(connections) library(dplyr) library(dbplyr) library(config)

  2. Use connection_open() to open a Database connection r con <- connection_open( RPostgres::Postgres(), host = "localhost", user = get("user"), password = get("pwd"), port = 5432, dbname = "postgres", bigint = "integer" )

  3. Create a variable that points to the v_orders table r orders <- tbl(con, in_schema("retail", "v_orders"))

  4. Create a simple dplyr call that gets the average of all order totals r orders %>% summarise(mean = mean(order_total, na.rm = TRUE))

  5. Load the rlang library r library(rlang)

  6. Create a new function call my_mean() that will take an argument, x, and then returns the results of enquo(x) r my_mean <- function(x){ enquo(x) }

  7. Test the new function. It should return the same variable name, but inside quosure. Use order_total as its argument's value to test r my_mean(order_total)

  8. In the function, re-assign x to the result of enquo(x), and then return x r my_mean <- function(x){ x <- enquo(x) x }

  9. Test the same way again, the output should match to what it was as before r my_mean(order_total)

  10. Remove the last line that has x, add the contents of the function with the initial dplyr code from step 3. Then replace order_total with !! x r my_mean <- function(x){ x <- enquo(x) orders %>% summarise(mean = mean(!! x, na.rm = TRUE)) }

  11. Test the new function by passing order_total as x r my_mean(order_total)

  12. In the summarise() step, replace the name mean, with !! as_label(x), also replace the = sign, with := r my_mean <- function(x){ x <- enquo(x) orders %>% summarise(!! as_label(x) := mean(!! x, na.rm = TRUE)) }

  13. Run the function again, the name of the column should match the argument value r my_mean(order_total)

  14. Test the function by passing a formula, such as order_total / order_qty r my_mean(order_total / order_qty)

  15. Make the function generic, add a new argument called: .data. Inisde the function, replace orders with .data r my_mean <- function(.data, x){ x <- enquo(x) .data %>% summarise(!! as_label(x) := mean(!! x, na.rm = TRUE)) }

  16. The function now behaves more like a dplyr verb. Start with orders and then pipe into the function r orders %>% my_mean(order_total)

  17. Clean up the code by removing the pipe that inside the function r my_mean <- function(.data, x){ x <- enquo(x) summarise( .data, !! as_label(x) := mean(!! x, na.rm = TRUE) ) }

  18. Confirm that there is no change in the behavior of the function r orders %>% my_mean(order_total)

  19. Add a show_query() step to preview the resulting SQL statement r orders %>% my_mean(order_total) %>% show_query()

  20. Try the function with a non-DB backed variable, such as mtcars. Use mpg as the aggregating variable r mtcars %>% my_mean(mpg)

Multiple variables

Create functions that handle a variable number of arguments. The goal of the exercise is to create an anti-select() function.

  1. Load the purrr package r library(purrr)

  2. Use ... as the second argument of a function called de_select(). Inside the function use enquos() to parse it r de_select <- function(.data, ...){ vars <- enquos(...) vars }

  3. Test the function using orders r orders %>% de_select(order_id, date)

  4. Add a step to the function that iterates through each quosure and prefixes a minus sign to tell select() to drop that specific field. Use map() for the iteration, and quo() to create the prefixed expression. r de_select <- function(.data, ...){ vars <- enquos(...) vars <- map(vars, ~ quo(- !! .x)) vars }

  5. Run the same test to view the new results r orders %>% de_select(order_id, date)

  6. Add the select() step. Use !!! to parse the vars variable inside select() r de_select <- function(.data, ...){ vars <- enquos(...) vars <- map(vars, ~ quo(- !! .x)) select(.data, !!! vars) }

  7. Run the test again, this time the operation will take place.
    r orders %>% de_select(order_id, date)

  8. Add a show_query() step to see the resulting SQL r orders %>% de_select(order_id, date) %>% show_query()

  9. Test the function with a different data set, such as mtcars r mtcars %>% de_select(mpg, wt, am)

Multiple queries

Suggested approach to avoid passing multiple, and similar, queries to the database

  1. Create a simple dplyr piped operation that returns the mean of order_total for the months of January, February and March as a group r orders %>% filter(date_month %in% c(1,2,3)) %>% summarise(mean = mean(order_total, na.rm = TRUE))

  2. Assign the first operation to a variable called a, and create copy of the operation but changing the selected months to January, March and April. Assign the second one to a variable called b. ```r a <- orders %>% filter(date_month %in% c(1,2,3)) %>% summarise(mean = mean(order_total, na.rm = TRUE))

    b <- orders %>% filter(date_month %in% c(1,3,4)) %>% summarise(mean = mean(order_total, na.rm = TRUE)) ```

  3. Use union() to pass a and b at the same time to the database r union(a, b)

  4. Pipe the previous instruction to show_query() to confirm that the resulting query is a single one r union(a, b) %>% show_query()

  5. Assign to a new variable called months an overlapping set of months r months <- list( c(1,2,3), c(1,3,4), c(2,4,6) )

  6. Use map() to cycle through each set of overlapping months. Notice that it returns three separate results, meaning that it went to the database three times r months %>% map( ~ orders %>% filter(date_month %in% .x) %>% summarise(mean = mean(order_total, na.rm = TRUE)) )

  7. Add a reduce() operation and use union() command to create a single query r months %>% map( ~ orders %>% filter(date_month %in% .x) %>% summarise(mean = mean(order_total, na.rm = TRUE)) ) %>% reduce(function(x, y) union(x, y))

  8. Use show_query() to see the resulting single query sent to the database r months %>% map( ~ orders %>% filter(date_month %in% .x) %>% summarise(mean = mean(order_total, na.rm = TRUE)) ) %>% reduce(function(x, y) union(x, y)) %>% show_query()

Multiple queries with an overlapping range

  1. Create a table with a from and to ranges r ranges <- tribble( ~ from, ~to, 1, 4, 2, 5, 3, 7 )

  2. See how map2() works by passing the two variables as the x and y arguments, and adding them as the function r map2(ranges$from, ranges$to, ~.x + .y)

  3. Replace x + y with the dplyr operation from the previous exercise. In it, re-write the filter to use x and y as the month ranges r map2( ranges$from, ranges$to, ~ orders %>% filter(date_month >= .x & date_month <= .y) %>% summarise(mean = mean(order_total, na.rm = TRUE)) )

  4. Add the reduce() operation r map2( ranges$from, ranges$to, ~ orders %>% filter(date_month >= .x & date_month <= .y) %>% summarise(mean = mean(order_total, na.rm = TRUE)) ) %>% reduce(function(x, y) union(x, y))

  5. Add a show_query() step to see how the final query was constructed. r map2( ranges$from, ranges$to, ~ orders %>% filter(date_month >= .x & date_month <= .y) %>% summarise(mean = mean(order_total, na.rm = TRUE)) ) %>% reduce(function(x, y) union(x, y)) %>% show_query()

Characters to field names

  1. Create two character variables. One with the name of a field in flights and another with a new name to be given to the field r my_field <- "new" orders_field <- "order_total"

  2. Add a mutate() step that adds the new field. And then another step selecting just the new field r orders %>% mutate(my_field = !! orders_field) %>% select(my_field)

  3. Add a mutate() step that adds the new field. And then another step selecting just the new field r orders %>% mutate(!! my_field := !! orders_field) %>% select(my_field)

  4. Wrap orders_field inside a sym() function r orders %>% mutate(!! my_field := !! sym(orders_field)) %>% select(my_field)

  5. Pipe the code into show_query() r orders %>% mutate(!! my_field := !! sym(orders_field)) %>% select(my_field) %>% show_query()

connection_close(con)


rstudio-conf-2020/big-data documentation built on Feb. 4, 2020, 5:24 p.m.