knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

The ezPurrr package is built on top of the purrr library. The purpose of the package is to help make functional programming workflows more simple and efficient. Let's start by loading ezPurrr and a few other packages we'll use for data manipulation in this vignette.

library(ezPurrr)
library(dplyr)
library(tidyr)
library(ggplot2)

ezPurrr requires a nested dataset. For example

nest_df <- palmerpenguins::penguins %>%
  group_by(island, species) %>%
  nest()

head(nest_df)

This data frame has columns for the penguin species and the island on which the data was collected but, critically, an additional data column. This column is a list column that has all the data for the corresponding species/island row. All functions in ezPurrr require the data to first be in this format.

From here, we want to be able to work with the data like normal, but it's difficult when the data are in this list column format. This is where teh sample_*() functions come in.

sample_*()

sample_row()

The sample_row() function returns one random or one particular row (when an index argument is supplied) from the nested data as a list or a data frame. If returning a list (default), the first item of the list with be the data from the given row, while the remaining elements of the list will be the grouping variables from that row (i.e., the species and island for the corresponding row, in our previous example). If returning a data frame (with type = df argument), then the data frame will include one row of the nested data, as well as each grouping variable as a column (with identical repeated content).

For example

nest_df %>% 
  sample_row()

nest_df %>% 
  sample_row(index = 3) %>% 
  head()

nest_df %>% 
  sample_row(index = 3, type = 'df') %>% 
  head()

sample_data()

sample_data() returns one random or one particular data column (with index argument) from the nested data. The primary difference between sample_data() versus sample_row() is that the grouping variables are not returned - just the data. For example:

nest_df %>% 
  sample_data()

nest_df %>% 
  sample_data(index = 3)

sample_group()

Finally, sample_group() returns one random or one particular grouping columns (with index argument) from the nested dataset with no data. For example:

nest_df %>% 
  sample_group()

nest_df %>% 
  sample_group(index = 3)

broadcast()

broadcast()

Generally, we will use the data we have sampled to conduct some operations. Once we've conducted those operations we just need to wrap them all in a function, then broadcast() them across the sets of data (i.e., all rows).

In other words, the sampling dataset is used to test the code you want to be applied for each row, such as figures, models, or transformations.

Let's look at an example for plotting. First, we create our sample data.

samp <- nest_df %>% 
  sample_data(index = 3)

samp

Next, we write the code to create our plot, using samp as our sample data to test our code.

ggplot(samp, aes(x = bill_length_mm,
                 y = bill_depth_mm,
                 color = body_mass_g)) +
  geom_point()

Finally, we can wrap this into a function.

plotting <- function(data){
  ggplot(data, aes(x = bill_length_mm,
                   y = bill_depth_mm,
                   color = body_mass_g)) +
    geom_point()
}

Notice the code above is exactly as we had before, we've just changed our sample data set to data as an argument within a function. Normally, writing functions for plotting (and many other functions in the tidyverse) is difficult because of non-standard evaluation, but in this case we don't have to worry about any of that.

Now, we can broadcast() the function to all of the data!

broadcasted_df = nest_df %>% 
  broadcast(plotting)
broadcasted_df

Notice we now have a new column, called output, which has all the plots! To look at the first plot, we can print it with

broadcasted_df$output[[1]]

Or, similarly:

broadcasted_df %>% 
  pull(output) %>% 
  purrr::pluck(1)

You can also use it for modeling or other functions. For example:

tmp_df = nest_df %>% 
  sample_data(index = 3)

lm(body_mass_g ~ bill_length_mm * bill_depth_mm, data = tmp_df)

modeling <- function(data){
   lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = data)
}

nest_df %>% 
  broadcast(modeling) %>% 
  .$output

Instead of a list, you can also make the output a number (double).

modeling <- function(data){
  model = lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = data)
  model$coefficients[2] # the slope for bill_length_mm
}

nest_df %>% broadcast(modeling) %>% .$output

broadcast_group()

broadcast_group() allows you to also use grouping variables in the function. It is still under development so currently it only works in a more limited way. For example, you can include grouping variables as the title for plots.

df = nest_df %>% 
  sample_row(index = 3, type = 'list')

plotting = function(data, species, island){
  ggplot(data, aes(x = bill_length_mm,
                             y = bill_depth_mm,
                             color = body_mass_g)) +
    geom_point() +
    labs(title = paste(species, island))
}

# example with a single plot
plotting(df$data, df$species, df$island)

nest_df = nest_df %>% 
  broadcast_group(plotting)

nest_df$output[[1]]
nest_df$output[[4]]
tmp <- quo(
 data %>% 
   filter(bill_length_mm > 50) %>% 
   select(bill_length_mm, bill_depth_mm) %>% 
   cor()
 )

tmp

library(rlang)
expr <- quo_get_expr(tmp)
chr_expr <- gsub("ex_data", "data", expr)
expr <- parse_expr(paste0(parse_exprs(chr_expr[-1]), collapse = " %>% "))

map(nested_d, !!tmp)
#This is just and example code chunk where Daniel was playing with the package.

library(palmerpenguins)
library(purrr)
library(dplyr)
library(ggplot2)
library(ezPurrr)

nested_d <- penguins %>% 
  nest_by(species) 

ex_data <- nested_d %>% 
  sample_data()

ggplot(ex_data, aes(sex, flipper_length_mm)) +
  geom_boxplot(aes(fill = sex)) +
  guides(fill = "none") +
  theme_minimal()

# wrap in a function
p <- function(d) {
  ggplot(d, aes(.data$sex, .data$flipper_length_mm)) +
    geom_boxplot(aes(fill = .data$sex)) +
    guides(fill = "none") +
    theme_minimal()
}

# this doesn't seem to work
#broadcast(nested_d, p)

# this does, so I would have expected the above to
map(nested_d$data, p)

# try a second one
ex_data %>% 
  filter(bill_length_mm > 50) %>% 
  select(bill_length_mm, bill_depth_mm) %>% 
  cor() 

# wrap in function
f <- function(d) {
  d %>% 
    filter(.data$bill_length_mm > 50) %>% 
    select(.data$bill_length_mm, .data$bill_depth_mm) %>% 
    cor() 
}
# broadcast(nested_d, f) # also doesn't work
map(nested_d$data, f) # this does

# Me messing about with quosures unsuccessfully 
# tmp <- quo(
#   data %>% 
#     filter(bill_length_mm > 50) %>% 
#     select(bill_length_mm, bill_depth_mm) %>% 
#     cor()
#   )
# 
# tmp
# 
# library(rlang)
# expr <- quo_get_expr(tmp)
# chr_expr <- gsub("ex_data", "data", expr)
# expr <- parse_expr(paste0(parse_exprs(chr_expr[-1]), collapse = " %>% "))
# 
# map(nested_d, !!tmp)


wanjiag/ezPurrr documentation built on May 10, 2022, 9:32 p.m.