knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The ezPurrr package is built on top of the purrr library. The purpose of the package is to help make functional programming workflows more simple and efficient. Let's start by loading ezPurrr and a few other packages we'll use for data manipulation in this vignette.
library(ezPurrr) library(dplyr) library(tidyr) library(ggplot2)
ezPurrr requires a nested dataset. For example
nest_df <- palmerpenguins::penguins %>% group_by(island, species) %>% nest() head(nest_df)
This data frame has columns for the penguin species and the island on which the data was collected but, critically, an additional data column. This column is a list column that has all the data for the corresponding species/island row. All functions in ezPurrr require the data to first be in this format.
From here, we want to be able to work with the data like normal, but it's difficult when the data are in this list column format. This is where teh sample_*()
functions come in.
sample_*()
sample_row()
The sample_row()
function returns one random or one particular row (when an index
argument is supplied) from the nested data as a list or a data frame. If returning a list (default), the first item of the list with be the data from the given row, while the remaining elements of the list will be the grouping variables from that row (i.e., the species and island for the corresponding row, in our previous example). If returning a data frame (with type = df
argument), then the data frame will include one row of the nested data, as well as each grouping variable as a column (with identical repeated content).
For example
nest_df %>% sample_row() nest_df %>% sample_row(index = 3) %>% head() nest_df %>% sample_row(index = 3, type = 'df') %>% head()
sample_data()
sample_data() returns one random or one particular data column (with index
argument) from the nested data. The primary difference between sample_data()
versus sample_row()
is that the grouping variables are not returned - just the data. For example:
nest_df %>% sample_data() nest_df %>% sample_data(index = 3)
sample_group()
Finally, sample_group()
returns one random or one particular grouping columns (with index
argument) from the nested dataset with no data. For example:
nest_df %>% sample_group() nest_df %>% sample_group(index = 3)
broadcast()
broadcast()
Generally, we will use the data we have sampled to conduct some operations. Once we've conducted those operations we just need to wrap them all in a function, then broadcast()
them across the sets of data (i.e., all rows).
In other words, the sampling dataset is used to test the code you want to be applied for each row, such as figures, models, or transformations.
Let's look at an example for plotting. First, we create our sample data.
samp <- nest_df %>% sample_data(index = 3) samp
Next, we write the code to create our plot, using samp
as our sample data to test our code.
ggplot(samp, aes(x = bill_length_mm, y = bill_depth_mm, color = body_mass_g)) + geom_point()
Finally, we can wrap this into a function.
plotting <- function(data){ ggplot(data, aes(x = bill_length_mm, y = bill_depth_mm, color = body_mass_g)) + geom_point() }
Notice the code above is exactly as we had before, we've just changed our sample data set to data
as an argument within a function. Normally, writing functions for plotting (and many other functions in the tidyverse) is difficult because of non-standard evaluation, but in this case we don't have to worry about any of that.
Now, we can broadcast()
the function to all of the data!
broadcasted_df = nest_df %>% broadcast(plotting) broadcasted_df
Notice we now have a new column, called output
, which has all the plots! To look at the first plot, we can print it with
broadcasted_df$output[[1]]
Or, similarly:
broadcasted_df %>% pull(output) %>% purrr::pluck(1)
You can also use it for modeling or other functions. For example:
tmp_df = nest_df %>% sample_data(index = 3) lm(body_mass_g ~ bill_length_mm * bill_depth_mm, data = tmp_df) modeling <- function(data){ lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = data) } nest_df %>% broadcast(modeling) %>% .$output
Instead of a list, you can also make the output a number (double).
modeling <- function(data){ model = lm(body_mass_g ~ bill_length_mm + bill_depth_mm, data = data) model$coefficients[2] # the slope for bill_length_mm } nest_df %>% broadcast(modeling) %>% .$output
broadcast_group()
broadcast_group()
allows you to also use grouping variables in the function. It is still under development so currently it only works in a more limited way. For example, you can include grouping variables as the title for plots.
df = nest_df %>% sample_row(index = 3, type = 'list') plotting = function(data, species, island){ ggplot(data, aes(x = bill_length_mm, y = bill_depth_mm, color = body_mass_g)) + geom_point() + labs(title = paste(species, island)) } # example with a single plot plotting(df$data, df$species, df$island) nest_df = nest_df %>% broadcast_group(plotting) nest_df$output[[1]] nest_df$output[[4]]
tmp <- quo( data %>% filter(bill_length_mm > 50) %>% select(bill_length_mm, bill_depth_mm) %>% cor() ) tmp library(rlang) expr <- quo_get_expr(tmp) chr_expr <- gsub("ex_data", "data", expr) expr <- parse_expr(paste0(parse_exprs(chr_expr[-1]), collapse = " %>% ")) map(nested_d, !!tmp)
#This is just and example code chunk where Daniel was playing with the package. library(palmerpenguins) library(purrr) library(dplyr) library(ggplot2) library(ezPurrr) nested_d <- penguins %>% nest_by(species) ex_data <- nested_d %>% sample_data() ggplot(ex_data, aes(sex, flipper_length_mm)) + geom_boxplot(aes(fill = sex)) + guides(fill = "none") + theme_minimal() # wrap in a function p <- function(d) { ggplot(d, aes(.data$sex, .data$flipper_length_mm)) + geom_boxplot(aes(fill = .data$sex)) + guides(fill = "none") + theme_minimal() } # this doesn't seem to work #broadcast(nested_d, p) # this does, so I would have expected the above to map(nested_d$data, p) # try a second one ex_data %>% filter(bill_length_mm > 50) %>% select(bill_length_mm, bill_depth_mm) %>% cor() # wrap in function f <- function(d) { d %>% filter(.data$bill_length_mm > 50) %>% select(.data$bill_length_mm, .data$bill_depth_mm) %>% cor() } # broadcast(nested_d, f) # also doesn't work map(nested_d$data, f) # this does # Me messing about with quosures unsuccessfully # tmp <- quo( # data %>% # filter(bill_length_mm > 50) %>% # select(bill_length_mm, bill_depth_mm) %>% # cor() # ) # # tmp # # library(rlang) # expr <- quo_get_expr(tmp) # chr_expr <- gsub("ex_data", "data", expr) # expr <- parse_expr(paste0(parse_exprs(chr_expr[-1]), collapse = " %>% ")) # # map(nested_d, !!tmp)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.