library(gofl)
library(dplyr)
A common task in data science pipelines is to summarize a dataset by subgroups. For example, consider everyone's (least) favorite dataset, mtcars:
str(mtcars)
This dataset contains two binary variables by which one may be interested in grouping the data: vs and am. Two binary variables present three groupings for summarizing the data (for $2 + 2 + 4 = 8$ possible groups in total): marginally by the two groups within each variable, or jointly by the 4 groups defined by the cross of the two variables.
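The group count can be checked with a few lines of base R. This is only a sketch of the arithmetic and does not use gofl; the names lvls and n_groups are introduced here for illustration.

```r
# Levels of the two binary grouping variables in mtcars
lvls <- list(vs = sort(unique(mtcars$vs)), am = sort(unique(mtcars$am)))

# Each margin contributes its number of levels;
# the joint grouping contributes the product of the two.
n_groups <- c(
  vs      = length(lvls$vs),
  am      = length(lvls$am),
  `vs:am` = length(lvls$vs) * length(lvls$am)
)

n_groups       # groups per grouping
sum(n_groups)  # total number of groups across the three groupings
```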
One way to obtain these groups is to use dplyr::group_by. By way of example, say we are interested in the mean miles per gallon within each of these groups:
summarize_mpg <- function(dt) summarize(dt, mean(mpg))

mtcars %>% group_by(vs) %>% summarize_mpg()
mtcars %>% group_by(am) %>% summarize_mpg()
mtcars %>% group_by(vs, am) %>% summarize_mpg()
That works, but as the number of grouping variables increases, this approach becomes untenable. gofl takes a different approach. The following gofl formula specifies groupings for each of the margins and the joint:
mt_groups <- ~ vs + am + vs:am
# or equivalently
mt_groups <- ~ vs*am
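The equivalence of the two spellings mirrors how base R expands formula operators. The sketch below uses only stats::terms to show that expansion. It illustrates the standard formula algebra and makes no claim about how gofl parses formulas internally; the names long_form and short_form are introduced here for illustration.

```r
# Term labels as expanded by base R's formula machinery
long_form  <- attr(terms(~ vs + am + vs:am), "term.labels")
short_form <- attr(terms(~ vs * am), "term.labels")

long_form
short_form

# Both spellings expand to the same set of terms
identical(long_form, short_form)
```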
A gofl specification has two parts: a formula defining the groupings and a list defining the levels of the variables in the formula. At this time, the list of levels needs to be created by hand. Future versions of gofl may provide a means to automate this step.
dat <- list(
  vs = sort(unique(mtcars$vs)), # The sorting isn't strictly necessary.
  am = sort(unique(mtcars$am))
)
With these two pieces of data the grouping specification can be created:
grps <- create_groupings(formula = mt_groups, data = dat)
The result of create_groupings is a list with three elements: data, groupings, and index_fcn. The data element is just a copy of the data argument. The groupings element is a list where each element is the specification for one group defined by the formula. In our example, the first group defines the subset containing vs == 0:
str(grps$groupings[[1]])
A single group's specification has 4 parts:
- i: the index of the group (described below)
- q: a quosure that can be used to carry out the subgrouping in (e.g.) dplyr::filter
- g: a list that specifies which variables and levels define the group
- tags: a character vector of tags (see the tags section)

Here's another example of a grouping; in this case it is the subset where vs == 1 and am == 0:
str(grps$groupings[[7]])
With the groupings defined, we can now create a pipeline for summarizing within each group.
purrr::map_dfr(
  .x = grps$groupings,
  .f = ~ {
    mtcars %>%
      filter(!!! .x$q) %>%
      summarize_mpg() %>%
      bind_cols(tibble(!!! .x$g))
  }
)
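To make a single iteration of that pipeline concrete, here is a minimal sketch that does not call gofl at all. The objects q and g are hand-built stand-ins for one grouping's q and g parts, under the assumption that q holds filter conditions as quosures and g holds the defining levels as a named list. The column name mean_mpg is also introduced here for illustration.

```r
library(dplyr)

# Hand-built stand-ins for one grouping (assumed shapes, not gofl output):
# the subset where vs == 1 and am == 0
q <- rlang::quos(vs == 1, am == 0)
g <- list(vs = 1, am = 0)

res <- mtcars %>%
  filter(!!!q) %>%                         # splice the conditions into filter()
  summarize(mean_mpg = mean(mpg)) %>%      # one-row summary for this group
  bind_cols(tibble(!!!g))                  # label the row with the group's levels

res
```

Splicing with !!! is what lets the same pipeline body serve every group, whether the group is defined by one condition or several.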
That's the basic idea of gofl. From the patterns described above you can specify much more complicated groupings.
Each element of the groupings list is given a unique index of the form v1-v2-v3-...-vn, where the order is the order of the variables given in the create_groupings data argument. In the example above, vs corresponds to the first position (v1) and am to the second (v2). Within the vs position:
- vs = 0 corresponds to 1, and vs = 1 corresponds to 2.
- A group not involving vs corresponds to 0.

The index 1-0 in our example is the group defined by vs == 0; 0-2 is the group defined by am == 1; 1-2 is defined by vs == 0 & am == 1.
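The indexing rule can be reproduced in a few lines of base R. The helper make_index below is hypothetical and not part of gofl; it is only a sketch of the scheme as described, assuming each position holds the 1-based rank of the chosen level and 0 when the variable is not involved.

```r
# Hypothetical helper (not a gofl function): build an index string such as "1-0".
# `levels` is a named list of variable levels; `...` names the chosen level(s).
make_index <- function(levels, ...) {
  chosen <- list(...)
  pos <- vapply(names(levels), function(v) {
    # 0 if the variable is not part of the group, else the level's position
    if (is.null(chosen[[v]])) 0L else match(chosen[[v]], levels[[v]])
  }, integer(1))
  paste(pos, collapse = "-")
}

dat <- list(vs = c(0, 1), am = c(0, 1))

make_index(dat, vs = 0)          # "1-0"
make_index(dat, am = 1)          # "0-2"
make_index(dat, vs = 0, am = 1)  # "1-2"
```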
The index_fcn part of a create_groupings object allows the user to look up indices.
grps$index_fcn(am = 0)
grps$index_fcn(am = 0, vs = 1)
grps$index_fcn(vs = 1)
Indices are useful when the number of groups is large and you need a way to quickly find a particular group.
In real applications, you may want downstream processes to apply different functions to different groups. The tag function allows you to tag particular groups with arbitrary character vectors. Let's make our example above even more unrealistic, but nonetheless illustrative, and take the mean in the marginal groups and the median in the joint groups.
First, we tag the grouping as appropriate:
mt_groups <- ~ tag(vs + am, "marginal") + tag(vs:am, "joint")
grps <- create_groupings(formula = mt_groups, data = dat)
Note that the tag element is now populated:
grps$groupings[c(1,5)]
An example pipeline taking advantage of the tag may look like:
summarize_mpg2 <- function(dt, tag) {
  # Pick the summary function according to the group's tag
  avg <- switch(tag, "marginal" = mean, "joint" = median)
  summarize(dt, avg(mpg))
}

purrr::map_dfr(
  .x = grps$groupings,
  .f = ~ {
    mtcars %>%
      filter(!!! .x$q) %>%
      summarize_mpg2(.x$tag) %>%
      bind_cols(tibble(!!! .x$g))
  }
)