knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  eval = FALSE
)
library(where)
library(dplyr)
library(data.table)
library(ggplot2)
The where package has one main function, run(), that provides a clean syntax for vectorising the use of NSE (non-standard evaluation), for example in ggplot2, dplyr, or data.table. There are also two (infix) wrappers, %where% and %for%, that provide arguably cleaner syntax. A typical example might look like:
subgroups <- .(all = TRUE,
               long_sepal = Sepal.Length > 6,
               long_petal = Petal.Length > 5.5)

(iris %>%
  filter(x) %>%
  summarise(across(Sepal.Length:Petal.Width, mean), .by = Species)) %for% subgroups
Here we have a population dataset and various subpopulations of interest, and we
want to apply the same code over all subpopulations. If the subpopulations were
a partition of the data (for example, a census population divided into 5-year
age bands), then we could use group_by() in dplyr or faceting in ggplot2 to
apply the same code over all of them. In general, however, subpopulations are
not so easy to apply over, for example if some are defined by age, others by
gender, and others by a combination of the two. Even a single variable that
allows multiple options to be selected (for example, ethnicity in the New
Zealand Census) can define overlapping subpopulations (in this case ethnic
groups) that cannot be vectorised over with the partitioning functionality
(such as grouping and faceting) in standard packages. The where package makes
these examples straightforward.
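As a running example we will use the iris dataset and the following
(largely unnatural) sub-populations of irises: all irises, irises with a sepal
length greater than 6, and irises with a petal length greater than 5.5.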
These subgroups can be defined with the .() function, which captures the filter
conditions that define these populations:
subgroups <- .(all = TRUE, long_sepal = Sepal.Length > 6, long_petal = Petal.Length > 5.5)
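These subgroups overlap, which is why a single grouping variable cannot represent them. A quick base R check makes this concrete. It is a minimal sketch that uses plain logical vectors rather than the package's .() function:

```r
# Membership of each iris in the two non-trivial subgroups,
# written as ordinary logical vectors (not captured expressions).
long_sepal <- iris$Sepal.Length > 6
long_petal <- iris$Petal.Length > 5.5

# TRUE if at least one flower belongs to both subgroups (they overlap).
overlap <- any(long_sepal & long_petal)

# TRUE if at least one flower belongs to neither subgroup
# (together they do not cover the data).
uncovered <- any(!long_sepal & !long_petal)

overlap
uncovered
```

If both checks are TRUE, the two subgroups neither partition the data nor stay disjoint, so a group_by()-style split cannot reproduce them.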
Using these subgroups directly in standard R is tricky. For example, we could form the separate populations with repeated code:
# With base R
iris
iris[iris[["Sepal.Length"]] > 6, ]    # or with(iris, iris[Sepal.Length > 6, ])
iris[iris[["Petal.Length"]] > 5.5, ]  # or with(iris, iris[Petal.Length > 5.5, ])

# With dplyr
iris
filter(iris, Sepal.Length > 6)
filter(iris, Petal.Length > 5.5)

# With data.table
iris
as.data.table(iris)[Sepal.Length > 6]
as.data.table(iris)[Petal.Length > 5.5]
or this could be done by first explicitly capturing expressions (as done above
with .()) and then evaluating them:
lapply(subgroups, function(group) with(iris, iris[eval(group), ]))
This requires some comfort with managing expressions in R and can quickly get
messy with more complex queries, particularly if we want to apply across more
than one set of expressions. The run() function hides these manipulations:
run(with(iris, iris[subgroup, ]), subgroup = subgroups)
# or
with(iris, iris[x, ]) %for% subgroups
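For intuition, the kind of manipulation being hidden can be sketched in base R. The helper run_sketch() below is hypothetical and is not the package's implementation: it handles a single symbol, assumed to be named subgroup, and uses quote() in place of .().

```r
# Hypothetical helper for illustration only; not part of the where package.
# Substitutes each value for the symbol `subgroup` in `expr`, then evaluates.
run_sketch <- function(expr, values, env = parent.frame()) {
  force(env)                # fix the calling environment up front
  expr <- substitute(expr)  # capture the expression unevaluated
  lapply(values, function(value) {
    filled <- do.call(substitute, list(expr, list(subgroup = value)))
    eval(filled, env)
  })
}

# Filter conditions captured with quote(), standing in for .()
conditions <- list(
  all        = TRUE,
  long_sepal = quote(Sepal.Length > 6),
  long_petal = quote(Petal.Length > 5.5)
)

subsets <- run_sketch(with(iris, iris[subgroup, ]), conditions)
sapply(subsets, nrow)
```

Even this one-symbol version needs substitute(), do.call() and eval() to cooperate, which gives a sense of how quickly the bookkeeping grows once several sets of expressions are involved.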
A standard group by and summarise operation:
library(dplyr)

subgroups = .(all = TRUE,
              long_sepal = Sepal.Length > 6,
              long_petal = Petal.Length > 5.5)
functions = .(mean, sum, prod)

run(
  iris %>%
    filter(subgroup) %>%
    summarise(across(Sepal.Length:Petal.Width, summary), .by = Species),
  subgroup = subgroups,
  summary = functions
)
The same using data.table:
library(data.table)

df <- as.data.table(iris)
run(df[subgroup, lapply(.SD, functions), keyby = "Species",
       .SDcols = Sepal.Length:Petal.Width],
    subgroup = subgroups, functions = functions)
Producing the same ggplot over the different populations:
library(ggplot2)

plots <- run(
  ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal(),
  subgroup = subgroups
)
Map(function(plot, name) plot + ggtitle(name), plots, names(plots))
Or different plots for the full population:
run(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(geom_point(), geom_smooth())
)
A natural extension of the previous example can fail in a non-obvious way, because expressions are evaluated differently than might be intended. For example, the following does not work
# Fails
run(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(geom_point(), geom_smooth(), geom_quantile() + geom_rug())
)
since, for the third plot, it tries to evaluate
# Fails
ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
  (geom_quantile() + geom_rug()) +
  theme_minimal()
and geom_quantile() + geom_rug() throws an error. This particular use case can
be accomplished by putting the separate geoms in a list:
run(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(point = geom_point(),
           smooth = geom_smooth(),
           quantilerug = list(geom_quantile(), geom_rug()))
)

# or by separating out the combined geoms as a function (also using a list)
geom_quantilerug <- function() list(geom_quantile(), geom_rug())
run(
  ggplot(iris, aes(Sepal.Length, Sepal.Width)) +
    plot +
    theme_minimal(),
  plot = .(point = geom_point(),
           smooth = geom_smooth(),
           quantilerug = geom_quantilerug())
)
We can call run() from within a function to further hide details. For
example, we could produce subpopulation summaries for the different species of
iris:
population_summaries <- function(df) run(with(df, df[subgroup, ]), subgroup = subgroups)
as.data.table(iris)[, .(population_summaries(.SD)), keyby = "Species"]
As a more general example, if we are undertaking an analysis of different subpopulations, then we could fix the populations in a function and apply code immediately over all groups.
on_subpopulations <- function(expr, populations = subgroups)
  eval(substitute(run(expr, subgroup = populations), list(expr = substitute(expr))))

on_subpopulations(as.data.table(iris)[subgroup])

on_subpopulations(
  iris %>%
    filter(subgroup) %>%
    summarise(across(Sepal.Length:Petal.Width, mean), .by = Species)
)

on_subpopulations(
  ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal()
)
As with the DRY (Don't Repeat Yourself) principle in general, this isolation makes it straightforward to add a new subpopulation, here by editing the subgroups:
subgroups = .(all = TRUE,
              long_sepal = Sepal.Length > 6,
              long_petal = Petal.Length > 5.5,
              versicolor = Species == "versicolor")
Taking things to the absurd, we can also isolate out the analysis code:
analyses <- .(
  subset = as.data.table(iris)[subgroup],
  summarise = iris %>%
    filter(subgroup) %>%
    summarise(across(Sepal.Length:Petal.Width, mean), .by = Species),
  plot = ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal()
)
lapply(analyses, function(expr) do.call("on_subpopulations", list(expr)))
The ggplot example
on_subpopulations(
  ggplot(filter(iris, subgroup), aes(Sepal.Length, Sepal.Width)) +
    geom_point() +
    theme_minimal()
)
does not give identical results to executing the ggplot code with the given
subgroups, since the ggplot object stores the execution environment, which will
be different.
If important, this can be remedied by capturing and passing the calling
environment in the on_subpopulations() function:
on_subpopulations <- function(expr, populations = subgroups) {
  e <- parent.frame()
  eval(substitute(run(expr, subgroup = populations, e = e),
                  list(expr = substitute(expr))))
}
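As a generic base R illustration of why the evaluation environment can matter, compare evaluating a captured expression inside a function with evaluating it in the caller's environment. This is not the package's code, and the helper names eval_locally() and eval_in_caller() are hypothetical:

```r
threshold <- 6

# Evaluates the expression in the function's own frame,
# so the local `threshold` shadows the caller's value.
eval_locally <- function(expr) {
  threshold <- 100
  eval(substitute(expr), environment())
}

# Evaluates the expression in the calling environment instead,
# so the caller's `threshold` is the one that is found.
eval_in_caller <- function(expr) {
  threshold <- 100
  eval(substitute(expr), parent.frame())
}

local_result  <- eval_locally(threshold + 1)
caller_result <- eval_in_caller(threshold + 1)
```

The same expression gives different results depending on which environment is supplied, which is why a wrapper may want to capture parent.frame() and pass it along.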
As some syntactic sugar, there are also two infix versions of run():
%where% is a full infix version of run(), taking the expression as the left
argument and a named list of values to be substituted as the right argument.
%for% has slightly simplified syntax but only allows one substitution, for
the symbol x.
as.data.table(iris)[subgroup, lapply(.SD, summary), keyby = "Species",
                    .SDcols = Sepal.Length:Petal.Width] %where%
  list(subgroup = subgroups[1:3], summary = functions)

# note `subgroup` replaced with 'x'
as.data.table(iris)[x, lapply(.SD, mean), keyby = "Species",
                    .SDcols = Sepal.Length:Petal.Width] %for% subgroups
Complex expressions (for example, with pipes or +) need to be wrapped with
"()" or "{}". For example
(iris %>% filter(x) %>% summarise(across(Sepal.Length:Petal.Width, mean), .by = Species)) %for% subgroups
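The wrapping matters for + because R gives every %...% operator higher precedence than +. The base R sketch below checks this on unevaluated expressions, using a placeholder operator name %op% that is never actually called:

```r
# quote() only parses, so the placeholder operator %op% need not exist.
unwrapped <- quote(a + b %op% s)
wrapped   <- quote((a + b) %op% s)

# The outermost call shows which operator is applied last.
outer_unwrapped <- as.character(unwrapped[[1]])  # `+`, i.e. a + (b %op% s)
outer_wrapped   <- as.character(wrapped[[1]])    # `%op%`, i.e. (a + b) %op% s
```

Without the parentheses, only the last term of a ggplot-style sum would be handed to the infix operator.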
An additional %with% function provides a similar syntax to %where% for
standard evaluation:
(a + b) %with% {
  a = 1
  b = 2
}
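For comparison, base R's with() gives a similar effect under standard evaluation by evaluating an expression against a list of values. This is only an analogy; nothing is assumed here about how %with% is implemented:

```r
# Evaluate a + b using values supplied in a list
total <- with(list(a = 1, b = 2), a + b)
total
```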