Data-masking is a distinctive feature of R whereby programming is performed directly on a data set, with columns defined as normal objects.
```r
# Unmasked programming
mean(mtcars$cyl + mtcars$am)

# Referring to columns is an error - Where is the data?
mean(cyl + am)

# Data-masking
with(mtcars, mean(cyl + am))
```
While data-masking makes it easy to program interactively with data frames, it makes it harder to create functions. Passing data-masked arguments to functions requires injection with the embracing operator `{{` or, in more complex cases, the injection operator [`!!`].
Injection (also known as quasiquotation) is a metaprogramming feature that allows you to modify parts of a program. This is needed because under the hood data-masking works by [defusing][topic-defuse] R code to prevent its immediate evaluation. The defused code is resumed later on in a context where data frame columns are defined.
Let's see what happens when we pass arguments to a data-masking function like `summarise()` in the normal way:

```r
my_mean <- function(data, var1, var2) {
  dplyr::summarise(data, mean(var1 + var2))
}

my_mean(mtcars, cyl, am)
```
The problem here is that `summarise()` defuses the R code it was supplied, i.e. `mean(var1 + var2)`. Instead we want it to see `mean(cyl + am)`. This is why we need injection: we need to modify that piece of code by injecting the code supplied to the function in place of `var1` and `var2`.
To inject a function argument in a data-masked context, just embrace it with `{{`:

```r
my_mean <- function(data, var1, var2) {
  dplyr::summarise(data, mean({{ var1 }} + {{ var2 }}))
}

my_mean(mtcars, cyl, am)
```
See [Data mask programming patterns][topic-data-mask-programming] to learn more about creating functions around data-masking functions.
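For the more complex cases mentioned above, the same function can be sketched with the explicit defuse-and-inject pattern: the arguments are first defused with `enquo()` and then injected with `!!`. The function name `my_mean2` and the column name `avg` are hypothetical, and the example assumes that dplyr and rlang are installed.

```r
my_mean2 <- function(data, var1, var2) {
  # Defuse the arguments: capture the code the caller typed, not its value
  var1 <- rlang::enquo(var1)
  var2 <- rlang::enquo(var2)

  # Inject the defused code back into the data-masked expression
  dplyr::summarise(data, avg = mean(!!var1 + !!var2))
}

my_mean2(mtcars, cyl, am)
```

Embracing an argument with `{{` is a shorthand for these two steps, which is why it is usually preferred when no intermediate manipulation of the defused code is needed.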
In normal R programming, objects are defined in the current environment, for instance in the global environment or the environment of a function.

```r
factor <- 1000

# Can now use `factor` in computations
mean(mtcars$cyl * factor)
```
This environment also contains all functions currently in scope. In a script this includes the functions attached with `library()` calls; in a package, the functions imported from other packages. If evaluation were performed only in the data frame, we'd lose track of these objects and functions necessary to perform computations.
To keep these objects and functions in scope, the data frame is inserted at the bottom of the current chain of environments. It comes first and has precedence over the user environment. In other words, it masks the user environment.
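As a small illustration of this lookup order, here is a sketch using base R's `with()`. It reuses the user-defined object `factor` from the example above; the expression mixes a data frame column with objects that only exist outside the data frame.

```r
factor <- 1000

# `cyl` is found first, in the data frame. `factor` and `mean()` are not
# columns of `mtcars`, so they are looked up further along the chain, in
# the user environment and the attached packages.
with(mtcars, mean(cyl * factor))
#> [1] 6187.5
```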
Since masking blends the data and the user environment by giving priority to the former, R can sometimes use a data frame column when you really intended to use a local object.
```r
# Defining an env-variable
cyl <- 1000

# Referring to a data-variable
dplyr::summarise(mtcars, mean(cyl))
```
The tidy eval framework provides [pronouns][.data] to help disambiguate between the mask and user contexts. It is often a good idea to use these pronouns in production code.
```r
library(magrittr)  # provides the `%>%` pipe

cyl <- 1000

mtcars %>%
  dplyr::summarise(
    mean_data = mean(.data$cyl),
    mean_env = mean(.env$cyl)
  )
```
Read more about this in [The data mask ambiguity][topic-data-mask-ambiguity].
Data-masking relies on three language features:
[Argument defusal][topic-defuse] with [substitute()] (base R) or [enquo()], [enquos()], and `{{` (rlang). R code is defused so it can be evaluated later on in a special environment enriched with data.
First class environments. Environments are a special type of list-like object in which defused R code can be evaluated. The named elements in an environment define objects. Lists and data frames can be transformed to environments:

```r
as.environment(mtcars)
#> <environment: 0x7febb17e3468>
```
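Explicit evaluation with `eval()` (base R) or `eval_tidy()` (rlang). Code that was defused, here with rlang's `expr()`, can be evaluated later on with `eval()`:

```r
expr(1 + 1)
#> 1 + 1
eval(expr(1 + 1))
#> [1] 2
```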
By default `eval()` and `eval_tidy()` evaluate in the current environment.

```r
code <- expr(mean(cyl + am))
eval(code)
```
You can supply an optional list or data frame that will be converted to an environment.

```r
eval(code, mtcars)
```
Evaluation of defused code then occurs in the context of a data mask.
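Putting the three features together, a bare-bones data-masking function can be sketched in base R alone. The name `my_with` is hypothetical, and this is only an illustration of the defuse-then-evaluate idea, not how rlang or dplyr implement it.

```r
my_with <- function(data, expr) {
  # Defuse the argument instead of evaluating it
  code <- substitute(expr)

  # Resume evaluation with `data` as the mask. Symbols that are not
  # columns of `data` are looked up in the caller's environment.
  eval(code, data, enclos = parent.frame())
}

factor <- 1000
my_with(mtcars, mean(cyl + am) * factor)
#> [1] 6593.75
```

Here `cyl` and `am` come from the data frame while `factor` and `mean()` come from the user environment, which is the blending of contexts described above.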
The tidyverse embraced the data-masking approach in packages like ggplot2 and dplyr and eventually developed its own programming framework in the rlang package. None of this would have been possible without the following landmark developments from S and R authors.
The S language introduced data scopes with [attach()] (Becker, Chambers and Wilks, The New S Language, 1988).
The S language introduced data-masked formulas in modelling functions (Chambers and Hastie, 1993).
Peter Dalgaard (R team) wrote the frametools package in 1997. It was later included in R as [base::transform()] and [base::subset()]. This API is an important source of inspiration for the dplyr package. It was also the first appearance of selections, a variant of data-masking extended and codified later on in the tidyselect package.
In 2000 Luke Tierney (R team) changed formulas to keep track of their original environments. This change, published in R 1.1.0, was a crucial step towards hygienic data masking, i.e. the proper resolution of symbols in their original environments. Quosures were inspired by the environment-tracking mechanism of formulas.
Luke introduced [base::with()] in 2001.
In 2006 the data.table package included data-masking and selections in the `i` and `j` arguments of the `[` method of a data frame.
The dplyr package was published in 2014.
The rlang package developed tidy eval in 2017 as the data-masking framework of the tidyverse. It introduced the notions of [quosure][topic-quosure], [implicit injection][topic-inject] with `!!` and `!!!`, and [data pronouns][.data].
In 2019, injection with `{{` was introduced in rlang 0.4.0 to simplify the defuse-and-inject pattern. This operator allows R programmers to transport data-masked arguments across functions more intuitively and with minimal boilerplate.
See also [Data mask programming patterns][topic-data-mask-programming] and [Defusing R expressions][topic-defuse].