Data-masking is a distinctive feature of R whereby programming is performed directly on a data set, with columns defined as normal objects.

# Unmasked programming
mean(mtcars$cyl + mtcars$am)

# Referring to columns is an error - Where is the data?
mean(cyl + am)

# Data-masking
with(mtcars, mean(cyl + am))

While data-masking makes it easy to program interactively with data frames, it makes it harder to create functions. Passing data-masked arguments to functions requires injection with the embracing operator {{ or, in more complex cases, the injection operator !!.

Why does data-masking require embracing and injection?

Injection (also known as quasiquotation) is a metaprogramming feature that allows you to modify parts of a program. This is needed because, under the hood, data-masking works by defusing R code to prevent its immediate evaluation. The defused code is resumed later on in a context where data frame columns are defined.

Let's see what happens when we pass arguments to a data-masking function like summarise() in the normal way:

my_mean <- function(data, var1, var2) {
  dplyr::summarise(data, mean(var1 + var2))
}

my_mean(mtcars, cyl, am)

The problem here is that summarise() defuses the R code it was supplied, i.e. mean(var1 + var2). Instead, we want it to see mean(cyl + am). This is why we need injection: we need to modify that piece of code by injecting the code supplied to the function in place of var1 and var2.
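The effect can be reproduced without dplyr. The following is a minimal sketch in base R, where show_code() and my_wrapper() are hypothetical helpers introduced only for illustration: show_code() defuses its argument with substitute(), much as a data-masking function would, and my_wrapper() passes its argument along in the normal way.

```r
# Hypothetical defusing function: returns the code it was supplied
# instead of evaluating it (base R's substitute() does the defusing)
show_code <- function(x) substitute(x)

# Hypothetical wrapper that passes its argument along in the normal way
my_wrapper <- function(var) show_code(var)

direct <- show_code(cyl)    # the defused code is `cyl`
wrapped <- my_wrapper(cyl)  # the defused code is `var`, not `cyl`

print(direct)
print(wrapped)
```

The wrapper loses track of what the user typed: the defusing function only ever sees the name of the wrapper's argument.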

To inject a function argument in a data-masked context, just embrace it with {{:

my_mean <- function(data, var1, var2) {
  dplyr::summarise(data, mean({{ var1 }} + {{ var2 }}))
}

my_mean(mtcars, cyl, am)
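Embracing can be read as a shortcut for two separate steps: defusing the argument with enquo(), then injecting it with !!. The sketch below spells out those steps in a hypothetical variant of the function, named my_mean_injected() for illustration; it assumes that rlang and dplyr are installed.

```r
library(rlang)

# Hypothetical variant of my_mean() written without `{{`
my_mean_injected <- function(data, var1, var2) {
  # Step 1: defuse the arguments supplied by the caller
  var1 <- enquo(var1)
  var2 <- enquo(var2)

  # Step 2: inject the defused code into the data-masked expression
  dplyr::summarise(data, mean = mean(!!var1 + !!var2))
}

result <- my_mean_injected(mtcars, cyl, am)
print(result)
```

The two-step form is mostly useful when the defused argument needs to be inspected or modified before it is injected; otherwise {{ is the simpler choice.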

See the rlang topic on data mask programming (topic_data_mask_programming) to learn more about creating functions around data-masking functions.

What does "masking" mean?

In normal R programming, objects are defined in the current environment, for instance in the global environment or the environment of a function.

factor <- 1000

# Can now use `factor` in computations
mean(mtcars$cyl * factor)

This environment also contains all functions currently in scope. In a script this includes the functions attached with library() calls; in a package, the functions imported from other packages. If evaluation were performed only in the data frame, we'd lose track of the objects and functions necessary to perform computations.

To keep these objects and functions in scope, the data frame is inserted at the bottom of the current chain of environments. It comes first and has precedence over the user environment. In other words, it masks the user environment.
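The chaining can be imitated in base R. This is only a sketch of the idea, not how rlang builds its data masks (those are more elaborate): the columns of the data frame are copied into a new environment whose parent is the user environment. The names user_env, mask, and code are chosen here for illustration.

```r
# An object defined in the user environment
factor <- 1000

# The environment where this code runs plays the role of user environment
user_env <- environment()

# Sketch of a mask: the columns become objects in a new environment
# that sits in front of the user environment
mask <- list2env(as.list(mtcars), parent = user_env)

# `cyl` is found in the mask, `factor` in the user environment, and
# `mean` further up the chain, among the attached packages
code <- quote(mean(cyl * factor))
result <- eval(code, mask)
print(result)
```

Lookup starts in the mask and falls back to the parent environments, which is why both columns and user objects are in scope at the same time.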

Since masking blends the data and the user environment by giving priority to the former, R can sometimes use a data frame column when you really intended to use a local object.

# Defining an env-variable
cyl <- 1000

# Referring to a data-variable
dplyr::summarise(mtcars, mean(cyl))

The tidy eval framework provides the pronouns .data and .env to help disambiguate between the mask and user contexts. It is often a good idea to use these pronouns in production code.

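library(dplyr)  # attaches the `%>%` pipe used below

cyl <- 1000

mtcars %>%
  dplyr::summarise(
    mean_data = mean(.data$cyl),  # the column in the data frame
    mean_env = mean(.env$cyl)     # the object in the user environment
  )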

Read more about this in the rlang topic on the data mask ambiguity (topic_data_mask_ambiguity).

How does data-masking work?

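Data-masking relies on three language features:

- Argument defusal with substitute() (base R) or enquo() and {{ (rlang). R code is defused so that it can be evaluated later on, in an environment enriched with data.

- First-class environments, in which defused R code can be evaluated. Lists and data frames can be converted to environments:

```r
as.environment(mtcars)
#> <environment: 0x7febb17e3468>
```

- Explicit evaluation with eval() (base R) or eval_tidy() (rlang), which resumes the evaluation of defused code:

```r
library(rlang)  # provides expr() and eval_tidy()

expr(1 + 1)

eval(expr(1 + 1))
```

By default eval() and eval_tidy() evaluate in the current environment.

```r
code <- expr(mean(cyl + am))

# Error: the columns of mtcars are not defined in the current environment
eval(code)
```

You can supply an optional list or data frame that will be converted to an environment.

```r
eval(code, mtcars)
```

Evaluation of defused code then occurs in the context of a data mask.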
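Put together, these features are enough to write a small data-masking function. The following is a minimal sketch in base R, similar in spirit to with(); my_with() is a hypothetical name used for illustration, and it lacks the pronouns and injection support that rlang adds.

```r
# Hypothetical data-masking function built from the three features
my_with <- function(data, expr) {
  # Defuse the code supplied by the caller
  code <- substitute(expr)

  # Resume evaluation with the data frame as a mask. The caller's
  # environment is the enclosure, so user objects stay in scope.
  eval(code, data, parent.frame())
}

result <- my_with(mtcars, mean(cyl + am))
print(result)
```

Passing parent.frame() as the enclosure is what keeps objects and functions from the calling environment available behind the data frame.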

History

The tidyverse embraced the data-masking approach in packages like ggplot2 and dplyr and eventually developed its own programming framework in the rlang package. None of this would have been possible without the following landmark developments from S and R authors.

tidyverse/rlang documentation built on Oct. 31, 2024, 5:35 p.m.