extras/small_examples/column_selection.md

column selection

dplyr is inconsistent about which column is selected unless one uses extra notation such as !!, {{}}, .data[[]], and so on. Of course, if using a name or string directly is not the “correct” notation, why is it allowed? Notice how a different column is selected in each example below, depending on which columns are present in the data.frame. The issue is that dplyr does not commit to an unambiguous interpretation of the basic notation; only the longer, more complicated notations have reliable semantics.

library("dplyr")
## 
## Attaching package: 'dplyr'

## The following objects are masked from 'package:stats':
## 
##     filter, lag

## The following objects are masked from 'package:base':
## 
##     intersect, setdiff, setequal, union
y = "x"

data.frame(x = 1) %>%
  select(y)
##   x
## 1 1
data.frame(x = 1, y = 2) %>%
  select(y)
##   y
## 1 2

dplyr notations that are unambiguous include:

data.frame(x = 1) %>%
  select({{y}})
##   x
## 1 1
data.frame(x = 1, y = 2) %>%
  select
## data frame with 0 columns and 1 row
data.frame(x = 1) %>%
  select(!!y)
##   x
## 1 1
data.frame(x = 1, y = 2) %>%
  select(!!y)
##   x
## 1 1
data.frame(x = 1) %>%
  select(!!rlang::enquo(y))
##   x
## 1 1
data.frame(x = 1, y = 2) %>%
  select(!!rlang::enquo(y))
##   x
## 1 1
data.frame(x = 1) %>%
  select(.data[[y]])
##   x
## 1 1
data.frame(x = 1, y = 2) %>%
  select(.data[[y]])
##   x
## 1 1
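
One further spelling, not among the examples above, is the all_of() helper from tidyselect. The sketch below assumes a reasonably recent dplyr that re-exports all_of(); it is only an illustration of passing the column name as a value, not a claim about the preferred dplyr style.

```r
library("dplyr")

y <- "x"  # the column name is held as a value

# all_of() takes a character vector of column names, so the presence of a
# column that happens to be called "y" should not change the result.
r1 <- data.frame(x = 1) %>%
  select(all_of(y))
r2 <- data.frame(x = 1, y = 2) %>%
  select(all_of(y))

print(colnames(r1))
print(colnames(r2))
```

In both cases the selected column is expected to be x, matching the other unambiguous notations.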

But closely related notations don’t work: .data[y] fails even though .data[[y]] succeeds (.data is apparently a mapping from column names to column indices, and not in fact a reference to the incoming data.frame).

data.frame(x = 1) %>%
  select(.data[y])
## `.data[y]` must evaluate to column positions or names, not a list
data.frame(x = 1, y = 2) %>%
  select(.data[y])
## `.data[y]` must evaluate to column positions or names, not a list

R itself does not have this problem. Notice how the column named by the value of y (which turns out to be x) is reliably chosen in all cases. In the [] and [[]] notations columns are always specified by value (never taken from code or variable names), while $ always takes the column name from the code and never from a value.

y = "x"

data.frame(x = 1)[y]
##   x
## 1 1
data.frame(x = 1, y = 2)[y]
##   x
## 1 1
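
The contrast between [[]] and $ can be made concrete with a small base R sketch. The variable names by_value, by_code, and by_string are introduced here only for illustration.

```r
y <- "x"
d <- data.frame(x = 1, y = 2)

# [[ ]] evaluates its argument: y holds "x", so column x is returned.
by_value <- d[[y]]

# $ uses the name exactly as written in the code: column y is returned,
# regardless of what the variable y holds.
by_code <- d$y

# A string literal inside [[ ]] is also just a value: column y is returned.
by_string <- d[["y"]]

print(c(by_value = by_value, by_code = by_code, by_string = by_string))
```

Each notation has exactly one reading, so adding or removing columns from the data.frame cannot change which rule applies.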

rqdatatable also has reliable column selection semantics: columns are always values (not taken from code or variable names).

library("rqdatatable")
## Loading required package: rquery
y = "x"

data.frame(x = 1) %.>% 
  select_columns(., y)
##    x
## 1: 1
data.frame(x = 1, y = 2) %.>% 
  select_columns(., y)
##    x
## 1: 1
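
The same "columns are always values" discipline can be sketched in plain base R. The helper select_by_value below is hypothetical (it is not part of rquery or rqdatatable) and is only meant to show why a value-only interface leaves no room for a data column to shadow the argument.

```r
# Hypothetical helper, for illustration only: columns are accepted solely as
# a character vector of names, never as unevaluated code.
select_by_value <- function(d, columns) {
  stopifnot(is.data.frame(d), is.character(columns))
  missing_cols <- setdiff(columns, colnames(d))
  if (length(missing_cols) > 0) {
    stop("unknown columns: ", paste(missing_cols, collapse = ", "))
  }
  # drop = FALSE keeps the result a data.frame even for a single column
  d[, columns, drop = FALSE]
}

y <- "x"
r1 <- select_by_value(data.frame(x = 1), y)
r2 <- select_by_value(data.frame(x = 1, y = 2), y)

print(colnames(r1))
print(colnames(r2))
```

Because the function only ever sees the value "x", both calls are expected to return column x; asking for a column that is not present stops with an error instead of silently choosing something else.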


WinVector/rquery documentation built on Aug. 24, 2023, 11:12 a.m.