title: "Invariants for subsetting and subassignment"
output: rmarkdown::html_vignette
vignette: > %\VignetteIndexEntry{invariants} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8}
This vignette defines invariants for subsetting and subset-assignment
for tibbles, and illustrates where their behavior differs from data
frames. The goal is to define a small set of invariants that
consistently define how behaviors interact. Some behaviors are defined
using functions of the vctrs package, e.g. `vec_slice()`,
`vec_recycle()` and `vec_as_index()`. Refer to their documentation for
more details about the invariants that they follow.
The subsetting and subassignment operators for data frames and tibbles
are particularly tricky, because they support both row and column
indexes, both of which are optionally missing. We resolve this by first
defining column access with `[[` and `$`, then column-wise subsetting
with `[`, then row-wise subsetting, then the composition of both.
In this article, all behaviors are demonstrated using one example data frame and its tibble equivalent:
library(vctrs)
library(tibble)
new_df <- function() {
  df <- data.frame(n = c(1L, NA, 3L, NA))
  df$c <- letters[5:8]
  df$li <- list(9, 10:11, 12:14, "text")
  df
}
new_tbl <- function() {
  as_tibble(new_df())
}
Results of the same code for data frames and tibbles are presented side by side:
new_df()
#>    n c         li
#> 1  1 e          9
#> 2 NA f     10, 11
#> 3  3 g 12, 13, 14
#> 4 NA h       text
new_tbl()
#> # A tibble: 4 x 3
#>       n c     li
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2    NA f     <int [2]>
#> 3     3 g     <int [3]>
#> 4    NA h     <chr [1]>
If the results are identical (after converting to a data frame if necessary), only the tibble result is shown.
Subsetting operations are read-only. The same objects are reused in all examples:
df <- new_df()
tbl <- new_tbl()
Where needed, we also show examples with hierarchical columns containing a data frame or a matrix:
new_tbl2 <- function() {
  tibble(
    tb = tbl,
    m = diag(4)
  )
}
new_df2 <- function() {
  df2 <- new_tbl2()
  class(df2) <- "data.frame"
  class(df2$tb) <- "data.frame"
  df2
}
df2 <- new_df2()
tbl2 <- new_tbl2()
new_tbl()
#> # A tibble: 4 x 3
#>       n c     li
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2    NA f     <int [2]>
#> 3     3 g     <int [3]>
#> 4    NA h     <chr [1]>
For subset assignment (subassignment, for short), we need a fresh copy
of the data for each test. The `with_*()` functions (omitted here for
brevity) allow for a more concise notation. These functions take an
assignment expression, execute it on a fresh copy of the data, and
return the data for printing. The first example prints what is really
executed; further examples omit this output.
x[[j]]
`x[[j]]` is equal to `.subset2(x, j)`.
NB: `x[[j]]` always returns an object of size `nrow(x)` if the column
exists.
`j` must be a single number or a string, as enforced by
`.subset2(x, j)`.
`NA` indexes, numeric out-of-bounds (OOB) values, and non-integers throw
an error.
Character OOB access is silent because a common package idiom is to
check for the absence of a column with `is.null(df[[var]])`.
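The rules for `x[[j]]` can be checked with a short sketch. The tibble `tbl_demo` and the result names below are illustrative and not part of the vignette; `try()` is used only so that the expected error does not stop the script.

```r
library(tibble)

# Illustrative data (assumption: a cut-down version of the vignette's `tbl`)
tbl_demo <- tibble(n = c(1L, NA, 3L, NA), c = letters[5:8])

# x[[j]] agrees with .subset2(x, j) for an existing column
same_as_subset2 <- identical(tbl_demo[["c"]], .subset2(tbl_demo, "c"))

# Character OOB access is silent: the result is NULL
missing_col <- tbl_demo[["does_not_exist"]]

# Numeric OOB access throws; record whether an error was raised
numeric_oob_fails <- inherits(try(tbl_demo[[10]], silent = TRUE), "try-error")
```

The `NULL` result for a missing name is what makes the `is.null(df[[var]])` idiom work.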
x$name
`x$name` and `x$"name"` are equal to `x[["name"]]`.
Unlike data frames, tibbles do not partially match names. Because `df$x`
is rarely used in packages, accessing a missing column this way can
raise a warning.
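A minimal sketch of the difference in partial matching, assuming default R options. The objects `df_demo` and `tbl_demo` are illustrative; the warning from the tibble is caught with a calling handler so that the script keeps running.

```r
library(tibble)

df_demo <- data.frame(name = 1:3)
tbl_demo <- as_tibble(df_demo)

# Data frames partially match `na` to the column `name`
df_partial <- df_demo$na

# Tibbles do not: the result is NULL and a warning is signalled
tbl_warned <- FALSE
tbl_partial <- withCallingHandlers(
  tbl_demo$na,
  warning = function(w) {
    tbl_warned <<- TRUE
    invokeRestart("muffleWarning")
  }
)
```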
x[j]
`j` is converted to an integer vector by
`vec_as_index(j, ncol(x), names = names(x))`. Then
`x[c(j_1, j_2, ..., j_n)]` is equivalent to
`tibble(x[[j_1]], x[[j_2]], ..., x[[j_n]])`, keeping the corresponding
column names. This implies that `j` must be a numeric or character
vector, or a logical vector with length 1 or `ncol(x)`.[1]
When subsetting with repeated indexes, the resulting column names are undefined; do not rely on them.
df[c(1, 1)]
#>    n n.1
#> 1  1   1
#> 2 NA  NA
#> 3  3   3
#> 4 NA  NA
tbl[c(1, 1)]
#> # A tibble: 4 x 2
#>       n     n
#>   <int> <int>
#> 1     1     1
#> 2    NA    NA
#> 3     3     3
#> 4    NA    NA
For tibbles with repeated column names, subsetting by name uses the first matching column.
`nrow(df[j])` equals `nrow(df)`.
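As a sketch of column subsetting, the three index types below select the same columns, and the result matches a tibble assembled from `[[`. The names `tbl_demo`, `by_position`, `by_name`, `by_logical` and `by_hand` are illustrative only.

```r
library(tibble)

tbl_demo <- tibble(
  n = c(1L, NA, 3L, NA),
  c = letters[5:8],
  li = list(9, 10:11, 12:14, "text")
)

# Numeric, character and logical column indexes
by_position <- tbl_demo[c(1, 3)]
by_name <- tbl_demo[c("n", "li")]
by_logical <- tbl_demo[c(TRUE, FALSE, TRUE)]

# The same selection assembled column by column with `[[`
by_hand <- tibble(n = tbl_demo[[1]], li = tbl_demo[[3]])
```

In every case the number of rows is unchanged; only the set of columns differs.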
Tibbles support indexing by a logical matrix, but only if all values in the returned vector are compatible.
df[is.na(df)]
#> [[1]]
#> [1] NA
#>
#> [[2]]
#> [1] NA
tbl[is.na(tbl)]
#> [1] NA NA
df[!is.na(df)]
#> [[1]]
#> [1] 1
#>
#> [[2]]
#> [1] 3
#>
#> [[3]]
#> [1] "e"
#>
#> [[4]]
#> [1] "f"
#>
#> [[5]]
#> [1] "g"
#>
#> [[6]]
#> [1] "h"
#>
#> [[7]]
#> [1] 9
#>
#> [[8]]
#> [1] 10 11
#>
#> [[9]]
#> [1] 12 13 14
#>
#> [[10]]
#> [1] "text"
tbl[!is.na(tbl)]
#> Error: No common type for `n` <integer>
#> and `c` <character>.
x[, j]
`x[, j]` is equal to `x[j]`. Tibbles do not perform column extraction if
`x[j]` would yield a single column.
x[, j, drop = TRUE]
For backward compatibility, `x[, j, drop = TRUE]` performs column
extraction, returning `x[j][[1]]` when `ncol(x[j])` is 1.
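The effect of `drop` on a single-column selection can be sketched as follows. `df_demo` and `tbl_demo` are illustrative objects, not the vignette's data.

```r
library(tibble)

df_demo <- data.frame(n = 1:4, c = letters[5:8])
tbl_demo <- as_tibble(df_demo)

# A data frame extracts the column by default
df_col <- df_demo[, "n"]

# A tibble keeps the one-column tibble
tbl_col <- tbl_demo[, "n"]

# Column extraction has to be requested explicitly
tbl_dropped <- tbl_demo[, "n", drop = TRUE]
```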
x[i, ]
`x[i, ]` is equal to
`tibble(vec_slice(x[[1]], i), vec_slice(x[[2]], i), ...)`.[2]
This means that `i` must be a numeric vector, or a logical vector of
length `nrow(x)` or 1. For compatibility, `i` can also be a character
vector containing positive numbers.
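The definition in terms of `vec_slice()` can be written out by hand. This is a sketch with illustrative names (`tbl_demo`, `sliced`, `by_hand`); it assumes the vctrs and tibble packages are installed.

```r
library(vctrs)
library(tibble)

tbl_demo <- tibble(
  n = c(1L, NA, 3L, NA),
  c = letters[5:8],
  li = list(9, 10:11, 12:14, "text")
)

i <- c(1L, 3L)

# Row subsetting with `[`
sliced <- tbl_demo[i, ]

# The same rows, sliced column by column
by_hand <- tibble(
  n = vec_slice(tbl_demo$n, i),
  c = vec_slice(tbl_demo$c, i),
  li = vec_slice(tbl_demo$li, i)
)
```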
Exception: OOB values generate warnings instead of errors:
df[10, ]
#>     n    c   li
#> NA NA <NA> NULL
tbl[10, ]
#> Warning: Row indexes must be between 0
#> and the number of rows (4). Use `NA` as
#> row index to obtain a row full of `NA`
#> values.
#> # A tibble: 1 x 3
#>       n c     li
#>   <int> <chr> <list>
#> 1    NA <NA>  <NULL>
df["x", ]
#>     n    c   li
#> NA NA <NA> NULL
tbl["x", ]
#> Warning: Only valid row names can be
#> used for indexing. Use `NA` as row index
#> to obtain a row full of `NA` values.
#> # A tibble: 1 x 3
#>       n c     li
#>   <int> <chr> <list>
#> 1    NA <NA>  <NULL>
Unlike data frames, only logical vectors of length 1 are recycled.
df[c(TRUE, FALSE), ]
#>   n c         li
#> 1 1 e          9
#> 3 3 g 12, 13, 14
tbl[c(TRUE, FALSE), ]
#> Error: Logical indices must have
#> length 1 or be as long as the
#> indexed vector.
#> The vector has size 4 whereas the
#> index has size 2.
NB: scalar logicals are recycled, but scalar numerics are not. That
makes `x[NA, ]` and `x[NA_integer_, ]` return different results.
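The difference between the two kinds of `NA` index shows up in the number of rows returned. A small sketch, with `tbl_demo` as an illustrative tibble:

```r
library(tibble)

tbl_demo <- tibble(n = c(1L, NA, 3L, NA), c = letters[5:8])

# A bare NA is logical, so it is recycled to every row
rows_lgl_na <- nrow(tbl_demo[NA, ])

# NA_integer_ is a scalar numeric index, so it selects a single row
rows_int_na <- nrow(tbl_demo[NA_integer_, ])
```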
x[i, , drop = TRUE]
`drop = TRUE` has no effect when not selecting a single row.
x[] and x[,]
`x[]` and `x[,]` are equivalent to `x`.[3]
x[i, j]
`x[i, j]` is equal to `x[i, ][j]`.[4]
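A quick check of this composition, using an illustrative tibble `tbl_demo`:

```r
library(tibble)

tbl_demo <- tibble(
  n = c(1L, NA, 3L, NA),
  c = letters[5:8],
  li = list(9, 10:11, 12:14, "text")
)

# Rows and columns selected in one call
cell_block <- tbl_demo[2:3, c("n", "c")]

# Rows first, then columns
composed <- tbl_demo[2:3, ][c("n", "c")]
```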
x[[i, j]]
`i` must be a numeric vector of length 1. `x[[i, j]]` is equal to
`x[i, ][[j]]`.[5]
This implies that `j` must be a numeric or character vector of length 1.
NB: `vec_size(x[[i, j]])` always equals 1. Unlike `x[i, ]`, `x[[i, ]]`
is not valid.
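Cell access on an atomic column can be sketched like this. The tibble `tbl_demo` is illustrative; list columns are left out to keep the comparison simple.

```r
library(tibble)

tbl_demo <- tibble(n = c(1L, NA, 3L, NA), c = letters[5:8])

# One cell, addressed by row and column
cell <- tbl_demo[[2, "c"]]

# The same cell via row subsetting followed by column extraction
cell_composed <- tbl_demo[2, ][["c"]]
```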
x[[j]] <- a
If `a` is a vector then `x[[j]] <- a` replaces the `j`th column with
value `a`.
`a` is recycled to the same size as `x` so must have size `nrow(x)` or
1. (The only exception is when `a` is `NULL`, as described below.)
Recycling also works for list, data frame, and matrix columns.
`j` must be a scalar numeric or a string, and cannot be `NA`. If `j` is
OOB, a new column is added on the right hand side, with name repair if
needed.
`df[[j]] <- a` replaces the complete column so can change the type.
`[[<-` supports removing a column by assigning `NULL` to it.
Removing a nonexistent column is a no-op.
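The following sketch walks through these rules in order: recycling with a type change, appending, removing, and the no-op. `tbl_demo` is an illustrative tibble that is modified in place.

```r
library(tibble)

tbl_demo <- tibble(n = c(1L, NA, 3L, NA), c = letters[5:8])

# A size-one value is recycled; the column type changes to double
tbl_demo[["n"]] <- 0

# A new name appends a column on the right
tbl_demo[["x"]] <- 4:1

# Assigning NULL removes a column
tbl_demo[["c"]] <- NULL

# Removing a column that does not exist changes nothing
tbl_demo[["q"]] <- NULL
```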
with_tbl(tbl[["q"]] <- NULL)
#> # A tibble: 4 x 3
#>       n c     li
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2    NA f     <int [2]>
#> 3     3 g     <int [3]>
#> 4    NA h     <chr [1]>
x$name <- a
`x$name <- a` and `x$"name" <- a` are equivalent to
`x[["name"]] <- a`.[6]
`$<-` does not perform partial matching.
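A minimal illustration: assigning to a name that is only a prefix of an existing column creates a new column instead of updating the existing one. `tbl_demo` is an illustrative tibble.

```r
library(tibble)

tbl_demo <- tibble(name = 1:3)

# `na` is a prefix of `name`, but no partial matching takes place
tbl_demo$na <- 0
```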
x[j] <- a
If `j` is missing, it is replaced with `seq_along(x)`.
If `j` is a logical vector, it is converted to numeric with
`seq_along(x)[j]`.
a is a list or data frame
If `inherits(a, "list")` or `inherits(a, "data.frame")` is `TRUE`, then
`x[j] <- a` is equivalent to `x[[j[[1]]]] <- a[[1]]`,
`x[[j[[2]]]] <- a[[2]]`, …
If `length(a)` equals 1, then it is recycled to the same length as `j`.
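The equivalence with repeated `[[<-` calls, and the recycling of a length-one list, can be sketched as follows. The objects `tbl_a`, `tbl_b` and `tbl_c` are illustrative copies of the same small tibble.

```r
library(tibble)

tbl_a <- tibble(n = c(1L, NA, 3L, NA), c = letters[5:8])
tbl_b <- tbl_a
tbl_c <- tbl_a

# One assignment with a list on the right hand side ...
tbl_a[c("n", "c")] <- list(4:1, "z")

# ... matches one `[[<-` call per column
tbl_b[["n"]] <- 4:1
tbl_b[["c"]] <- "z"

# A list of length 1 is recycled across all columns in `j`
tbl_c[c("n", "c")] <- list(0L)
```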
An attempt to update the same column twice gives an error.
with_df(df[c(1, 1)] <- list(1, 2))
#> Error in `[<-.data.frame`(`*tmp*`,
#> c(1, 1), value = list(1, 2)):
#> duplicate subscripts for columns
with_tbl(tbl[c(1, 1)] <- list(1, 2))
#> Error: Column index 1 is used more
#> than once for assignment.
If `a` contains `NULL` values, the corresponding columns are removed
after updating (i.e. position indexes refer to columns before any
modifications).
`NA` indexes are not supported.
Just like column updates, `[<-` supports changing the type of an
existing column.
Appending columns at the end (without gaps) is supported. The name of new columns is determined by the LHS, the RHS, or by name repair (in that order of precedence).
with_tbl(tbl[c("x", "y")] <- tibble("x", x = 4:1))
#> # A tibble: 4 x 5
#>       n c     li        x         y
#>   <int> <chr> <list>    <chr> <int>
#> 1     1 e     <dbl [1]> x         4
#> 2    NA f     <int [2]> x         3
#> 3     3 g     <int [3]> x         2
#> 4    NA h     <chr [1]> x         1
with_tbl(tbl[3:4] <- list("x", x = 4:1))
#> # A tibble: 4 x 4
#>       n c     li        x
#>   <int> <chr> <chr> <int>
#> 1     1 e     x         4
#> 2    NA f     x         3
#> 3     3 g     x         2
#> 4    NA h     x         1
with_df(df[4] <- list(4:1))
#>    n c         li V4
#> 1  1 e          9  4
#> 2 NA f     10, 11  3
#> 3  3 g 12, 13, 14  2
#> 4 NA h       text  1
with_tbl(tbl[4] <- list(4:1))
#> # A tibble: 4 x 4
#>       n c     li         ...4
#>   <int> <chr> <list>    <int>
#> 1     1 e     <dbl [1]>     4
#> 2    NA f     <int [2]>     3
#> 3     3 g     <int [3]>     2
#> 4    NA h     <chr [1]>     1
with_df(df[5] <- list(4:1))
#> Error in `[<-.data.frame`(`*tmp*`,
#> 5, value = list(4:1)): new columns
#> would leave holes after existing
#> columns
with_tbl(tbl[5] <- list(4:1))
#> Error: Can't assign column 5 in a
#> tibble with 3 columns.
Tibbles support indexing by a logical matrix, but only for a scalar RHS, and if all columns updated are compatible with the value assigned.
with_df(df[is.na(df)] <- 4)
#>   n c         li
#> 1 1 e          9
#> 2 4 f     10, 11
#> 3 3 g 12, 13, 14
#> 4 4 h       text
with_tbl(tbl[is.na(tbl)] <- 4)
#> # A tibble: 4 x 3
#>       n c     li
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2     4 f     <int [2]>
#> 3     3 g     <int [3]>
#> 4     4 h     <chr [1]>
with_df(df[is.na(df)] <- 1:2)
#>   n c         li
#> 1 1 e          9
#> 2 1 f     10, 11
#> 3 3 g 12, 13, 14
#> 4 2 h       text
with_tbl(tbl[is.na(tbl)] <- 1:2)
#> Error in tbl_subassign_matrix(x, j,
#> value): vec_is(value, size = 1) is
#> not TRUE
with_df(df[matrix(c(rep(TRUE, 5), rep(FALSE, 7)), ncol = 3)] <- 4)
#>   n c         li
#> 1 4 4          9
#> 2 4 f     10, 11
#> 3 4 g 12, 13, 14
#> 4 4 h       text
with_tbl(tbl[matrix(c(rep(TRUE, 5), rep(FALSE, 7)), ncol = 3)] <- 4)
#> Error: No common type for `value` <double>
#> and `x` <character>.
a is another type of vector
If `vec_is(a)`, then `x[j] <- a` is equivalent to `x[j] <- list(a)`.
This is primarily provided for backward compatibility.
Matrices are vectors, so they are also wrapped in `list()` before
assignment. This consistently creates matrix columns, unlike data
frames, which create matrix columns when assigning to one column, but
treat the matrix like a data frame when assigning to more than one
column.
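For a plain atomic vector, the wrapping in `list()` can be sketched as below. `tbl_a` and `tbl_b` are illustrative copies of the same tibble; matrix values are deliberately left out of this sketch.

```r
library(tibble)

tbl_a <- tibble(n = c(1L, NA, 3L, NA), c = letters[5:8])
tbl_b <- tbl_a

# A bare vector on the right hand side ...
tbl_a["n"] <- 4:1

# ... behaves like the same vector wrapped in a list
tbl_b["n"] <- list(4:1)
```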
a is not a vector
Any other type for `a` is an error. Note that if `is.list(a)` is `TRUE`,
but `inherits(a, "list")` is `FALSE`, then `a` is considered to be a
scalar. See `?vec_is` and `?vec_proxy` for details.
x[i, ] <- list(...)
`x[i, ] <- a` is the same as `vec_slice(x[[j_1]], i) <- a[[1]]`,
`vec_slice(x[[j_2]], i) <- a[[2]]`, … .[7]
Only values of size one can be recycled.
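A sketch of row assignment with recycling, using an illustrative tibble without list columns. `try()` is used only to record that the mismatched assignment fails.

```r
library(tibble)

tbl_demo <- tibble(n = c(1L, NA, 3L, NA), c = letters[5:8])

# A single row is recycled to fill rows 2 and 3
tbl_demo[2:3, ] <- tbl_demo[1, ]

# Two rows cannot be recycled to fill three rows
tbl_tmp <- tbl_demo
recycle_fails <- inherits(
  try(tbl_tmp[2:4, ] <- tbl_tmp[1:2, ], silent = TRUE),
  "try-error"
)
```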
with_tbl(tbl[2:3, ] <- tbl[1, ])
#> # A tibble: 4 x 3
#>       n c     li
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2     1 e     <dbl [1]>
#> 3     1 e     <dbl [1]>
#> 4    NA h     <chr [1]>
with_tbl(tbl[2:3, ] <- list(tbl$n[1], tbl$c[1:2], tbl$li[1]))
#> # A tibble: 4 x 3
#>       n c     li
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2     1 e     <dbl [1]>
#> 3     1 f     <dbl [1]>
#> 4    NA h     <chr [1]>
with_df(df[2:4, ] <- df[1:2, ])
#> Error in `[<-.data.frame`(`*tmp*`,
#> 2:4, , value = structure(list(n =
#> c(1L, : replacement element 1 has 2
#> rows, need 3
with_tbl(tbl[2:4, ] <- tbl[1:2, ])
#> Error: Vector of length 2 cannot be
#> recycled to length 3. Only vectors
#> of length one can be recycled.
with_df2(df2[2:4, ] <- df2[1, ])
#> Error in `[<-.data.frame`(`*tmp*`,
#> 2:4, , value = structure(list(tb =
#> structure(list(: replacement element
#> 1 is a matrix/data frame of 1 row,
#> need 3
with_tbl2(tbl2[2:4, ] <- tbl2[1, ])
#> # A tibble: 4 x 2
#>    tb$n $c    $li       m[,1]  [,2]  [,3]
#>   <int> <chr> <list>    <dbl> <dbl> <dbl>
#> 1     1 e     <dbl [1]>     1     0     0
#> 2     1 e     <dbl [1]>     1     0     0
#> 3     1 e     <dbl [1]>     1     0     0
#> 4     1 e     <dbl [1]>     1     0     0
#> # … with 1 more variable: [,4] <dbl>
with_df2(df2[2:4, ] <- df2[2:3, ])
#> Error in `[<-.data.frame`(`*tmp*`,
#> 2:4, , value = structure(list(tb =
#> structure(list(: replacement element
#> 1 is a matrix/data frame of 2 rows,
#> need 3
with_tbl2(tbl2[2:4, ] <- tbl2[2:3, ])
#> Error: Vector of length 2 cannot be
#> recycled to length 3. Only vectors
#> of length one can be recycled.
For compatibility, only a warning is issued for indexing beyond the number of rows. Appending rows right at the end of the existing data is supported, without warning.
with_tbl(tbl[5, ] <- tbl[1, ])
#> # A tibble: 5 x 3
#>       n c     li
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2    NA f     <int [2]>
#> 3     3 g     <int [3]>
#> 4    NA h     <chr [1]>
#> 5     1 e     <dbl [1]>
with_tbl(tbl[5:7, ] <- tbl[1, ])
#> # A tibble: 7 x 3
#>       n c     li
#>   <int> <chr> <list>
#> 1     1 e     <dbl [1]>
#> 2    NA f     <int [2]>
#> 3     3 g     <int [3]>
#> 4    NA h     <chr [1]>
#> 5     1 e     <dbl [1]>
#> 6     1 e     <dbl [1]>
#> 7     1 e     <dbl [1]>
with_df(df[6, ] <- df[1, ])
#>    n    c         li
#> 1  1    e          9
#> 2 NA    f     10, 11
#> 3  3    g 12, 13, 14
#> 4 NA    h       text
#> 5 NA <NA>       NULL
#> 6  1    e          9
with_tbl(tbl[6, ] <- tbl[1, ])
#> Error: Can't assign row 6 in a
#> tibble with 4 rows.
with_df(df[-5, ] <- df[1, ])
#>   n c li
#> 1 1 e  9
#> 2 1 e  9
#> 3 1 e  9
#> 4 1 e  9
with_tbl(tbl[-5, ] <- tbl[1, ])
#> Error: Must index existing elements.
#> x Can't subset position 5.
#> ℹ There are only 4 elements.
with_df(df[-(5:7), ] <- df[1, ])
#>   n c li
#> 1 1 e  9
#> 2 1 e  9
#> 3 1 e  9
#> 4 1 e  9
with_tbl(tbl[-(5:7), ] <- tbl[1, ])
#> Error: Must index existing elements.
#> x Can't subset positions 5, 6 and 7.
#> ℹ There are only 4 elements.
with_df(df[-6, ] <- df[1, ])
#>   n c li
#> 1 1 e  9
#> 2 1 e  9
#> 3 1 e  9
#> 4 1 e  9
with_tbl(tbl[-6, ] <- tbl[1, ])
#> Error: Must index existing elements.
#> x Can't subset position 6.
#> ℹ There are only 4 elements.
For compatibility, `i` can also be a character vector containing
positive numbers.
x[i, j] <- a
`x[i, j] <- a` is equivalent to `x[i, ][j] <- a`.[8]
Subassignment to `x[i, j]` is stricter for tibbles than for data frames.
`x[i, j] <- a` can't change the data type of existing columns.
For new columns, `x[i, j] <- a` fills the unassigned rows with `NA`.
Likewise, for new rows, `x[i, j] <- a` fills the unassigned columns with
`NA`.
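The `NA` filling for new columns and new rows can be sketched as follows. `tbl_cols` and `tbl_rows` are illustrative tibbles; the type-change restriction is not exercised here.

```r
library(tibble)

tbl_cols <- tibble(n = c(1L, NA, 3L, NA), c = letters[5:8])
tbl_rows <- tbl_cols

# New column `x`: rows 1 and 4 are not assigned and become NA
tbl_cols[2:3, "x"] <- 1

# New row 5: column `c` is not assigned and becomes NA
tbl_rows[5, "n"] <- 9L
```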
x[[i, j]] <- a
`i` must be a numeric vector of length 1. `x[[i, j]] <- a` is equivalent
to `x[i, ][[j]] <- a`.[9]
NB: `vec_size(a)` must equal 1. Unlike `x[i, ] <-`, `x[[i, ]] <-` is not
valid.
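Updating a single cell might look like this, with `tbl_demo` as an illustrative tibble:

```r
library(tibble)

tbl_demo <- tibble(n = c(1L, NA, 3L, NA), c = letters[5:8])

# Replace the cell in row 2 of column `c` with a value of size 1
tbl_demo[[2, "c"]] <- "z"
```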
[1] `x[j][[jj]]` is equal to `x[[ j[[jj]] ]]`, in particular `x[j][[1]]`
is equal to `x[[j]]` for scalar numeric or integer `j`.
[2] Row subsetting `x[i, ]` is not defined in terms of `x[[j]][i]`
because that definition does not generalise to matrix and data frame
columns. For efficiency and backward compatibility, `i` is converted to
an integer vector by `vec_as_index(i, nrow(x))` first.
[3] `x[,]` is equivalent to `x[]` because `x[, j]` is equivalent to
`x[j]`.
[4] A more efficient implementation of `x[i, j]` would forward to
`x[j][i, ]`.
[5] Cell subsetting `x[[i, j]]` is not defined in terms of `x[[j]][[i]]`
because that definition does not generalise to list, matrix and data
frame columns. A more efficient implementation of `x[[i, j]]` would
check that `j` is a scalar and forward to `x[i, j][[1]]`.
[6] `$` behaves almost completely symmetrically to `[[` when comparing
subsetting and subassignment.
[7] `x[i, ]` is symmetrical for subsetting and subassignment.
[8] `x[i, j]` is symmetrical for subsetting and subassignment. A more
efficient implementation of `x[i, j] <- a` would forward to
`x[j][i, ] <- a`.
[9] `x[[i, j]]` is symmetrical for subsetting and subassignment. An
efficient implementation would check that `i` and `j` are scalar and
forward to `x[i, j][[1]] <- a`.