library("table.express") library("data.table") # knitr defaults knitr::opts_chunk$set(comment = "#>") options("datatable.print.nrows" = 4L) options("datatable.print.topn" = 2L)
The goal of this package is to offer an alternative way of expressing common operations with
data.table
without sacrificing the performance optimizations that it offers.
The foundation for the data manipulation verbs is the
dplyr
package,
which also advocates the piping operator from the
magrittr
package.
The rlang
package powers most of this package's functionality,
which means that
tidy evaluation
can also be supported.
There are other resources that provide comprehensive descriptions of these packages,
so they will not be explained here.
Even though using data manipulation verbs can improve expressiveness in some cases,
this is not always true,
so using the traditional data.table
syntax might still be preferable in many situations.
Unfortunately, it is not entirely straightforward to use the verbs without considerable knowledge of the syntax understood by data.table
,
as well as its optimizations
and special symbols.
We will return to the importance of this at the end.
All verbs defined in this package essentially parse the input given to them and build expressions that data.table
understands.
These expressions are used in a data.table
frame,
i.e. in a call similar to DT[i, j, by, ...]
where DT
is a data.table
.
The frame's main elements are the 3 shown:
i
, j
, and by
.
Let us first list the available single-table verbs and their effects in the frame
(two-table joining verbs have their own vignette):
i
:arrange
/order_by
filter
filter_on
filter_sd
where
max_by
min_by
j
:select
mutate
mutate_sd
transmute
transmute_sd
group_by
to specify by
key_by
to specify keyby
distinct
specifies j
and by
The where
verb is equivalent to dplyr
's filter
,
but is always lazy,
i.e. it expects another verb to be used afterwards.
The order_by
verb is simply an alias for arrange
.
Some verbs can/must be combined with others,
which means that some expressions must be built lazily.
It is possible to work in an entirely lazy manner (see below),
but, as of version 0.3.0, it is no longer necessary
(do note that this makes table.express
conflict with
dtplyr
).
Nevertheless, in order to enable some of the expressions to be built lazily,
the order in which the verbs are used becomes important:
group_by
/key_by
should be used first.i
should be used next,
usually specifying whether they should evaluate the frame or not.j
should be used last.The examples here explain this in more detail with the mtcars
data:
data("mtcars") DT <- mtcars %>% as.data.table %T>% print
The foundation for this package is building expressions that are almost entirely delegated to data.table
.
In order to explicitly show when an expression is being built and subsequently evaluated,
we can use 3 delimiters:
start_expr
end_expr
chain
These also serve as visual reminders that we are not dealing directly with data.table
s during the process.
We capture the input data.table
and start the process with start_expr
,
and evaluate the final expression with end_expr
.
Using chain
is equivalent to calling end_expr
immediately followed by start_expr
.
In order to print more information regarding the expression-building process,
we can either set options(table.express.verbose = TRUE)
,
or pass .verbose = TRUE
to start_expr
or chain
.
Let's start with a lazy example to get an idea of how the built expressions look like.
The arrange
/order_by
verbs add an expression with order
to the frame,
and let data.table
handle it as usual:
DT %>% start_expr %>% order_by(mpg, -cyl)
We see here that the built expression includes a .DT_
pronoun.
When the expression is evaluated,
the captured data.table
is assigned to the evaluation environment as said pronoun.
When evaluating eagerly,
this whole process happens in the background:
DT %>% order_by(mpg, -cyl)
Even though selecting a subset of columns is a common operation,
it may be undesirable to do so when working with data.table
s because it always leads to deep data copies.
Given that data.table
supports modification by reference,
these copies are necessary.
For example, the following would require copying all columns between mpg
and disp
,
since any of them could be modified by reference afterwards.
DT[, mpg:disp][, cyl := as.integer(cyl)]
Therefore, it might be preferable to use mutation verbs to remove columns that are not of interest.
With that said,
the select
verb can be used as usual:
DT %>% select(mpg, am)
To maintain consistency,
even single columns are kept as data.table
s:
DT %>% select(mpg)
Calls to tidyselect
's helpers or to :
are handled specially internally:
DT %>% select(mpg:disp, gear:carb) DT %>% select(contains("M", ignore.case = TRUE))
Tidy evaluation and the .parse
argument can also aid in cases where certain parts of the frame were computed programmatically:
selected <- c("mpg", "am") DT %>% select(!!!selected, .parse = TRUE)
The transmute
verb simply wraps everything in its ellipsis in a call to list
and assigns the expression to the frame's j
.
DT %>% transmute(foo = mpg * 2, bar = exp(cyl))
Even though select
can be used for the same
(given the way data.table
handles j
),
the simplicity of transmute
's internals makes it preferable when no tidyselect
helpers are needed,
since it avoids extra function calls and can reduce overhead significantly in special circumstances.
Because of the way data.table
handles j
,
transmute
could be used to summarize columns.
However, the summarize
/summarise
verbs can be more expressive,
and they check whether the summary value actually has length 1.
The mutate
verb builds an expression with :=
in order to perform assignment by reference by default.
DT %>% start_expr %>% mutate(mpg = mpg / 2, hp = log(hp)) DT %>% copy %>% mutate(mpg = mpg / 2, hp = log(hp)) %>% { invisible(print(.)) }
It can also support expressions that already contain :=
:
new_vars <- c("x", "y") DT %>% copy %>% mutate(!!new_vars := .(1, 2), .unquote_names = FALSE) %>% { invisible(print(.)) }
The where
/filter
verbs work with the i
part of the frame:
DT %>% filter(vs == 1L, carb > 2L) DT %>% where(vs == 1L, carb > 2L) %>% transmute(mean_mpg = mean(mpg))
As mentioned in the beginning,
where
can be used as a lazy version of filter
,
letting us add other clauses to the same frame.
The helper verb filter_sd
can be used to apply the same conditions to many columns,
and it can use a special pronoun .COL
while specifying the expression,
lambdas as formulas,
as well as tidyselect
helpers to choose .SDcols
:
DT %>% filter_sd(c("mpg", "qsec"), `>`, 20) DT %>% filter_sd(c("mpg", "qsec"), .COL > 20) DT %>% start_expr %>% filter_sd(contains("m"), ~ .x > 0) # like dplyr's filter_if DT %>% start_expr %>% filter_sd(~ is.numeric(.x), .x > 0)
In order to enable combination of filtering helpers with other verbs,
the .expr
parameter can be used to indicate that the expression should be kept,
delaying evaluation until a verb that sets j
is used:
DT %>% copy %>% filter_sd(c("vs", "am"), .COL > 0, .expr = TRUE) %>% mutate(some_flag = TRUE) %>% { invisible(print(.)) }
The filter_on
verb can be used to build an expression that specifies the on
argument of the frame.
It accepts key-value pairs where each key is a column in the data,
and each value is the corresponding value that the column should have to match:
DT %>% start_expr %>% filter_on(cyl = 6, am = 0) %T>% print %>% end_expr DT %>% filter_on(cyl = 6, am = 0, mult = "first")
In order to support functionality similar to data.table
's DT[, lapply(.SD, fun), .SDcols = c("...")]
syntax,
2 verbs are provided: mutate_sd
and transmute_sd
.
Starting with mutate_sd
,
it modifies columns in .SDcols
by reference,
and columns that are not part of .SDcols
are kept:
DT %>% mutate_sd(c("mpg", "cyl"), exp) print(DT) DT %>% mutate_sd(c("mpg", "cyl"), log) print(DT)
It would also be possible to remove several columns with mutate_sd
:
DT %>% copy %>% mutate_sd(c("mpg", "cyl"), NULL) %>% { invisible(print(.)) }
Additionally, mutate_sd
supports the special .COL
pronoun that symbolizes the column that should be modified,
as well as lambdas as formulas:
# like dplyr's mutate_if DT %>% start_expr %>% mutate_sd(all(.COL %% 1 == 0), as.integer) DT %>% start_expr %>% mutate_sd(~ all(.x %% 1 == 0), as.integer)
On the other hand, transmute_sd
never modifies by reference,
and supports special expressions to "build" what is chosen as .SDcols
.
These expressions can use tidyselect
helpers,
as well as another special pronoun: .COLNAME
:
DT %>% transmute_sd(starts_with("d"), .COL * 2) DT %>% transmute_sd(grepl("^d", .COLNAME), .COL * 2) # like dplyr's transmute_if DT %>% start_expr %>% transmute_sd(~ is.numeric(.x), ~ .x * 2)
Both verbs also support a list of functions for multiple transformations,
but mutate_sd
performs pairwise transformations by default,
whereas transmute_sd
performs all combinations:
DT %>% copy %>% mutate_sd(contains("m"), .(round, -1L)) %>% { invisible(print(.)) } DT %>% transmute_sd(contains("m"), .(min, max, mean))
Since data.table
already supports this by means of its by
parameter,
the group_by
verb simply parses its input and assigns it accordingly:
DT %>% group_by(gear) %>% select(.N)
The key_by
verb does the same but sets the key of the input in order to sort:
DT %>% key_by(gear) %>% select(.N)
Using group_by
or key_by
by itself would not be useful.
Therefore, they are entirely lazy,
and simply return the expression builder that is used in the background.
Using other verbs will normally lead to the evaluation of the expression,
though some verbs can delay evaluation,
e.g. where
and other verbs that set i
when their respective .expr
parameters are set to TRUE
.
The distinct
verb is a shortcut for:
select(.SD[1])
with a by
clause if .keep = TRUE
..SDcols
if .keep
is a character vector.unique(.SD)
with specific .SDcols
if .keep = FALSE
.DT %>% distinct(vs, am) DT %>% distinct(vsam = vs + am, .keep = names(DT)) DT %>% distinct(vs, am, .keep = FALSE)
When working lazily,
i.e. with start_expr
/chain
/end_expr
,
there are special considerations that need to be taken into account.
Some data.table
idioms benefit from nesting expressions,
for instance this idiom to find,
by groups, rows where a column has its extrema.
The helper function nest_expr
can help in these cases.
When used inside a supported verb,
it eagerly builds a functional chain with %>%
,
where the currently captured data.table
enters as .
For example, the linked idiom could be expressed as:
DT %>% start_expr(.verbose = TRUE) %>% where(nest_expr(.start = FALSE, { .[, .I[mpg == max(mpg)], by = vs]$V1 })) %>% end_expr
Nevertheless, this specific logic has been encapsulated in the max_by
/min_by
verbs.
By default, the verbs defined in this package automatically start a new frame whenever they want to define one of i
/j
/by
,
but the current expression's frame has already specified it;
otherwise they add to the current frame.
More complex expressions are thus supported by automatically chaining data.table
frames:
DT %>% start_expr %>% select(mean_mpg = mean(mpg)) %>% where(hp > 50L) %>% group_by(vs, am, gear) %>% order_by(gear, -vs, am) %>% filter(mean_mpg > 20) %T>% print %>% end_expr %>% { invisible(print(., nrows = 10L)) }
If we wanted to be explicit about chaining whenever possible
(see below),
we could set options(table.express.chain = FALSE)
,
which would lead to a warning being shown whenever a part of the query is replaced.
The automatic chaining mentioned above is enough in most situations. For example the following chains lead to the same result, and therefore have the same semantics:
DT[mpg > 20, mpg * 2] DT[mpg > 20][, mpg * 2]
However, these two chains have different semantics:
DT[, .(mpg = mpg * 2)][mpg > 40] DT[mpg > 40, .(mpg = mpg * 2)]
As mentioned above,
chain
can be used to chain expressions by evaluating the current one with end_expr
,
and immediately capturing the resulting data.table
to start building a new expression.
This can be helpful in situations where automatic chaining
(or lack thereof)
can lead to a change in the expression's semantics:
DT %>% start_expr %>% transmute(mpg = mpg * 2) %>% filter(mpg > 40) %T>% print %>% end_expr DT %>% start_expr %>% transmute(mpg = mpg * 2) %>% chain %>% filter(mpg > 40) %>% end_expr
In the following cases, the mentioned verbs use the captured data.table
eagerly during expression building,
even when working lazily:
filter_sd
when .SDcols
is a tidyselect
call or a predicate using .COL
.mutate_sd
when .SDcols
is a tidyselect
call or a predicate using .COL
.nest_expr
in:anti_join
's y
.inner_join
's y
.right_join
's y
.filter
or where
.max_by
and min_by
, since they are based on nest_expr
.This can lead to unexpected results if we don't keep in mind the expression that is built:
DT[, mpg:disp] DT %>% start_expr %>% select(mpg:disp) %>% transmute(ans = sqrt(mpg)) %>% filter_sd(contains("m"), .COL > 0)
The select
gets rid of am
,
but filter_sd
sees the columns of DT
before any expression has been evaluated.
Explicit chaining can help in these cases,
capturing intermediate results:
DT %>% start_expr %>% select(mpg:disp) %>% chain %>% transmute(ans = sqrt(mpg)) %>% filter_sd(contains("m"), .COL > 0)
Many of the verbs in this package try to help make code more concise whenever possible.
However, some of the added convenience requires helper functions that aid with non-standard evaluation,
which doesn't always play nicely with the data.table
optimizations mentioned in the beginning.
Therefore, it is worth keeping the built expressions as close to traditional data.table
syntax as possible.
The verbs try to stick to this principle,
but in order to build a simple output expression,
they also require a simple input.
Let us illustrate this point with transmute_sd
,
which can be used to compute summaries:
DT %>% start_expr %>% group_by(am, vs) %>% transmute_sd(c("mpg", "disp"), mean) %>% frame_append(verbose = TRUE) %T>% print %>% end_expr
We can see, for example, lapply optimization changed j ...
.
Changing the expressions just a bit,
without affecting its semantics,
still hinders data.table
's ability to recognize cases it can optimize:
DT %>% start_expr %>% group_by(am, vs) %>% transmute_sd(c("mpg", "disp"), mean(.COL)) %>% frame_append(verbose = TRUE) %T>% print %>% end_expr
Corroborated by the statements j unchanged
, GForce FALSE
, etc.
Nevertheless, a different syntax can enable additional functionality,
for instance column renaming:
DT %>% group_by(am, vs) %>% transmute_sd(c("mpg", "disp"), .(avg = mean(.COL), min = min(.COL)))
Thus, in order to decide which syntax to use,
we must be aware of what can be optimized by data.table
,
but as a general rule of thumb,
the simpler the better.
DT %>% start_expr %>% group_by(am, vs) %>% transmute_sd(c("mpg", "disp"), .(min, max)) %>% frame_append(verbose = TRUE) %T>% print %>% end_expr
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.