Fast dcast for data.table
Description
dcast.data.table
is a much faster version of reshape2::dcast
, but for data.table
s. More importantly, it's capable of handling very large data quite efficiently in terms of memory usage in comparison to reshape2::dcast
.
From 1.9.6, dcast
is implemented as an S3 generic in data.table
. To melt or cast data.table
s, it is not necessary to load reshape2
anymore. If you have load reshape2
, do so before loading data.table
to prevent unwanted masking.
NEW: dcast.data.table
can now cast multiple value.var
columns and also accepts multiple functions to fun.aggregate
. See Examples for more.
Usage
1 2 3 4 5 
Arguments
data 
A 
formula 
A formula of the form LHS ~ RHS to cast, see Details. 
fun.aggregate 
Should the data be aggregated before casting? If the formula doesn't identify a single observation for each cell, then aggregation defaults to NEW: it is possible to provide a list of functions to 
sep 
Character vector of length 1, indicating the separating character in variable names generated during casting. Default is 
... 
Any other arguments that may be passed to the aggregating function. 
margins 
Not implemented yet. Should take variable names to compute margins on. A value of 
subset 
Specified if casting should be done on a subset of the data. Ex: 
fill 
Value with which to fill missing cells. If 
drop 
NEW: Following #1512, 
value.var 
Name of the column whose values will be filled to cast. Function 'guess()' tries to, well, guess this column automatically, if none is provided. NEW: it is now possible to cast multiple 
verbose 
Not used yet. May be dropped in the future or used to provide informative messages through the console. 
Details
The cast formula takes the form LHS ~ RHS
, ex: var1 + var2 ~ var3
. The order of entries in the formula is essential. There are two special variables: .
and ...
. .
represents no variable; ...
represents all variables not otherwise mentioned in formula
; see Examples.
dcast
also allows value.var
columns of type list
.
When variable combinations in formula
doesn't identify a unique value in a cell, fun.aggregate
will have to be specified, which defaults to length
if unspecified. The aggregating function should take a vector as input and return a single value (or a list of length one) as output. In cases where value.var
is a list, the function should be able to handle a list input and provide a single value or list of length one as output.
If the formula's LHS contains the same column more than once, ex: dcast(DT, x+x~ y)
, then the answer will have duplicate names. In those cases, the duplicate names are renamed using make.unique
so that key can be set without issues.
Names for columns that are being cast are generated in the same order (separated by an underscore, _
) from the (unique) values in each column mentioned in the formula RHS.
From v1.9.4
, dcast
tries to preserve attributes whereever possible.
NEW: From v1.9.6
, it is possible to cast multiple value.var
columns and also cast by providing multiple fun.aggregate
functions. Multiple fun.aggregate
functions should be provided as a list
, for e.g., list(mean, sum, function(x) paste(x, collapse="")
. value.var
can be either a character vector or list of length=1, or a list of length equal to length(fun.aggregate)
. When value.var
is a character vector or a list of length 1, each function mentioned under fun.aggregate
is applied to every column specified under value.var
column. When value.var
is a list of length equal to length(fun.aggregate)
each element of fun.aggregate
is appled to each element of value.var
column.
Value
A keyed data.table
that has been cast. The key columns are equal to the variables in the formula
LHS in the same order.
See Also
melt.data.table
, rowid
, https://cran.rproject.org/package=reshape
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52 53 54 55 56 57 58 59  require(data.table)
names(ChickWeight) < tolower(names(ChickWeight))
DT < melt(as.data.table(ChickWeight), id=2:4) # calls melt.data.table
# dcast is a S3 method in data.table from v1.9.6
dcast(DT, time ~ variable, fun=mean)
dcast(DT, diet ~ variable, fun=mean)
dcast(DT, diet+chick ~ time, drop=FALSE)
dcast(DT, diet+chick ~ time, drop=FALSE, fill=0)
# using subset
dcast(DT, chick ~ time, fun=mean, subset=.(time < 10 & chick < 20))
# drop argument, #1512
DT < data.table(v1 = c(1.1, 1.1, 1.1, 2.2, 2.2, 2.2),
v2 = factor(c(1L, 1L, 1L, 3L, 3L, 3L), levels=1:3),
v3 = factor(c(2L, 3L, 5L, 1L, 2L, 6L), levels=1:6),
v4 = c(3L, 2L, 2L, 5L, 4L, 3L))
# drop=TRUE
dcast(DT, v1 + v2 ~ v3) # default is drop=TRUE
dcast(DT, v1 + v2 ~ v3, drop=FALSE) # all missing combinations of both LHS and RHS
dcast(DT, v1 + v2 ~ v3, drop=c(FALSE, TRUE)) # all missing combinations of only LHS
dcast(DT, v1 + v2 ~ v3, drop=c(TRUE, FALSE)) # all missing combinations of only RHS
# using . and ...
DT < data.table(v1 = rep(1:2, each = 6),
v2 = rep(rep(1:3, 2), each = 2),
v3 = rep(1:2, 6),
v4 = rnorm(6))
dcast(DT, ... ~ v3, value.var = "v4") #same as v1 + v2 ~ v3, value.var = "v4"
dcast(DT, v1 + v2 + v3 ~ ., value.var = "v4")
## for each combination of (v1, v2), add up all values of v4
dcast(DT, v1 + v2 ~ ., value.var = "v4", fun.aggregate = sum)
## Not run:
# benchmark against reshape2's dcast, minimum of 3 runs
set.seed(45)
DT < data.table(aa=sample(1e4, 1e6, TRUE),
bb=sample(1e3, 1e6, TRUE),
cc = sample(letters, 1e6, TRUE), dd=runif(1e6))
system.time(dcast(DT, aa ~ cc, fun=sum)) # 0.12 seconds
system.time(dcast(DT, bb ~ cc, fun=mean)) # 0.04 seconds
# reshape2::dcast takes 31 seconds
system.time(dcast(DT, aa + bb ~ cc, fun=sum)) # 1.2 seconds
## End(Not run)
# NEW FEATURE  multiple value.var and multiple fun.aggregate
dt = data.table(x=sample(5,20,TRUE), y=sample(2,20,TRUE),
z=sample(letters[1:2], 20,TRUE), d1 = runif(20), d2=1L)
# multiple value.var
dcast(dt, x + y ~ z, fun=sum, value.var=c("d1","d2"))
# multiple fun.aggregate
dcast(dt, x + y ~ z, fun=list(sum, mean), value.var="d1")
# multiple fun.agg and value.var (all combinations)
dcast(dt, x + y ~ z, fun=list(sum, mean), value.var=c("d1", "d2"))
# multiple fun.agg and value.var (onetoone)
dcast(dt, x + y ~ z, fun=list(sum, mean), value.var=list("d1", "d2"))
