Fast dcast for data.table
dcast.data.table is a much faster version of
reshape2::dcast, but for
data.tables. More importantly, it's capable of handling very large data quite efficiently in terms of memory usage in comparison to
dcast is a implemented as a S3 generic in
data.table. To melt or cast data.tables, it is not necessary to load
reshape2 anymore. If you have to, then load
reshape2 package before loading
dcast.data.table can now cast multiple
value.var columns and also accepts multiple functions under
fun.aggregate argument. See
examples for more.
1 2 3 4 5
A formula of the form LHS ~ RHS to cast, see details.
Should the data be aggregated before casting? If the formula doesn't identify single observation for each cell, then aggregation defaults to
NEW: it is possible to provide a list of functions to
Any other arguments that maybe passed to the aggregating function.
Not implemented yet. Should take variable names to compute margins on. A value of
Specified if casting should be done on subset of the data. Ex: subset = .(col1 <= 5) or subset = .(variable != "January").
Value to fill missing cells with. If
Name of the column whose values will be filled to cast. Function 'guess()' tries to, well, guess this column automatically, if none is provided.
NEW: it is possible to cast multiple
Not used yet. Maybe dropped in the future or used to provide information messages onto the console.
The cast formula takes the form
LHS ~ RHS , ex:
var1 + var2 ~ var3. The order of entries in the formula is essential. There are two special variables:
.... Their functionality is identical to that of
dcast also allows
value.var columns of type
When variable combinations in
formula doesn't identify a unique value in a cell,
fun.aggregate will have to be specified, which defaults to
length if unspecified. The aggregating function should take a vector as input and return a single value (or a list of length one) as output. In cases where
value.var is a list, the function should be able to handle a list input and provide a single value or list of length one as output.
If the formula's LHS contains the same column more than once, ex:
dcast(DT, x+x~ y), then the answer will have duplicate names. In those cases, the duplicate names are renamed using
make.unique so that key can be set without issues.
Names for columns that are being cast are generated in the same order (separated by an underscore,
_) from the (unique) values in each column mentioned in the formula RHS.
dcast tries to preserve attributes whereever possible.
v1.9.6, it is possible to cast multiple
value.var columns and also cast by providing multiple
fun.aggregate functions. Multiple
fun.aggregate functions should be provided as a
list, for e.g.,
list(mean, sum, function(x) paste(x, collapse="").
value.var can be either a character vector or list of length=1, or a list of length equal to
value.var is a character vector or a list of length 1, each function mentioned under
fun.aggregate is applied to every column specified under
value.var column. When
value.var is a list of length equal to
length(fun.aggregate) each element of
fun.aggregate is appled to each element of
data.table that has been cast. The key columns are equal to the variables in the
formula LHS in the same order.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37
require(data.table) names(ChickWeight) <- tolower(names(ChickWeight)) DT <- melt(as.data.table(ChickWeight), id=2:4) # calls melt.data.table # dcast is a S3 method in data.table from v1.9.6 dcast(DT, time ~ variable, fun=mean) dcast(DT, diet ~ variable, fun=mean) dcast(DT, diet+chick ~ time, drop=FALSE) dcast(DT, diet+chick ~ time, drop=FALSE, fill=0) # using subset dcast(DT, chick ~ time, fun=mean, subset=.(time < 10 & chick < 20)) ## Not run: # benchmark against reshape2's dcast, minimum of 3 runs set.seed(45) DT <- data.table(aa=sample(1e4, 1e6, TRUE), bb=sample(1e3, 1e6, TRUE), cc = sample(letters, 1e6, TRUE), dd=runif(1e6)) system.time(dcast(DT, aa ~ cc, fun=sum)) # 0.12 seconds system.time(dcast(DT, bb ~ cc, fun=mean)) # 0.04 seconds # reshape2::dcast takes 31 seconds system.time(dcast(DT, aa + bb ~ cc, fun=sum)) # 1.2 seconds ## End(Not run) # NEW FEATURE - multiple value.var and multiple fun.aggregate dt = data.table(x=sample(5,20,TRUE), y=sample(2,20,TRUE), z=sample(letters[1:2], 20,TRUE), d1 = runif(20), d2=1L) # multiple value.var dcast(dt, x + y ~ z, fun=sum, value.var=c("d1","d2")) # multiple fun.aggregate dcast(dt, x + y ~ z, fun=list(sum, mean), value.var="d1") # multiple fun.agg and value.var (all combinations) dcast(dt, x + y ~ z, fun=list(sum, mean), value.var=c("d1", "d2")) # multiple fun.agg and value.var (one-to-one) dcast(dt, x + y ~ z, fun=list(sum, mean), value.var=list("d1", "d2"))
Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.