dcast.data.table: Fast dcast for data.table

Description Usage Arguments Details Value Note See Also Examples

Description

A dcast.data.table is a much faster version of reshape2:::dcast, but for data.table. More importantly, it's able to handle bigger data quite efficiently without compromising speed. This is still under development, meaning it's stable, but not all features are complete yet. Once complete, we plan to make it an S3 generic by making changes to reshape2:::dcast. Then by loading both data.table and reshape2, one could use dcast on data.table as one would on a data.frame.

If you think of a particular feature that might be useful, then file a feature request (FR) at the datatable projects page (link at the bottom).

Usage

1
2
3
4
5
## fast dcast a data.table (not an S3 method yet)
dcast.data.table(data, formula, fun.aggregate = NULL, 
	..., margins = NULL, subset = NULL, fill = NULL, 
	drop = TRUE, value.var = guess(data),
	verbose = getOption("datatable.verbose"))

Arguments

data

A molten data.table object, see melt.data.table

formula

A formula of the form LHS ~ RHS to cast, see details.

fun.aggregate

Should the data be aggregated before casting? If the formula doesn't identify single observation for each cell, then aggregation defaults to length with a message.

...

Any other arguments that maybe passed to the aggregating function.

margins

Not implemented yet. Should take variable names to compute margins on. A value of TRUE would compute all margins.

subset

Specified if casting should be done on subset of the data. Ex: subset = .(col1 <= 5) or subset = .(variable != "January").

fill

Value to fill missing cells with. If fun.aggregate is present, takes the value by applying the function on 0-length vector.

drop

FALSE will cast by including all missing combinations.

value.var

Name of the column whose values will be filled to cast. Function 'guess()' tries to, well, guess this column automatically, if none is provided.

verbose

Not used yet. Maybe dropped in the future or used to provide information messages onto the console.

Details

The cast formula takes the form LHS ~ RHS , ex: var1 + var2 ~ var3. The order of entries in the formula is essential. There are two special variables: . and .... Their functionality is identical to that of reshape2:::dcast.

dcast.data.table also allows value.var columns of type list.

When the combination of variables in formula doesn't identify a unique value in a cell, fun.aggregate will have to be used. The aggregating function should take a vector as input and return a single value (or a list of length one) as output. In cases where value.var is a list, the function should be able to handle a list input and provide a single value or list of length one as output.

If the formula's LHS contains the same column more than once, ex: dcast.data.table(DT, x+x~ y), then the answer will have duplicate names. In those cases, the duplicate names are renamed using make.unique so that the key can be set without issues.

Names for columns that are being cast are generated in the same order (separated by a _) from the (unique) values in each column mentioned in the formula RHS.

The only feature that's not implemented from reshape2:::dcast *yet* is the argument margins.

Value

A keyed data.table that has been cast. The key columns are equal to the variables in the formula LHS in the same order.

Note

Currently unimplemented features from reshape2:::dcast:

  1. At the moment, only formula interface is supported. That the argument formula can accept a list of quoted expressions is yet to be implemented.

  2. The argument margins is yet to be implemented.

New features and the ones likely to be added:

  1. Implemented: dcast.data.table allows value.var column to be of type list.

  2. To do: The fun.aggregate could accept more than one function. It could take a list of functions, all of which satisfy the criteria that they take a vector as input and provide a single value as output. This is very useful for scenarios where one needs to calculate multiple summary statistics on a data set and cast simultaneously.

See Also

melt.data.table, https://r-forge.r-project.org/projects/datatable/

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
require(data.table)
require(reshape2)
names(ChickWeight) <- tolower(names(ChickWeight))
DT <- melt(as.data.table(ChickWeight), id=2:4) # calls melt.data.table

# no S3 method yet, have to use "dcast.data.table"
dcast.data.table(DT, time ~ variable, fun=mean)
dcast.data.table(DT, diet ~ variable, fun=mean)
dcast.data.table(DT, diet+chick ~ time, drop=FALSE)
dcast.data.table(DT, diet+chick ~ time, drop=FALSE, fill=0)

# using subset
dcast.data.table(DT, chick ~ time, fun=mean, subset=.(time < 10 & chick < 20))

## Not run: 
# on big data
set.seed(45)
DT <- data.table(aa=sample(1e4, 1e6, TRUE), 
      bb=sample(1e3, 1e6, TRUE), 
      cc = sample(letters, 1e6, TRUE), dd=runif(1e6))
system.time(dcast.data.table(DT, aa ~ cc, fun=sum)) # 0.31 seconds
system.time(dcast.data.table(DT, bb ~ cc, fun=mean)) # 0.11 seconds
# reshape2:::dcast takes 192.1 seconds
system.time(dcast.data.table(DT, aa + bb ~ cc, fun=sum)) # 3.1 seconds

## End(Not run)

Example output

Loading required package: reshape2

Attaching package: 'reshape2'

The following objects are masked from 'package:data.table':

    dcast, melt

    time    weight
 1:    0  41.06000
 2:    2  49.22000
 3:    4  59.95918
 4:    6  74.30612
 5:    8  91.24490
 6:   10 107.83673
 7:   12 129.24490
 8:   14 143.81250
 9:   16 168.08511
10:   18 190.19149
11:   20 209.71739
12:   21 218.68889
   diet   weight
1:    1 102.6455
2:    2 122.6167
3:    3 142.9500
4:    4 135.2627
     diet chick  0  2  4  6   8  10  12  14  16  18  20  21
  1:    1    18 39 35 NA NA  NA  NA  NA  NA  NA  NA  NA  NA
  2:    1    16 41 45 49 51  57  51  54  NA  NA  NA  NA  NA
  3:    1    15 41 49 56 64  68  68  67  68  NA  NA  NA  NA
  4:    1    13 41 48 53 60  65  67  71  70  71  81  91  96
  5:    1     9 42 51 59 68  85  96  90  92  93 100 100  98
 ---                                                       
196:    4    49 40 53 64 85 108 128 152 166 184 203 233 237
197:    4    46 40 52 62 82 101 120 144 156 173 210 231 238
198:    4    50 41 54 67 84 105 122 155 175 205 234 264 264
199:    4    42 42 49 63 84 103 126 160 174 204 234 269 281
200:    4    48 39 50 62 80 104 125 154 170 222 261 303 322
     diet chick  0  2  4  6   8  10  12  14  16  18  20  21
  1:    1    18 39 35  0  0   0   0   0   0   0   0   0   0
  2:    1    16 41 45 49 51  57  51  54   0   0   0   0   0
  3:    1    15 41 49 56 64  68  68  67  68   0   0   0   0
  4:    1    13 41 48 53 60  65  67  71  70  71  81  91  96
  5:    1     9 42 51 59 68  85  96  90  92  93 100 100  98
 ---                                                       
196:    4    49 40 53 64 85 108 128 152 166 184 203 233 237
197:    4    46 40 52 62 82 101 120 144 156 173 210 231 238
198:    4    50 41 54 67 84 105 122 155 175 205 234 264 264
199:    4    42 42 49 63 84 103 126 160 174 204 234 269 281
200:    4    48 39 50 62 80 104 125 154 170 222 261 303 322
   chick  0  2   4   6   8
1:    18 39 35 NaN NaN NaN
2:    16 41 45  49  51  57
3:    15 41 49  56  64  68
4:    13 41 48  53  60  65
5:     9 42 51  59  68  85
Using 'dd' as value column. Use 'value.var' to override
   user  system elapsed 
  0.326   0.008   0.334 
Using 'dd' as value column. Use 'value.var' to override
   user  system elapsed 
  0.146   0.000   0.145 
Using 'dd' as value column. Use 'value.var' to override
   user  system elapsed 
  1.931   1.186   5.169 

data.table documentation built on May 2, 2019, 4:57 p.m.