dMcast: Casts or pivots a long 'data frame' into a wide sparse matrix


View source: R/Matrix.utils.R

Description

Similar in function to dcast, but returns a sparse Matrix. A sparse representation suits this operation because the result is often very wide and consists mostly of zeroes. Conceptually, this is a pivot operation.
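As a rough illustration of the pivot idea (not dMcast's own implementation), the sketch below builds a small wide sparse matrix by hand using only Matrix::sparseMatrix. The data frame `long` and its column names are made up for the example. sparseMatrix sums repeated (row, column) pairs, which mirrors the default fun.aggregate = "sum".

```r
library(Matrix)

# Hypothetical long-format data: one row per (id, key) observation.
long <- data.frame(
  id    = c("a", "a", "b", "c", "c"),
  key   = c("x", "y", "x", "z", "z"),
  value = c(1, 2, 3, 4, 5)
)

rows <- factor(long$id)
cols <- factor(long$key)

# Factor codes become the (i, j) positions; duplicate positions are summed.
wide <- sparseMatrix(
  i = as.integer(rows),
  j = as.integer(cols),
  x = long$value,
  dims = c(nlevels(rows), nlevels(cols)),
  dimnames = list(levels(rows), levels(cols))
)
print(wide)
```

Cells with no observation are simply not stored, which is what keeps a very wide result small.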

Usage

dMcast(
  data,
  formula,
  fun.aggregate = "sum",
  value.var = NULL,
  as.factors = FALSE,
  factor.nas = TRUE,
  drop.unused.levels = TRUE
)

Arguments

data

a data frame

formula

casting formula, see details for specifics.

fun.aggregate

name of aggregation function. Defaults to 'sum'

value.var

name of the column that stores the numeric values to be aggregated

as.factors

if TRUE, treat all columns as factors, including numeric columns

factor.nas

if TRUE, treat NAs in a factor as an additional level. Otherwise, rows with NAs receive zeroes in all columns for that factor

drop.unused.levels

should factors have unused levels dropped? Defaults to TRUE, in contrast to model.matrix
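The effect of factor.nas and drop.unused.levels can be pictured with plain base R factors. This is only a sketch of what the two options mean for the set of levels (and therefore columns); whether dMcast uses addNA or droplevels internally is an assumption, and the factor `f` is made up for the example.

```r
# Hypothetical factor with one missing value and one unused level "d".
f <- factor(c("a", NA, "b", "a"), levels = c("a", "b", "d"))

# factor.nas = TRUE in spirit: NA becomes an explicit level,
# so rows with NA get their own indicator column.
with_na <- addNA(f)
levels(with_na)

# drop.unused.levels = TRUE in spirit: levels that never occur are removed,
# so no all-zero column is produced for "d".
dropped <- droplevels(f)
levels(dropped)
```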

Details

Casting formulas differ slightly from those in dcast and follow the conventions of model.matrix. See formula for details. Briefly, the left-hand side of the ~ is used as the grouping criterion. It can be either a single variable or a group of variables linked using :. The right-hand side specifies what the columns will be. Unlike in dcast, the + operator appends the values for each variable as additional columns, which is useful for tasks such as one-hot encoding. Using : combines the columns as interactions.
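The difference between + and : on the right-hand side can be seen with base R's model.matrix, whose conventions the casting formula follows. The sketch below is dense and does not call dMcast; it only shows the resulting column layout for the built-in warpbreaks data, using full indicator coding for every factor (an assumption about how the one-hot columns are laid out).

```r
# Full indicator (one-hot) coding for each factor instead of treatment contrasts.
full <- lapply(warpbreaks[c("wool", "tension")], contrasts, contrasts = FALSE)

# "+" appends each variable's indicator columns side by side.
additive <- model.matrix(~ 0 + wool + tension, warpbreaks, contrasts.arg = full)
colnames(additive)

# ":" creates one column per combination of levels (an interaction).
crossed <- model.matrix(~ 0 + wool:tension, warpbreaks, contrasts.arg = full)
colnames(crossed)
```

With two levels of wool and three of tension, the additive form has one column per level of each variable, while the crossed form has one column per wool and tension pair.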

Value

a sparse Matrix
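For readers unfamiliar with sparse Matrix objects, the sketch below shows how such a result can be inspected. It builds a stand-in matrix with Matrix::sparseMatrix rather than calling dMcast, and the size `n` is arbitrary.

```r
library(Matrix)

# Hypothetical 1000 x 1000 matrix with a single non-zero entry per row.
n <- 1000
m <- sparseMatrix(i = seq_len(n), j = sample(n, n, replace = TRUE),
                  x = 1, dims = c(n, n))

is(m, "sparseMatrix")       # TRUE for any sparse Matrix class
object.size(m)              # only the non-zero entries are stored
object.size(as.matrix(m))   # a dense copy stores all n^2 cells
colSums(m)[1:5]             # ordinary matrix operations still work
```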

See Also

cast

dcast

Examples

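# Classic air quality example
# Minimal stand-in for reshape2::melt: stacks every non-id column into
# (variable, value) pairs alongside the id columns.
melt <- function(data, idColumns)
{
  cols <- setdiff(colnames(data), idColumns)
  results <- lapply(cols, function(x) cbind(data[, idColumns], variable = x, value = as.numeric(data[, x])))
  do.call(rbind, results)
}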
names(airquality) <- tolower(names(airquality))
aqm <- melt(airquality, idColumns=c("month", "day"))
dMcast(aqm, month:day ~variable,fun.aggregate = 'mean',value.var='value')
dMcast(aqm, month ~ variable, fun.aggregate = 'mean',value.var='value') 

#One hot encoding
#Preserving numerics
dMcast(warpbreaks,~.)
#Pivoting numerics as well
dMcast(warpbreaks,~.,as.factors=TRUE)

## Not run: 
library(data.table) # provides as.data.table() and dcast.data.table()
orders<-data.frame(orderNum=as.factor(sample(1e6, 1e7, TRUE)), 
   sku=as.factor(sample(1e3, 1e7, TRUE)), 
   customer=as.factor(sample(1e4,1e7,TRUE)), 
   state = sample(letters, 1e7, TRUE),
   amount=runif(1e7)) 
# For simple aggregations resulting in small tables, dcast.data.table (and
# reshape2) will be faster
system.time(a<-dcast.data.table(as.data.table(orders),sku~state,sum,
   value.var = 'amount')) # .5 seconds 
system.time(b<-reshape2::dcast(orders,sku~state,sum,
   value.var = 'amount')) # 2.61 seconds 
system.time(c<-dMcast(orders,sku~state,
   value.var = 'amount')) # 8.66 seconds 
   
# However, this situation changes as the result set becomes larger 
system.time(a<-dcast.data.table(as.data.table(orders),customer~sku,sum,
   value.var = 'amount')) # 4.4 seconds 
system.time(b<-reshape2::dcast(orders,customer~sku,sum,
   value.var = 'amount')) # 34.7 seconds 
system.time(c<-dMcast(orders,customer~sku,
   value.var = 'amount')) # 14.55 seconds 
   
# More complicated: 
system.time(a<-dcast.data.table(as.data.table(orders),customer~sku+state,sum,
   value.var = 'amount')) # 16.96 seconds, object size = 2084 Mb 
system.time(b<-reshape2::dcast(orders,customer~sku+state,sum,
   value.var = 'amount')) # Does not return 
system.time(c<-dMcast(orders,customer~sku:state,
   value.var = 'amount')) # 21.53 seconds, object size = 116.1 Mb

system.time(a<-dcast.data.table(as.data.table(orders),orderNum~sku,sum,
   value.var = 'amount')) # Does not return 
system.time(c<-dMcast(orders,orderNum~sku,
   value.var = 'amount')) # 24.83 seconds, object size = 175Mb

system.time(c<-dMcast(orders,sku:state~customer,
   value.var = 'amount')) # 17.97 seconds, object size = 175Mb
       

## End(Not run)

Matrix.utils documentation built on March 26, 2020, 5:52 p.m.