Subset and aggregate multidimensional data with data.cube

Jan Gorecki, 2016-05-04

The data.cube class is defined in the data.cube R package.

Characteristics of [.data.cube covered in this article:

what is the same as in [.array

what it adds beyond [.array

what differs from [.array

Start session

If the package is not installed, run:

install.packages("data.cube", repos = paste0("https://", c("jangorecki.gitlab.io/data.cube","Rdatatable.github.io/data.table","cran.rstudio.com")))

library(data.table)
library(data.cube)

Tiny example

2x2x2 array vs data.cube.

set.seed(1)
# array
ar = array(rnorm(8,10,5), rep(2,3), 
           dimnames = list(color = c("green","red"), 
                           year = c("2014","2015"), 
                           country = c("IN","UK"))) # sorted
# cube normalized to a star schema using just the natural keys
dc = as.data.cube(ar)
# slice
ar["green", "2015",]
dc["green", "2015"]
format(dc["green", "2015"])
# dice
ar[c("green","red"), c("2014","2015"),]
dc[c("green","red"), c("2014","2015")]
format(dc[c("green","red"), c("2014","2015")])
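
# hedged sanity check: a dice keeps every filtered dimension, so both
# objects should still report a 2x2x2 shape here (dim() on a data.cube
# is demonstrated in the Scalability section below)
dim(ar[c("green","red"), c("2014","2015"), ])
dim(dc[c("green","red"), c("2014","2015")])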

# an exact tabular representation of the array is just a formatting of the cube
ar["green", c("2014","2015"),]
format(dc["green", c("2014","2015")], 
       dcast = TRUE, 
       formula = year ~ country)

ar[,"2015", c("UK","IN")]
format(dc[,"2015", c("UK","IN")], 
       dcast = TRUE, 
       formula = color ~ country) # dimension levels are sorted
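
# hedged round-trip check; assumptions: format() returns a plain data.table
# whose first column comes from the left-hand side of the formula, and
# data.table's as.matrix(x, rownames=) is used to rebuild the matrix
ft = format(dc["green", c("2014","2015")], dcast = TRUE, formula = year ~ country)
all.equal(unname(ar["green", , ]), unname(as.matrix(ft, rownames = "year")))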

Hierarchies example

Filtering on attributes in the hierarchies of a dimension.

# populate fact, dimensions and hierarchies
X = populate_star(N=1e5)
lapply(X[1:2], sapply, ncol)
lapply(X[1:2], sapply, nrow)
str(X[3L], max.level=3L) # hierarchy defined as list of levels and attributes on each level
dc = as.data.cube(X) # as.data.cube.list
print(dc)

# slice
dc["Mazda RX4"]
dc["Mazda RX4",,"BTC"]
# dice
dc[,, c("CNY","BTC"), c("GA","IA","AD")]

# custom order of arguments matching by dimension name
dc[geography = c("GA","IA","AD"), currency = c("CNY","BTC")]

# check dimensions
names(dc$dimensions)
# foreign keys and measures
names(dc)

# use dimension hierarchy attributes for slice and dice; mix filters from various levels of a hierarchy
dc["Mazda RX4",, .(curr_type = "crypto"),, .(time_year = 2014L, time_quarter_name = c("Q1","Q2"))]
# same as above but more verbose
dc[product = "Mazda RX4",
   customer = .(),
   currency = .(curr_type = "crypto"),
   geography = .(),
   time = .(time_year = 2014L, time_quarter_name = c("Q1","Q2"))]

# data.cube `[` operator returns another cube so queries can be chained
dc[,,, .(geog_region_name = "North Central")
   ][,,, .(geog_abb = c("IA","NV","MO")), .(time_year = 2014L)
     ]
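
# hedged: the chained call above should be equivalent to a single call that
# combines both geography filters in one .(...), assuming attribute filters
# within .(...) are ANDed (multiple filters in one .(...) are shown above)
dc[geography = .(geog_region_name = "North Central", geog_abb = c("IA","NV","MO")),
   time = .(time_year = 2014L)]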

Aggregates

A dimension can be collapsed into a sub-aggregate cube by passing the quoted symbol `-`. It is also possible to filter on that dimension before collapsing it, by passing the filter to `-`(...) in the same way as to .(...).
The example below filters on attributes and groups by customer and currency.

dc[product = `-`,
   customer = .(),
   currency = .(curr_type = "crypto"),
   geography = `-`,
   time = `-`(time_year = 2013:2014)]
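
Other groupings follow the same pattern; for example (a hedged sketch reusing only the operators shown above), totals by currency alone: keep currency with .() and collapse every other dimension with `-`.

dc[product = `-`, customer = `-`, currency = .(), geography = `-`, time = `-`]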

Scalability

Memory footprint: data.cube vs array.

# ~1e5 facts for 5 dims of cardinalities: 32, 32, 49, 50, 1826
dc = as.data.cube(populate_star(N=1e5))
## estimated memory required to store a base R `array` for a single numeric measure
sprintf("array: %.2f GB", (prod(dim(dc)) * 8)/(1024^3))
## fact table of the *cube* object, which holds multiple measures
sprintf("data.cube: %.2f GB", as.numeric(object.size(dc$fact$data))/(1024^3))

# ~1e6 facts for 5 dims of cardinalities: 32, 32, 49, 50, 1826
dc = as.data.cube(populate_star(N=1e6))
## estimated memory required to store a base R `array` for a single numeric measure
sprintf("array: %.2f GB", (prod(dim(dc)) * 8)/(1024^3))
## fact table of the *cube* object, which holds multiple measures
sprintf("data.cube: %.2f GB", as.numeric(object.size(dc$fact$data))/(1024^3))

# ~1e6 facts for 5 dims of cardinalities: 32, 32, 49, 50, 3652
# time dimension twice as large
dc = as.data.cube(populate_star(N=1e6, Y = c(2005L,2014L)))
## estimated memory required to store a base R `array` for a single numeric measure
sprintf("array: %.2f GB", (prod(dim(dc)) * 8)/(1024^3))
## fact table of the *cube* object, which holds multiple measures
sprintf("data.cube: %.2f GB", as.numeric(object.size(dc$fact$data))/(1024^3))

The memory cap is the most common scalability issue. It is addressed by sharding the fact table, using the big.data.table class instead of data.table inside the fact class object. The CI process includes basic testing of that feature on the old cube class; it has not yet been implemented for data.cube.

R's single-CPU limitation is being addressed by parallelizing some data.table calls; this work is happening at quite a low level and scales well.
Sharding can also offload CPU computation to other nodes.


