build_hierarchy: Build hierarchy
In jangorecki/dwtools: Data Warehouse related functions.

Description Usage Arguments Details Value See Also Examples

Detect hierarchies in the dataset and normalize to star schema, optionally deploy new model to db.

build_hierarchy(x, factname = "fact", dimnames = "auto", setkey = TRUE,
  deploy = FALSE, db.conn.name, .db.conns = getOption("dwtools.conns"),
  timing = getOption("dwtools.timing"),
  verbose = getOption("dwtools.verbose"))

`x`	data.table source dataset.
`factname`	character, default fact.
`dimnames`	character currently only auto supported, default. Dimension names will be created based on the common words in the fields which forms the dimension.
`setkey`	logical if return tables with keys.
`deploy`	logical, TRUE will overwrite the data in the target tables in connection db.conn.name.
`db.conn.name`	character deploy db connection name.
`.db.conns`	list of connections uniquely named. See db function.
`timing`	logical measure timing for vectorized usage, read timing, for single row timing summary use `timing(build_hierarchy(...))`.
`verbose`	integer, if greater than 0 then print debugging messages.

Only basic star schema normalization will be created. All numeric fields is considered as measures, others as dimensions (including integer fields). Allocation of columns to dimension is performed based on the groupings count unique of all variable pairs. See cardinality element of the result for cardinality matrix. Due to extensive computional processing of the function be aware it can take some time to return the results. You can use timing argument to register sub processes time or verbose to display processing messages.

List of: tables (multiple normalized R data.tables) cardinality matrix represents groupings between all columns (computionally extensive for big datasets) like length(unq(col1))/nrow(unique(data.table(col1,col2))). lists of parents (including same entity attrs with any cardinality) and list of childs (including same entity attrs with exact cardinality) for each field.

joinbyv, db, timing

suppressPackageStartupMessages(library(dwtools))

## simple 2 dimension case

X = dw.populate(N=1e5, scenario="star")
x <- joinbyv(X$SALES, join=list(X$CURRENCY, X$GEOGRAPHY))
dw <- build_hierarchy(x, factname="fact_sales")
sapply(dw$tables,ncol)
sapply(dw$tables,nrow)

## 5 dimensions

x <- dw.populate(N=1e5, scenario="denormalize")
names(x)
system.time(dw <- build_hierarchy(x,factname="fact_sales"))
# cardinality matrix, top 10x10
dw$cardinality[1:10,1:10]
# normalized tables into star schema
lapply(dw$tables,head,3)
# relation defintion, processing meta data also available
names(dw)

# better timing, setup ?db connection to automatically store logs in db
dw <- build_hierarchy(x,factname="fact_sales",timing=TRUE)
# print timing expressions and the rest of log entry
get.timing(FALSE,last=2L)[,{cat(paste0("\n# ",tag,"\n",expr),"\n",sep=""); .SD}][,.SD,.SDcols=-c("expr")]

## shiny app to browse model, will use following vars in Global Env:
# x - denormalized table
# dw - normalization results
if(interactive()) shiny::runApp(system.file("shinyBI", package="dwtools"))

## deploy to db

library(RSQLite)
sqlite1 = list(drvName="SQLite",dbname="sqlite1.db") # setup connection to db
sqlite1$conn = dbConnect(SQLite(), dbname=sqlite1$dbname)
options("dwtools.db.conns"=list(sqlite1=sqlite1))
x <- dw.populate(N=1e5, scenario="denormalize")
dw <- build_hierarchy(x,factname="fact_sales",deploy=TRUE,db.conn.name="sqlite1",timing=TRUE,verbose=1L)
get.timing()
db(c("dim_geog","fact_sales"))
dbDisconnect(sqlite1$conn); file.remove(sqlite1$dbname); options("dwtools.db.conns"=NULL); purge.timing()

## different data quality use cases tests:

# reorder columns
#setcolorder(x,sample(names(x),length(x),FALSE)) # random column order

# alter column name to disable common_words feature
#setnames(x,c("curr_code","curr_type"),c("curr1_code","curr2_type")) # no common words within fields in dimension

# broken hierarchy
#`:=`

# NAs
#`:=`