Guide on Anchor Modeling in R

You should already have the anchormodeling R package installed. For configuring a fresh Linux environment, see the Setup Guide.

Define domain

The package already contains a working example model for the stage performances domain; see ?actor.am and ?actor.data to populate the model and data.

In this document I will define a basic model for git repository data.
You can use git's ability to travel back in time to easily produce data which evolves over time.
You can follow the example locally on your own machine.

Extract data from git

In the working directory, create a subdirectory and clone your repo into it.
I will use the bitcoin repo as it is a good example of a distributed project: around 300 contributors, 8400 commits and 4000 pull requests.

You can use a shell script to generate the source data directly from the git repo.

Initial load to the DW, limited by --before="2014-01-01".

mkdir wd
cd wd
git clone https://github.com/bitcoin/bitcoin
cd bitcoin
# modified https://gist.github.com/textarcana/1306223
git log \
    --before="2014-01-01" \
    --pretty=format:'{%n  "commit": "%H",%n  "author": "%an",%n  "author_email": "%ae",%n  "timestamp": "%at",%n  "message": "%f"%n},' \
    $@ | \
    perl -pe 'BEGIN{print "["}; END{print "]\n"}' | \
    perl -pe 's/},]/}]/' > ../commits2013.json

Data for incremental loading of the DW, limited by --before="2015-01-01".

git log \
    --before="2015-01-01" \
    --pretty=format:'{%n  "commit": "%H",%n  "author": "%an",%n  "author_email": "%ae",%n  "timestamp": "%at",%n  "message": "%f"%n},' \
    $@ | \
    perl -pe 'BEGIN{print "["}; END{print "]\n"}' | \
    perl -pe 's/},]/}]/' > ../commits2014.json

Model evolution.
Add per-file insertion/deletion details via git log --numstat.

git log \
    --before="2015-01-01" \
    --numstat \
    --format='%H' \
    $@ | \
    perl -lawne '
        if (defined $F[1]) {
            print qq#{"insertions": "$F[0]", "deletions": "$F[1]", "path": "$F[2]"},#
        } elsif (defined $F[0]) {
            print qq#],\n"$F[0]": [#
        };
        END{print qq#],#}' | \
    tail -n +2 | \
    perl -wpe 'BEGIN{print "{"}; END{print "}"}' | \
    tr '\n' ' ' | \
    perl -wpe 's#(]|}),\s*(]|})#$1$2#g' | \
    perl -wpe 's#,\s*?}$#}#' > ../numstats2014.json

Before continuing, check in R that all three source files were produced:

src_files <- c("commits2013.json","commits2014.json","numstats2014.json")
src_exists <- sapply(src_files, file.exists)
if(!all(src_exists)){
    stop(paste0("You need to populate source data in your working directory, missing files: ", paste(src_files[!src_exists], collapse=", "), ". Follow the *anchormodeling* vignette."))
}

In a new session, navigate to your wd directory and start R.

R

The source data JSON files are now ready to be picked up.
They can be easily loaded using jsonlite.

#library(devtools)
#load_all()
library(anchormodeling)
library(data.table) # setDT, rbindlist
library(jsonlite)
extract.commits <- function(x){
    stopifnot(file.exists(x))
    char_cols <- c("commit","author","author_email","message") # handle encoding
    setDT(fromJSON(x))[, `:=`(timestamp = as.POSIXct(as.integer(timestamp), origin="1970-01-01"))
                       ][, c(char_cols) := lapply(.SD, function(x) {Encoding(x) <- "unknown"; x}), .SDcols = char_cols]
}
extract.numstats <- function(x){
    stopifnot(file.exists(x))
    char_cols <- c("commit","path") # handle encoding
    rbindlist(fromJSON(x), idcol = "commit")[insertions=="-", insertions := NA_character_
                                             ][deletions=="-", deletions := NA_character_
                                               ][, `:=`(insertions=as.integer(insertions), deletions=as.integer(deletions))
                                                 ][, c(char_cols) := lapply(.SD, function(x) {Encoding(x) <- "unknown"; x}), .SDcols = char_cols]
}
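As a minimal, self-contained illustration of the jsonlite + data.table pattern used by the extract functions above (the inline JSON string and its values are made up for the example):

```r
library(jsonlite)
library(data.table)

# a tiny inline sample mimicking the structure of commits2013.json
json <- '[{"commit": "abc123", "author": "Jan", "author_email": "jan@example.com",
           "timestamp": "1388530800", "message": "Init"}]'
dt <- setDT(fromJSON(json))
# git reports the author time as unix epoch seconds
dt[, timestamp := as.POSIXct(as.integer(timestamp), origin = "1970-01-01")]
dt
```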

Define Anchor Model

am <- AM$new()
am$add$A(mne = "AU", desc = "Author")
am$add$a(mne = "NAM", desc = "Name", hist = TRUE, anchor = "AU")
am$add$a(mne = "EMA", desc = "Email", anchor = "AU")
am$add$A(mne = "CM", desc = "Commit")
am$add$a(mne = "TIM", desc = "Time", anchor = "CM")
am$add$a(mne = "MES", desc = "Message", anchor = "CM")
am$add$a(mne = "TAG", desc = "Tag", anchor = "CM")
am$add$a(mne = "HSH", desc = "Hash", anchor = "CM")
am$add$t(anchors = c("AU","CM"), roles = c("authoring","of"), identifier = c(1,Inf), hist=TRUE)

Run AM Data Warehouse instance

am$run()

Load data to DW

You need to define a mapping.
Each symbol nested in the list below stands for a character scalar supplied by the user when defining the mapping.

mapping <- list(anchor1_mne = list(natural_key_column_names,
                                   attr1_mne = column_name,
                                   attr2_mne = c(column_name, "hist" = historize_column_name),
                                   attr3_mne = column_name,
                                   attr4_mne = c(column_name, "hist" = historize_column_name)),
                anchor2_mne = list(natural_key_column_names,
                                   attr1_mne = column_name,
                                   attr2_mne = c(column_name, "hist" = historize_column_name),
                                   attr3_mne = column_name,
                                   attr4_mne = c(column_name, "hist" = historize_column_name)),
                tie_mne1_mne2 = list("knot" = knot_column_name,
                                     "hist" = historize_column_name))

If you prefer a constructor function style, the same mapping can be written using A() for anchors and a() for attributes.

mapping <- list(Amne = A(natural_key_column_names,
                         attr_mne = a(column_name, hist = "historize_column_name"),
                         ...),
                ...)

Applied to our simple git repo data model:

mapping <- list(CM = list("commit",
                          MES = "message",
                          TIM = "timestamp",
                          HSH = "commit"),
                AU = list("author_email", # natural key
                          NAM = c("author", hist = "timestamp"),
                          EMA = "author_email"),
                AU_CM = list(hist = "timestamp"))

The meta argument carries processing batch metadata. It can be a plain integer or a list; see the example.

commits1 <- extract.commits("commits2013.json")
am$load(mapping, commits1, meta = 1L)
am$load(mapping, commits1, meta = list(meta = 2L, user = "manual", src = "git log")) # log custom details

Duplicate inserts are handled automatically.
To control temporal duplicates, use the restatement feature: pass the rest = FALSE argument when creating attributes/knots, or set options("am.restatability" = FALSE) globally at the start.
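For example (a sketch based only on the argument and option named in this vignette):

```r
# disable restatability globally, before the model is defined
options("am.restatability" = FALSE)

# or per attribute/knot when adding it to the model
# (requires a running AM instance as built earlier in this vignette):
# am$add$a(mne = "NAM", desc = "Name", hist = TRUE, rest = FALSE, anchor = "AU")
```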

Query data

3NF views are not yet ready.

am$view("AU", type = "current") # default type
am$view("AU", type = "latest")
am$view("AU", type = "timepoint", time = as.POSIXct("2011-01-01 00:00:00", origin="1970-01-01"))
# am$view("AU", type = "difference", time = c(as.POSIXct("2012-01-01 00:00:00", origin="1970-01-01"), as.POSIXct("2012-02-01 00:00:00", origin="1970-01-01")))

Query metadata

Try each of the commands below.

am
am$etl # you should notice your OS user name on `meta==1L`
am$log
am$IM()

Save AM instance

# saving
am$stop()
save(am, mapping, extract.commits, extract.numstats, file = "git-am.RData")

# now you can shutdown your machine
rm(am, mapping, extract.commits, extract.numstats)

# loading
load("git-am.RData")
am$run()

Export model

am$xml()

The file can be loaded into the official Anchor Modeler test instance: roenbaeck.github.io/anchor
After loading the XML file, don't forget to click the Play button.
The exported XML does not contain non-model fields such as data types. It also does not export the restatability definition, as that is kept as model metadata outside the current XML schema. This may be subject to change in Anchor Modeling, see anchor#4.

The current model may look like this:

Git repo AM first iteration

If you don't like how your model looks, you can use the Layout menu and its Release all fixed or Randomize layout options.

Incremental loading

commits2 <- extract.commits("commits2014.json")
am$load(mapping, commits2, meta = list(meta = 3L, src = "git log"))
am
am$etl
am$log
am$IM()

Evolving model

Add per-file insertion and deletion details.
The FileEvolution anchor is built on a composite natural key (commit, path).

am$add$A(mne = "FI", desc = "FileEvolution")
am$add$a(mne = "PAT", desc = "Path", anchor = "FI")
am$add$a(mne = "INS", desc = "Insertions", anchor = "FI")
am$add$a(mne = "DEL", desc = "Deletions", anchor = "FI")
am$add$t(anchors = c("FI","CM"), roles = c("changed","in"), identifier = c(Inf,Inf))
mapping.numstat <- list(FI = list(c("commit","path"),
                                  PAT = "path",
                                  INS = "insertions",
                                  DEL = "deletions"),
                        CM = list("commit"),
                        FI_CM = list())
am$run()
numstats1 <- extract.numstats("numstats2014.json")
am$load(mapping.numstat, numstats1, meta = list(meta = 4L, src = "git log"))
am$view("FI")
am
am$etl
am$log
am$IM()

Questions

data.table was designed from the early days to handle time series data well. Thanks to its clustered key it can perform multiple types of joins very efficiently: equi joins, rolling joins (cross apply / cross join lateral), and overlapping joins (range joins); non-equi joins can be done via an overlapping join, or a cross join plus filter. See the implementation doc for details.
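A small sketch of the rolling-join behaviour mentioned above (the quote/trade data is made up for the example; requires the data.table package):

```r
library(data.table)

quotes <- data.table(time = as.POSIXct(c("2014-01-01 09:00:00", "2014-01-01 10:00:00")),
                     price = c(1.0, 1.1))
trades <- data.table(time = as.POSIXct("2014-01-01 09:30:00"),
                     qty = 5L)
setkey(quotes, time)
setkey(trades, time)

# rolling join: each trade is matched to the last quote at or before its timestamp
res <- quotes[trades, roll = TRUE]
```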

Model any data. Your model can have even a single anchor. A model can grow non-destructively over time, so you can add more entities later!
This document shows an example of loading git repo data; you can reproduce the same process for any git repo you want.
If you already have some DWH data, you can start with just a minimal subset of it.
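As a sketch of the single-anchor case, using only API calls already shown in this vignette (the PR/TIT mnemonics are made up; requires the anchormodeling package):

```r
library(anchormodeling)

# a second, minimal model: one anchor with a single attribute
am2 <- AM$new()
am2$add$A(mne = "PR", desc = "PullRequest")
am2$add$a(mne = "TIT", desc = "Title", anchor = "PR")
```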

The package is in development and ready for testing.
You are welcome to add unit tests, which can be included in the automated unit testing run on package build.

Important notes vs SQL anchor model



jangorecki/anchormodeling documentation built on May 18, 2019, 12:24 p.m.