Guide on Anchor Modeling in R

You should already have the anchormodeling R package installed. For configuring a fresh Linux environment, see the Setup Guide.

Define domain

The package already contains a working example model for the stage performances domain; see ?actor.am and ?actor.data to populate the model and data.

In this document I will define a basic model for git repository data.
You can use git's ability to travel back in time to easily produce data which evolves over time.
You can follow the example locally on your own machine.

Extract data from git

In the working directory, create a subdirectory and clone your repo into it.
I will use the bitcoin repo as it is a good example of a distributed project: around 300 contributors, 8400 commits and 4000 pull requests.

You can use a shell script to generate the source data directly from the git repo.

Initial load to the DW, limited by --before="2014-01-01".

mkdir wd
cd wd
git clone https://github.com/bitcoin/bitcoin
cd bitcoin
# modified https://gist.github.com/textarcana/1306223
git log \
    --before="2014-01-01" \
    --pretty=format:'{%n  "commit": "%H",%n  "author": "%an",%n  "author_email": "%ae",%n  "timestamp": "%at",%n  "message": "%f"%n},' \
    $@ | \
    perl -pe 'BEGIN{print "["}; END{print "]\n"}' | \
    perl -pe 's/},]/}]/' > ../commits2013.json

Data for incremental loading of the DW, limited by --before="2015-01-01".

git log \
    --before="2015-01-01" \
    --pretty=format:'{%n  "commit": "%H",%n  "author": "%an",%n  "author_email": "%ae",%n  "timestamp": "%at",%n  "message": "%f"%n},' \
    $@ | \
    perl -pe 'BEGIN{print "["}; END{print "]\n"}' | \
    perl -pe 's/},]/}]/' > ../commits2014.json

Model evolution.
Add per-file insertion/deletion details via git log --numstat.

git log \
    --before="2015-01-01" \
    --numstat \
    --format='%H' \
    $@ | \
    perl -lawne '
        if (defined $F[1]) {
            print qq#{"insertions": "$F[0]", "deletions": "$F[1]", "path": "$F[2]"},#
        } elsif (defined $F[0]) {
            print qq#],\n"$F[0]": [#
        };
        END{print qq#],#}' | \
    tail -n +2 | \
    perl -wpe 'BEGIN{print "{"}; END{print "}"}' | \
    tr '\n' ' ' | \
    perl -wpe 's#(]|}),\s*(]|})#$1$2#g' | \
    perl -wpe 's#,\s*?}$#}#' > ../numstats2014.json

Before continuing, check in R that all three source files were produced:

src_files <- c("commits2013.json","commits2014.json","numstats2014.json")
src_exists <- sapply(src_files, file.exists)
if(!all(src_exists)){
    stop(paste0("You need to populate source data in your working directory, missing files: ", paste(src_files[!src_exists], collapse=", "), ". Follow the *anchormodeling* vignette."))
}

In a new session, navigate to your wd directory and start R.

R

The source data JSON files are now ready to be picked up.
They can be easily loaded using jsonlite.

#library(devtools)
#load_all()
library(anchormodeling)
library(data.table) # setDT, rbindlist
library(jsonlite)
extract.commits <- function(x){
    stopifnot(file.exists(x))
    char_cols <- c("commit","author","author_email","message") # handle encoding
    setDT(fromJSON(x))[, `:=`(timestamp = as.POSIXct(as.integer(timestamp), origin="1970-01-01"))
                       ][, c(char_cols) := lapply(.SD, function(x) {Encoding(x) <- "unknown"; x}), .SDcols = char_cols]
}
extract.numstats <- function(x){
    stopifnot(file.exists(x))
    char_cols <- c("commit","path") # handle encoding
    rbindlist(fromJSON(x), idcol = "commit")[insertions=="-", insertions := NA_character_
                                             ][deletions=="-", deletions := NA_character_
                                               ][, `:=`(insertions=as.integer(insertions), deletions=as.integer(deletions))
                                                 ][, c(char_cols) := lapply(.SD, function(x) {Encoding(x) <- "unknown"; x}), .SDcols = char_cols]
}
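As a minimal, self-contained illustration of the jsonlite + data.table pattern used by the extract functions above (the inline JSON string and its values are made up for the example):

```r
library(jsonlite)
library(data.table)

# a tiny inline sample mimicking the structure of commits2013.json
json <- '[{"commit": "abc123", "author": "Jan", "author_email": "jan@example.com",
           "timestamp": "1388530800", "message": "Init"}]'
dt <- setDT(fromJSON(json))
# git reports the author time as unix epoch seconds
dt[, timestamp := as.POSIXct(as.integer(timestamp), origin = "1970-01-01")]
dt
```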

Define Anchor Model

am <- AM$new()
am$add$A(mne = "AU", desc = "Author")
am$add$a(mne = "NAM", desc = "Name", hist = TRUE, anchor = "AU")
am$add$a(mne = "EMA", desc = "Email", anchor = "AU")
am$add$A(mne = "CM", desc = "Commit")
am$add$a(mne = "TIM", desc = "Time", anchor = "CM")
am$add$a(mne = "MES", desc = "Message", anchor = "CM")
am$add$a(mne = "TAG", desc = "Tag", anchor = "CM")
am$add$a(mne = "HSH", desc = "Hash", anchor = "CM")
am$add$t(anchors = c("AU","CM"), roles = c("authoring","of"), identifier = c(1,Inf), hist=TRUE)

Run AM Data Warehouse instance

am$run()

Load data to DW

You need to define a mapping.
Each symbol nested in the list below stands for a character scalar supplied by the user when defining the mapping.

mapping <- list(anchor1_mne = list(natural_key_column_names,
                                   attr1_mne = column_name,
                                   attr2_mne = c(column_name, "hist" = historize_column_name),
                                   attr3_mne = column_name,
                                   attr4_mne = c(column_name, "hist" = historize_column_name)),
                anchor2_mne = list(natural_key_column_names,
                                   attr1_mne = column_name,
                                   attr2_mne = c(column_name, "hist" = historize_column_name),
                                   attr3_mne = column_name,
                                   attr4_mne = c(column_name, "hist" = historize_column_name)),
                tie_mne1_mne2 = list("knot" = knot_column_name,
                                     "hist" = historize_column_name))

If you prefer a constructor function style, the same mapping can be written using A() for anchors and a() for attributes.

mapping <- list(Amne = A(natural_key_column_names,
                         attr_mne = a(column_name, hist = "historize_column_name"),
                         ...),
                ...)

Applied to our simple git repo data model:

mapping <- list(CM = list("commit",
                          MES = "message",
                          TIM = "timestamp",
                          HSH = "commit"),
                AU = list("author_email", # natural key
                          NAM = c("author", hist = "timestamp"),
                          EMA = "author_email"),
                AU_CM = list(hist = "timestamp"))

The meta argument carries processing batch metadata. It can be a plain integer or a list; see the example.

commits1 <- extract.commits("commits2013.json")
am$load(mapping, commits1, meta = 1L)
am$load(mapping, commits1, meta = list(meta = 2L, user = "manual", src = "git log")) # log custom details

Duplicate inserts are handled automatically.
To control temporal duplicates, use the restatement feature: pass the rest = FALSE argument when creating attributes/knots, or set options("am.restatability" = FALSE) globally at the start.
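For example (a sketch based only on the argument and option named in this vignette):

```r
# disable restatability globally, before the model is defined
options("am.restatability" = FALSE)

# or per attribute/knot when adding it to the model
# (requires a running AM instance as built earlier in this vignette):
# am$add$a(mne = "NAM", desc = "Name", hist = TRUE, rest = FALSE, anchor = "AU")
```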

Query data

3NF views are not yet ready.

am$view("AU", type = "current") # default type
am$view("AU", type = "latest")
am$view("AU", type = "timepoint", time = as.POSIXct("2011-01-01 00:00:00", origin="1970-01-01"))
# am$view("AU", type = "difference", time = c(as.POSIXct("2012-01-01 00:00:00", origin="1970-01-01"), as.POSIXct("2012-02-01 00:00:00", origin="1970-01-01")))

Query metadata

Try each of the commands below.

am
am$etl # you should notice your OS user name on `meta==1L`
am$log
am$IM()

Save AM instance

# saving
am$stop()
save(am, mapping, extract.commits, extract.numstats, file = "git-am.RData")

# now you can shutdown your machine
rm(am, mapping, extract.commits, extract.numstats)

# loading
load("git-am.RData")
am$run()

Export model

am$xml()

The file can be loaded into the official Anchor Modeler test instance: roenbaeck.github.io/anchor
After loading the XML file, don't forget to click the Play button.
The exported XML does not contain non-model fields such as data types. It also does not export the restatability definition, as that is kept as model metadata outside the current XML schema. This may be subject to change in Anchor Modeling, see anchor#4.

The current model may look like this:

Git repo AM first iteration

If you don't like how your model looks, you can use the Layout menu and its Release all fixed or Randomize layout options.

Incremental loading

commits2 <- extract.commits("commits2014.json")
am$load(mapping, commits2, meta = list(meta = 3L, src = "git log"))
am
am$etl
am$log
am$IM()

Evolving model

Add per-file insertion and deletion details.
The FileEvolution anchor is built on a composite natural key (commit, path).

am$add$A(mne = "FI", desc = "FileEvolution")
am$add$a(mne = "PAT", desc = "Path", anchor = "FI")
am$add$a(mne = "INS", desc = "Insertions", anchor = "FI")
am$add$a(mne = "DEL", desc = "Deletions", anchor = "FI")
am$add$t(anchors = c("FI","CM"), roles = c("changed","in"), identifier = c(Inf,Inf))
mapping.numstat <- list(FI = list(c("commit","path"),
                                  PAT = "path",
                                  INS = "insertions",
                                  DEL = "deletions"),
                        CM = list("commit"),
                        FI_CM = list())
am$run()
numstats1 <- extract.numstats("numstats2014.json")
am$load(mapping.numstat, numstats1, meta = list(meta = 4L, src = "git log"))
am$view("FI")
am
am$etl
am$log
am$IM()

Questions

data.table was designed from the early days to handle time series data well. Thanks to its clustered key it can perform multiple types of joins very efficiently: equi joins, rolling joins (cross apply / cross join lateral), and overlapping joins (range joins); non-equi joins can be done via an overlapping join, or a cross join plus filter. See the implementation doc for details.
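A small sketch of the rolling-join behaviour mentioned above (the quote/trade data is made up for the example; requires the data.table package):

```r
library(data.table)

quotes <- data.table(time = as.POSIXct(c("2014-01-01 09:00:00", "2014-01-01 10:00:00")),
                     price = c(1.0, 1.1))
trades <- data.table(time = as.POSIXct("2014-01-01 09:30:00"),
                     qty = 5L)
setkey(quotes, time)
setkey(trades, time)

# rolling join: each trade is matched to the last quote at or before its timestamp
res <- quotes[trades, roll = TRUE]
```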

Model any data. Your model can have even a single anchor. A model can grow non-destructively over time, so you can add more entities later!
This document shows an example of loading git repo data; you can reproduce the same process for any git repo you want.
If you already have some DWH data, you can start with just a minimal subset of it.
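As a sketch of the single-anchor case, using only API calls already shown in this vignette (the PR/TIT mnemonics are made up; requires the anchormodeling package):

```r
library(anchormodeling)

# a second, minimal model: one anchor with a single attribute
am2 <- AM$new()
am2$add$A(mne = "PR", desc = "PullRequest")
am2$add$a(mne = "TIT", desc = "Title", anchor = "PR")
```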

The package is in development and ready for testing.
You are welcome to add unit tests, which can be included in the automated unit testing run on package build.

Important notes vs SQL anchor model



jangorecki/anchormodeling documentation built on May 18, 2019, 12:24 p.m.