You should already have the anchormodeling R package installed. For configuring a fresh Linux environment see the Setup Guide.
The package ships a working example model for the stage performances domain; see ?actor.am and ?actor.data to populate the model and its data.
In this document I define a basic model for git repository data.
You can use git's ability to travel back in time to easily produce data that evolves over time, so you can follow the example locally on your own machine.
In the working directory, create a subdirectory and git clone your repository into it.
I will use the bitcoin repository, as it is a good example of a distributed project: about 300 contributors, 8400 commits, and 4000 pull requests.
You can use a shell script to generate the source data directly from the git repository.
The initial load to the DW is limited by --before="2014-01-01".
mkdir wd
cd wd
git clone https://github.com/bitcoin/bitcoin
cd bitcoin
# modified https://gist.github.com/textarcana/1306223
git log \
    --before="2014-01-01" \
    --pretty=format:'{%n  "commit": "%H",%n  "author": "%an",%n  "author_email": "%ae",%n  "timestamp": "%at",%n  "message": "%f"%n},' \
    $@ | \
    perl -pe 'BEGIN{print "["}; END{print "]\n"}' | \
    perl -pe 's/},]/}]/' > ../commits2013.json
Data for the incremental load of the DW, limited by --before="2015-01-01".
git log \
    --before="2015-01-01" \
    --pretty=format:'{%n  "commit": "%H",%n  "author": "%an",%n  "author_email": "%ae",%n  "timestamp": "%at",%n  "message": "%f"%n},' \
    $@ | \
    perl -pe 'BEGIN{print "["}; END{print "]\n"}' | \
    perl -pe 's/},]/}]/' > ../commits2014.json
Model evolution: add per-file insertion/deletion details using git log --numstat.
git log \
    --before="2015-01-01" \
    --numstat \
    --format='%H' \
    $@ | \
    perl -lawne '
        if (defined $F[1]) {
            print qq#{"insertions": "$F[0]", "deletions": "$F[1]", "path": "$F[2]"},#
        } elsif (defined $F[0]) {
            print qq#],\n"$F[0]": [#
        };
        END{print qq#],#}' | \
    tail -n +2 | \
    perl -wpe 'BEGIN{print "{"}; END{print "}"}' | \
    tr '\n' ' ' | \
    perl -wpe 's#(]|}),\s*(]|})#$1$2#g' | \
    perl -wpe 's#,\s*?}$#}#' > ../numstats2014.json
src_files <- c("commits2013.json", "commits2014.json", "numstats2014.json")
src_exists <- sapply(src_files, file.exists)
if (!all(src_exists)) {
    stop(paste0("You need to populate source data in your working directory, missing files: ",
                paste(src_files[!src_exists], collapse = ", "),
                ". Follow *anchormodeling* vignette."))
}
In a new session, navigate to your wd directory and run R.
R
The source data JSON files are now ready to be picked up and can be easily loaded using jsonlite.
# library(devtools)
# load_all()
library(anchormodeling)
library(jsonlite)
extract.commits <- function(x){
    stopifnot(file.exists(x))
    char_cols <- c("commit", "author", "author_email", "message")
    # handle encoding
    setDT(fromJSON(x))[, `:=`(timestamp = as.POSIXct(as.integer(timestamp), origin = "1970-01-01"))
    ][, c(char_cols) := lapply(.SD, function(x) {Encoding(x) <- "unknown"; x}), .SDcols = char_cols]
}
extract.numstats <- function(x){
    stopifnot(file.exists(x))
    char_cols <- c("commit", "path")
    # handle encoding
    rbindlist(fromJSON(x), idcol = "commit")[insertions == "-", insertions := NA_character_
    ][deletions == "-", deletions := NA_character_
    ][, `:=`(insertions = as.integer(insertions), deletions = as.integer(deletions))
    ][, c(char_cols) := lapply(.SD, function(x) {Encoding(x) <- "unknown"; x}), .SDcols = char_cols]
}
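As a minimal, self-contained illustration of what these extract functions build on (using a hypothetical one-commit JSON string instead of the real files), jsonlite::fromJSON turns a JSON array of commit objects into a data.frame with one row per commit:

```r
library(jsonlite)

# hypothetical single-commit payload, mirroring the fields produced by the git log format above
json <- '[{"commit": "abc123", "author": "Jane", "author_email": "jane@example.com",
           "timestamp": "1388534400", "message": "Fix-build"}]'
commits <- fromJSON(json)  # data.frame, one row per commit, all columns character
# convert the unix epoch seconds to POSIXct, as extract.commits does
commits$timestamp <- as.POSIXct(as.integer(commits$timestamp),
                                origin = "1970-01-01", tz = "UTC")
commits
```

The real extract.commits additionally wraps the result with data.table::setDT and normalizes character encodings.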
am <- AM$new()
am$add$A(mne = "AU", desc = "Author")
am$add$a(mne = "NAM", desc = "Name", hist = TRUE, anchor = "AU")
am$add$a(mne = "EMA", desc = "Email", anchor = "AU")
am$add$A(mne = "CM", desc = "Commit")
am$add$a(mne = "TIM", desc = "Time", anchor = "CM")
am$add$a(mne = "MES", desc = "Message", anchor = "CM")
am$add$a(mne = "TAG", desc = "Tag", anchor = "CM")
am$add$a(mne = "HSH", desc = "Hash", anchor = "CM")
am$add$t(anchors = c("AU","CM"), roles = c("authoring","of"), identifier = c(1, Inf), hist = TRUE)
am$run()
You need to define a mapping.
Each symbol nested in the list below stands for a character scalar supplied by the user when defining the mapping.
mapping <- list(
    anchor1_mne = list(natural_key_column_names,
                       attr1_mne = column_name,
                       attr2_mne = c(column_name, "hist" = historize_column_name),
                       attr3_mne = column_name,
                       attr4_mne = c(column_name, "hist" = historize_column_name)),
    anchor2_mne = list(natural_key_column_names,
                       attr1_mne = column_name,
                       attr2_mne = c(column_name, "hist" = historize_column_name),
                       attr3_mne = column_name,
                       attr4_mne = c(column_name, "hist" = historize_column_name)),
    tie_mne1_mne2_mne2 = list("knot" = knot_column_name, "hist" = historize_column_name)
)
If you prefer a single functional interface for everything, you can use A() for anchors and a() for attributes.
mapping <- list(Amne = A(natural_key_column_names,
                         attr_mne = a(column_name, hist = "historize_column_name"),
                         ...),
                ...)
For our simple git repository model the mapping looks as follows.
mapping <- list(
    CM = list("commit",
              MES = "message",
              TIM = "timestamp",
              HSH = "commit"),
    AU = list("author_email", # natural key
              NAM = c("author", hist = "timestamp"),
              EMA = "author_email"),
    AU_CM = list(hist = "timestamp")
)
The meta argument carries processing batch metadata. It can be an integer or a list, see the example.
commits1 <- extract.commits("commits2013.json")
am$load(mapping, commits1, meta = 1L)
am$load(mapping, commits1, meta = list(meta = 2L, user = "manual", src = "git log")) # log custom details
Duplicate inserts are handled automatically.
To control temporal duplicates, use the restatement feature: either pass the rest = FALSE argument when creating attributes/knots, or set options("am.restatability" = FALSE) at the start to disable it globally.
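A minimal configuration sketch of both options, assuming the rest argument and the am.restatability option behave as described above (the model-building calls mirror the ones used earlier in this document):

```r
library(anchormodeling)

# globally, before defining any attributes/knots
options("am.restatability" = FALSE)

# or per attribute, at creation time
am2 <- AM$new()
am2$add$A(mne = "AU", desc = "Author")
am2$add$a(mne = "NAM", desc = "Name", hist = TRUE, rest = FALSE, anchor = "AU")
```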
3NF views are not yet ready.
am$view("AU", type = "current") # default type
am$view("AU", type = "latest")
am$view("AU", type = "timepoint", time = as.POSIXct("2011-01-01 00:00:00", origin = "1970-01-01"))
# am$view("AU", type = "difference", time = c(as.POSIXct("2012-01-01 00:00:00", origin = "1970-01-01"),
#                                             as.POSIXct("2012-02-01 00:00:00", origin = "1970-01-01")))
Try each of the commands below.
am
am$etl # you should notice your OS user name on `meta==1L`
am$log
am$IM()
# saving
am$stop()
save(am, mapping, extract.commits, extract.numstats, file = "git-am.RData")
# now you can shut down your machine
rm(am, mapping, extract.commits, extract.numstats)
# loading
load("git-am.RData")
am$run()
am$xml()
The file can be loaded in the official Anchor Modeler test instance: roenbaeck.github.io/anchor
After loading the XML file, don't forget to click the Play button.
The exported XML does not contain non-model fields such as data types. It also does not export the restatability definition, since in the current XML schema that is part of the model metadata. This may change in Anchor Modeling, see anchor#4.
The current model may look like this.
If you don't like the layout of your model, use the Layout menu and the Release all fixed or Randomize layout options.
commits2 <- extract.commits("commits2014.json")
am$load(mapping, commits2, meta = list(meta = 3L, src = "git log"))
am
am$etl
am$log
am$IM()
Now add per-file insertions and deletions.
The file-change anchor (FI) is built on a composite key.
am$add$A(mne = "FI", desc = "FileEvolution")
am$add$a(mne = "PAT", desc = "Path", anchor = "FI")
am$add$a(mne = "INS", desc = "Insertions", anchor = "FI")
am$add$a(mne = "DEL", desc = "Deletions", anchor = "FI")
am$add$t(anchors = c("FI","CM"), roles = c("changed","in"), identifier = c(Inf, Inf))
mapping.numstat <- list(
    FI = list(c("commit","path"),
              PAT = "path",
              INS = "insertions",
              DEL = "deletions"),
    CM = list("commit"),
    FI_CM = list()
)
am$run()
numstats1 <- extract.numstats("numstats2014.json")
am$load(mapping.numstat, numstats1, meta = list(meta = 4L, src = "git log"))
am$view("FI")
am
am$etl
am$log
am$IM()
data.table was designed from the early days to handle time series data well. Thanks to its clustered key it can perform several types of joins very efficiently: equi joins, rolling joins (cross apply / cross join lateral), and overlapping joins (range joins); non-equi joins can be expressed as an overlapping join or as a cross join plus filter. See the implementation doc for details.
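A small self-contained sketch of the rolling join mentioned above (the tables and values are made up for illustration):

```r
library(data.table)

# made-up price quotes and lookup times, keyed on time
quotes  <- data.table(time  = as.POSIXct(c("2014-01-01 09:00:00", "2014-01-01 10:00:00"), tz = "UTC"),
                      price = c(100, 105), key = "time")
lookups <- data.table(time  = as.POSIXct("2014-01-01 09:30:00", tz = "UTC"), key = "time")

# roll = TRUE carries the most recent quote forward to each lookup time
res <- quotes[lookups, roll = TRUE]
res
```

The same last-observation-carried-forward pattern is what makes point-in-time ("timepoint") views over historized attributes cheap to compute.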
Model any data. Your model can consist of even a single anchor. A model can grow non-destructively over time, so you can add more entities later!
This document showed an example of loading git repository data; you can reproduce the same process for any git repository you want.
If you already have some DWH data, you can start with just a minimal subset of it.
In development, ready for testing.
You are welcome to add your own unit tests, which can be included in the automated unit testing run on package build.