Lightweight data provenance tracking. More comprehensive/involved approaches include recordr, rProvenance, and RDataTracker.
Currently this works as follows:
Each object to be tracked has an attribute "provenance" attached to it once via createProvenance
. This attribute essentially holds the provenance lookup name, and is a combination of the object's name and date/time.
The lookup name is used to pull out an element from a .provenance_list
list, currently stored by default in the global environment. For example if object x
gets assigned a "provenance" attribute of "x_12345", then .provenance_list[["x_12345"]]
holds the provenance record.
This record is a data frame with timestamp, caller, message, and digest (MD5) of x
at that point in time. updateProvenance
only changes this, not the original object x
, so avoiding (I hope) expensive copies.
The disadantage of this approach is that if x
is deleted, its record just hangs around; and if y <- x
then y and x will share the same record. Ugh, hadn't thought about this last case.
* Finally, note that this is opt-in provenance tracking. That is, we have to update the record for x
manually each time a change is made. This is by design (lightweight...) but worth considering.
> library(dataprov)
> x <- 1
> x <- createProvenance(x)
Creating provenance list
> updateProvenance(x, "created x and its provenance record")
> x <- x + 1
> updateProvenance(x, "added 1")
> provenance(x)
timestamp caller message digest
1 2015-08-21 14:26:14 <NA> created x and its provenanc... 5fd7a492
2 2015-08-21 14:26:27 <NA> added 1 d934fb15
> listProvenances()
Object Creation Entries
1 x 1440181552.45085 2
Data provenance broadly refers to a description of the origins of a piece of data and the process by which it arrived in a database or analysis (see e.g. http://db.cis.upenn.edu/DL/fsttcs.pdf). This package implements a simple system for tracking operations performed on any R object. It's not automatic though--you have to call updateProvenance()
each time you want to put a new entry into the provenance.
The provenance is stored as an object attribute (currently a data frame, but don't depend on this--use the accessor functions below!). Available functions include:
Public
createProvenance(x, env = .GlobalEnv)
- attach a "provenance" attribute to x
, create the provenance_list if necessary, and create an empty provenance record. Necessary before any of the functions below can be called.
updateProvenance(x, message, caller = NULL, env = .GlobalEnv)
- add a provenance entry
provenance(x, n = NULL, env = .GlobalEnv)
- gets the entire data frame, with attribute "dataprov". If n
is supplied, return a list corresponding to the nth entry
print.dataprov()
- summarize a data provenance - abbreviating all fields (digest, message, etc) for easy reading
* listProvenances(env = .GlobalEnv)
- summary list of current provenances records
Private
dataprov() - object creation
replaceProvenance(x, p)
- replace entire provenance of x
with p
???
mergeProvenance(a, b)
- merge provenance of b into that of a. Ugh. Useful?
Could offer an option to save x
each time the provenance is updated? Not a version 1.0 feature I don't think
* provenance(x) <- p
- replaces entire provenance of x
with p
(must be data.frame with correct fields)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.