library(knitr) library(utilities) knitr::opts_chunk$set(collapse = TRUE, comment = NA, prompt = FALSE, echo = TRUE)
options(crayon.enabled = TRUE) knitr::knit_hooks$set(output = utilities::ansi_handler) knitr::knit_hooks$set(message = utilities::ansi_handler) # try old.hooks <- set_knit_hooks(knitr::knit_hooks) original_output_hook <- knit_hooks$get("output") knit_hooks$set(output = utilities::create_trimming_hook(original_output_hook))
options(chronicler.attach = FALSE) library(chronicler)
library(stringi)
chronicler
collects all research artifacts (plots, data, models, code)
and their lineage in a file-system-based repository. Such data becomes
handy when you need to to verify the origin (lineage) of an artifact or
browse the artifacts to restore a certain past state of research.
chronicler
has been designed specifically to introduce only minimal
overhead and to provide additional help only when the R tools of choice
are not sufficient. chronicler
does not require any changes to the
research process, personal habits of preferences. Users can produce and
organize the results of their work however they choose, and interact
with chronicler
only when they need to retrieve past data.
This document presents the few foundational concepts of chronicler
and provides a few code examples to illustrate their relevance in data
analysis.
All examples and data sets presented in this document come with the
chronicler
package. In order to start working with the Iris model
example, run the following:
library(chronicler) attach_with_wizard(iris_model())
chronicler
comes with a number of dependencies. If you want to give
chronicler
a try without installing them all, check out the other
vignette, "Running examples".
Data exploration and modelling are messy. We start with a bunch of data, we make a lot of attempts, poke around, do our best to learn anything we can about that data set, and finally produce a predictive model, a recommendation for decision makers or simply a description of what is in the data.
So, what is the "problem"? Why is data exploration "messy", again? It boils down to one thing: complexity of the learning process. Before the final outcome is known, a typical researcher will build many intermatied models, a variety of transformed data sets, a multitude of plots. Some of them will be useful as intermediate steps towards something more refined, but many are useful only because they confirm that certain ideas are wrong.
knitr::include_graphics("graphics/all-together.png")
OK, but why is that a problem at all? It might be, and usually is, because data exploration is also iterative, which means frequent back-and-forth between variuos ideas. And that means you have to have something to return to: code, data, models, plots. But since you never know what might turn out handy, it's better to store everything, or at least as much as possible. But that means a lot of "everything", and then, where to store it? Not to mention making sure that this piece of code produces that particular object or plot. Data exploration is as much about harnessing chaos as it is about actual data exploration. Hence, the messiness.
knitr::include_graphics("graphics/stack.png")
There are, of course, multiple tools and techniques aimed at exactly this problem. RMarkdown, R and Jupyter notebooks, code versioning (git, svn, etc.), to name a few tools, or literate programming and packaging code with unit tests to name some techniques. Here we want to propose something slightly different, which combined with existing tools can help with the challenge of organizing artifacts and, thus, knowledge, acquired in a data exploration process.
The approach described here, and implemented in the chronicler
R package,
is built around two main components: an R object tracker and a search
engine. However, because we track not only R objects created and
transformed in R session, but also code, plots and console printouts[^printouts],
from now on we will be talking about artifacts and not objects.
[^printouts]: Support for console printouts is coming in the next
version of chronicler
.
Let's start with the artifact tracker. It will be easier to explain its
inner workings if we have some sample data already collected. Thus, we
start by loading the package and building a very simple regression model.
To find more details about this example open ?iris_model
manual page.
rc <- knitr::read_chunk rc(system.file('scripts/iris-model.R', package = 'repository'))
library(chronicler)
The following plot explains how R objects and plots created by our sample code are related to artifacts stored in the repository. Notice how the number of artifacts grows with each R command executed, and how at the same time some of the objects in R session (more specifically, in R's global environment) are replaced by new ones which share the same name. This is exactly the point of the tracking the R session: to collect and preserve all artifacts (R objects, plots, printouts, ...).
Notice that each artifact is assigned an unique identifier; in our diagram these are denoted as four hexadecimal digit which follow artifact name. Each command tracks its parent which means the complete lineage[^lineage] of each artifacts can be retrieved upon user request.
The tracker itself is a callback function registered in R session with
the standard R API, the addTaskCallback()
function from the base
package. Thus, the tracker can be loaded and executed in R out-of-the-box,
without any additional configuration or changes to R itself.
[^lineage]: Oxford Dictionary of English: lineal descent from an ancestor; ancestry or pedigree. All the parent artifacts and commands used to create the given artifact are its lineage.
Tracking artifacts would be quite useless without a proper method to
retrieve those relevant to the ongoing research. One of the core
assumptions in the design of chronicler
is that it should fit into any
given way of doing research in R. This assumption led us to choose a
file-based search mechanism.
A user points to a file - a plot, a data set, a serialized model - and
chronicler
finds all artifacts matching the contents of that file.
These artifacts can be then displayed in their specific context, which
includes the R command which created them, their lineage (parent
artifacts).
chronicler:::attach_to_repository(iris_model())
First, a trivial object search. The iris
data set is assigned to
variable x
in the first line of the example.
identify_artifact(iris)
Artifact 2b0c3bb1
is indeed a copy of the iris
data set and a quick
look at the expression confirms that it is the result of the first command
in our example.
Now we will search for a plot stored in a file. chronicler
contains
such plot, one of the artifacts in the iris_model()
example. Let's
find the path and see the plot itself.
path <- system.file('artifacts/iris-predictions.png', package = 'chronicler')
cat(sprintf("![Figure 3: Plot stored under chronicler/artifacts/iris-predictions.png](%s)", path))
Searching by file contents reveals another artifacts, this time the
first of the two plots from our example. Let's also see what can we
learn about the context in which that plot was run: explain()
prints
a few entries from the history leading up to the artifact in question.
identify_artifact(path) explain(identify_artifact(path))
So, with chronicler
tracking your work, you will always be able to
find the exact sequence of commands that led to a plot or a model
sitting somewhere in your project's folder.
If you are less interested in searching artifacts that match your data or plots, and you would rather just explore artifacts stored in the repository, here are a few ways to take a look at them.
All artifacts stored in the repository are directly available in the R session. If you happen to know the name of the artifact, as it was assigned upon creation, all you need to do is this:
artifacts$m
If there is more than one artifact matching that name, chronicler
will
tell you so and ask to be more precise.
artifacts$x
Sometimes, which is the case with plots, the artifact doesn't have a name. On those occasions you can use that artifact's unique identifier (surrounded by backticks, since identifiers are not proper R names). Where to get that identifier from, though? Try this shortcut:
artifacts$plots
Now it's quite straightforward - we simply copy the second value
printed in green and feed it to the artifact
object.
artifacts$`909a1c6d`
If you want to see the actual plot, pass the object to the plot
function,
like so:
artifacts$`909a1c6d` %>% plot # same as plot(artifacts$`909a1c6d`)
Of course, all artifacts can be accessed by their identifiers, not
only plots. Thus, in our case artifacts$m
is the same as
artifacts$dcd273b7
.
What if you don't know the name or id of a given artifact? Well, in that case you might want to use what you do know and build a query that will narrow down on what you're looking for.
The artifacts
objects does not only have a set of convenient shortcuts
but also gives access to an interactive query builder. Let's see what
will the default help screen for artifacts
tell us.
artifacts
"Query keys" mentioned in at the bottom of the help message are names
which we can pass via the dollar-sign operator to the query builder.
If you still remember, we got an error message calling artifacts$x
which suggested calling artifacts$name$x
instead. That is exactly
the query builder, so let's give it a try now.
artifacts$name$x
All right! So what do we see here? The message has three parts: first, it shows the actual query used to retrieve artifacts. This is followed by the descriptions of the first three artifacts that match the query. Finally, we ca see a simple footer with the number of artifacts not shown and other query keys which can be used to further narrow down the set of artifacts.
Currently, chronicler
supports the five following query keys:
class, id, name, session and time. When building the
query, you can print the key without a value to see the help message.
The message will show the name of the tag that needs to match that part
of the query and a list of possible values with the number of matching
artifacts given in parentheses. Let's try that with artifact names:
artifacts$name
So this is exactly how we arrived at artifacts$name$x
in the first
place.
What about other query keys? class
, id
and session
follow the same
pattern: class
needs to match one of the classes declared by the artifact,
id
needs to be an exact match to an artifact identifier, and session
points to the R session in which a given artifact was created. In our
example there is only one R session but chronicler
supports spanning
your analysis throughout multiple sessions.
Let's see the help messages and the actual query results for these three query keys.
Help message:
artifacts$class
And a sample query.
artifacts$class$data.frame
Help message lists all identifiers.
artifacts$id
Sample query retrieves always a single artifact.
artifacts$id$`0f1105f2`
Each R session is assigned an unique identifier.
artifacts$session
All artifacts created in that R session can be retrieved like so:
artifacts$session$`615195b3`
The time query key is a little different. It supports a number of values that are not present directly in artifacts data but can be computed from that data. Let's first take a look at the help message.
artifacts$time
What follows is a sample query. However, since the example repository was created too long ago, there will be no matching artifacts.
artifacts$time$yesterday
We took a look at the query builder. Our example, iris_model()
, is very
small but in general a query can match many artifacts, too many to browse
efficiently. That is when we might want to see the relationships between
artifacts. They can be printed out by two special query keys: tree
and history.
These two keys did not appear in the help messages we have seen so far, and for a reason: they are used only for printing, never for searching. However, they do appear among the supported values in the tab-completion help screen (or RStudion's tab-completion help-dialogue).
Let's start with the history printout style. Since it requires a query, we start by choosing the only R session present in the example, which effectively chooses all artifacts.
artifacts$session$`615195b3`$history
What about the tree style? It shows the same artifacts, but this time they form a tree. Parent artifacts are placed closer to the root, and the edges of the tree connect artifacts sharing a lineage.
artifacts$session$`615195b3`$tree
What if we would like to see only data frames in the repository? Here are the data frames and their lineage.
artifacts$session$`615195b3`$class$data.frame$tree
chronicler
supports one more way of accessing artifacts: through an
index in the query result. Let's retrieve the first and the second
artifact whose class matched data.frame
.
artifacts$class$data.frame[[2]]
And this is how we can access the data.frame
object itself.
artifacts$class$data.frame[[2]]$value
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.