library(knitr)

knitr::opts_chunk$set(collapse = TRUE, comment = "#>", prompt = FALSE, echo = TRUE)

Problem description

Data exploration projects can span multiple weeks or months and result in multiple artifacts (plots, models, data sets, code) being produced. These artifacts serve as evidence for a narrative about the data as it is developed and later reported back to its consumers: fellow researchers, the general public, business stakeholders. Hence the need to organize, categorize and prioritize these artifacts. Categories and priorities, however, are never ultimately settled or known, because the narrative is by its nature open-ended. Therefore, the need to organize and reorganize artifacts as their number grows, or as the narrative is adjusted, becomes the most important of the three.

In general, there is a multitude of ways to keep track of the evolution of data research. The following summary aims to underline certain aspects of record keeping that will help explain the goal of the approach proposed in this document.

When considering means of organizing research artifacts in R, we can identify a number of general approaches to the problem. The most basic one is to store artifacts in the filesystem and use the directory structure and the file and directory names as the tools for keeping track of the research and its important findings. Data, plots, models and code are thus kept as separate files, and the burden of maintaining order and consistency among them (that is, ensuring that the code can actually reproduce the plots, models and data) rests on the researcher. The commentary, which explains the meaning and relationships of the artifacts, is kept alongside the artifacts themselves or in one or more separate notebooks, electronic or physical.
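As an illustration, here is a minimal sketch of this filesystem-based workflow; the directory layout, the file names and the example data set are assumptions made purely for illustration, not something prescribed by any tool:

library(ggplot2)

# a dated directory collects the artifacts produced on a given day
dir.create("artifacts/2019-05-20", recursive = TRUE, showWarnings = FALSE)

# data sets and models are serialized as separate files
saveRDS(mtcars, "artifacts/2019-05-20/mtcars-cleaned.rds")
model <- lm(mpg ~ wt + hp, data = mtcars)
saveRDS(model, "artifacts/2019-05-20/lm-mpg-wt-hp.rds")

# plots are written out as image files
p <- ggplot(mtcars, aes(wt, mpg)) + geom_point()
ggsave("artifacts/2019-05-20/mpg-vs-wt.png", p)

# the commentary is maintained by hand in a separate notes file
writeLines("mpg drops with weight; hp adds little on top of it",
           "artifacts/2019-05-20/NOTES.txt")

Nothing in this layout guarantees that the serialized objects, the plot and the notes stay consistent with the code that produced them; that guarantee rests entirely with the researcher.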

Another approach is to use literate programming: mix commentary, code and the results of code execution (plots, models, data) in a single document. One example of that approach is the knitr package: a document contains free text and code excerpts; during "compilation" the code is run and its results are embedded in the document as they would appear in the R console. Because the only input is the code (and the occasional source data), the document is always consistent, and each artifact embedded in the final document can be recreated and its origin proven.
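A minimal R Markdown document in the spirit of this approach might look as follows; the title, chunk names and the example model are assumptions used only for illustration:

---
title: "Fuel efficiency"
output: html_document
---

Weight appears to be the strongest single predictor of fuel efficiency.

```{r model}
model <- lm(mpg ~ wt, data = mtcars)
summary(model)$coefficients
```

```{r plot}
plot(mpg ~ wt, data = mtcars)
abline(model)
```

Knitting such a file re-runs every chunk, so the embedded coefficients and plot can always be reproduced from the code and the source data.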

Yet another approach is to create an R package or a Shiny application and let the consumers of the narrative interact with it. This takes the longest time and the most significant effort, and is therefore the means of choice only once the narrative has stabilized.
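As a sketch only, assuming the narrative revolves around a simple linear model of the mtcars data, a minimal Shiny application that lets a consumer interact with one of its findings could look like this:

library(shiny)

ui <- fluidPage(
  titlePanel("Fuel efficiency vs. car weight"),
  sliderInput("wt", "Car weight (1000 lbs)", min = 1.5, max = 5.5, value = 3, step = 0.1),
  plotOutput("fit")
)

server <- function(input, output) {
  model <- lm(mpg ~ wt, data = mtcars)
  output$fit <- renderPlot({
    plot(mpg ~ wt, data = mtcars)
    abline(model)
    # highlight the model's prediction for the weight chosen by the consumer
    points(input$wt, predict(model, data.frame(wt = input$wt)), col = "red", pch = 19)
  })
}

shinyApp(ui, server)

Even a toy application like this needs its own user interface and server logic, which illustrates why this route usually pays off only after the narrative has settled.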

TODO time and effort spectrum: which methods are more adequate in early research and which in later stages; the risks of losing consistency; the important dimensions: consistency, completeness, selectiveness; time and effort required to maintain the (living) record

Abstract

Recent developments in the R community can be grouped as follows:

  - extending and simplifying the programming interfaces (tidyverse)
  - support for big data sets (e.g. interfaces to Spark)
  - new algorithms to model interactions in data
  - reproducible programming

This paper explores a direction that to some extent overlaps with the last category: bookkeeping (recording user interactions and their results) and reporting (presenting the historical records back to the user) in a data project.

It starts with the premise that a considerable amount of time and effort in a data project is spent on recording and browsing findings.

The reproducible programming movement has produced tools that are extremely useful when findings are to be reported to consumers other than the researcher who produced them in the first place.

There seems to be a certain lack of tools, however, that serve the purpose of reducing the operational burden on the researcher.

It is important to note that in a project spanning days or weeks, the task of reconstructing the context of questions asked and results obtained earlier is not trivial and requires, at the very least, consideration and time.

We also start with the premise that the most important aspect of any data project takes place in the mind of the researcher and takes the form of a narrative about the data: the interactions and patterns present in the data, explanations of those patterns and interactions, and ways of modelling them for the purpose of prediction or forecasting.

A researcher is expected to maintain a record of their thoughts and ideas, which may, but does not have to, reside together with the code and data.

Typically, tools designed for other use cases are repurposed for this task: markdown notebooks, scripts with comments, sets of plots written to the filesystem, PDF reports, etc.

Although these tools perform very well when it comes to reporting, using them to bootstrap another day of work is less straightforward: one needs to locate objects (data, plots) and code relevant to their ongoing work and then use them (collect, copy, re-run) as the starting point for the current part of the project.

R has basic support for bookkeeping: it stores the history of commands, preserves the state of the session, and implements serialization and deserialization of arbitrary objects (code, data, plots) in the filesystem. The meta level, however, which has to do with the relationships between these objects as well as their relative importance, is not recorded by these mechanisms.
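These base mechanisms can be illustrated with a short sketch; the file names below are assumptions:

# the command history of an interactive session can be saved and restored
savehistory("analysis.Rhistory")
loadhistory("analysis.Rhistory")

# the entire global environment can be preserved between sessions
save.image("analysis.RData")
load("analysis.RData")

# individual objects of any kind can be serialized and deserialized
model <- lm(mpg ~ wt, data = mtcars)
saveRDS(model, "model.rds")
model <- readRDS("model.rds")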

The R user is expected to do that work - maintain a comprehensive view and understanding of the artifacts produced within the project - on their own.

This is where our package aims to help: it supports the user in recording, browsing and presenting information about artifacts, the network of relationships between them and, from this, the overall narrative about the understanding and applicability of the knowledge obtained in the project.

Abstract 2

Unless and until knowledge is internalized, it has to be organized in an intuitive structure which tends to be a personal choice. Be it a report, a mind map, a compiled notebook or a series of plots, knowledge needs to be collected and presented repeatedly in order to continue research and share it with peers.

Three phases:

REPOSITORY CANNOT REQUIRE PRE-PLANNED INTERACTIONS, EVERYTHING NEEDS TO HAPPEN ON AN AD-HOC BASIS

  1. store and keep all artifacts produced in a session; after a restart, re-attach to the store and make it simple to keep track, either by choosing the most recent commit as the starting point, or by creating a brand-new commit out of the contents of the session and working from there

  2. make it easy to see all that has happened so far and to re-build the context after a break, be it overnight between two days of work, over a weekend, or after a longer pause when coming back to a project that was put on hold

re-building the context

but also make it simple to rewind the history, or to hold a number of artifacts in a temporary notebook... or maybe develop stories? Maybe position this whole thing as a STORY-TELLING tool? That would mean a browser based on the trees of objects (history/origin) plus a set of stories into which artifacts can be dumped in order to maintain fast access to the currently important objects

  3. ad-hoc reads and piping objects directly into ad-hoc code, e.g. tracker$objects$x %>% {lm(x ~ y, data = .)} (see the sketch below)
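Below is a sketch of what such ad-hoc access could look like; tracker is only a stand-in here (a plain list of previously stored objects), not the actual interface of the package:

library(magrittr)

# stand-in for a store of artifacts recorded in earlier sessions
tracker <- list(objects = list(x = data.frame(x = rnorm(100), y = rnorm(100))))

# pipe a stored object directly into ad-hoc modelling code
fit <- tracker$objects$x %>% {lm(x ~ y, data = .)}
summary(fit)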

Supporting data exploration in R


