Reproducibility and accountability are increasingly demanded in statistical computing. This implies a need for new computing systems and environments that can transparently, easily, and robustly satisfy reproducibility and accountability standards. To this end, we have developed a system and R package called adapr (Accountable Data Analysis Process in R) that is built on the idea of accountable file provenance. \pkg{adapr} organizes analytic computations within a project as a directed acyclic graph (DAG) in which data, computer programs, and outputs are nodes. adapr works by integrating a version control system (Git), cryptographic hashes, and a dependency tracking database. The adapr package enables treatment of the analysis as a single unit that evovles over time as stored within the version control system. The implementation uses an R Shiny interface (adaprApp) creating an efficient user experience concealing the complexity and reducing overhead of tracking relative to standard data analytic coding.
An adapr project is a set of related R Scripts that conduct analyses related to data within a data directory or database. The project is a self-contained folder with all data, resources (with the exception of base R libraries, which are version tracked), and results.
In order to initially set up adapr, there is a helper
function adapr::default.adapr.setup()
that will ask a short sequence of questions (working directory, preferred username, version control preference, etc.) and will create the first adapr project adaprHome within the spedicified document directory of the computer. The setup function requires RStudio to work or requires the user to locate resources such as pandoc
automatically and sets environment variables accordingly. The setup function creates a directory at the top level of the user directories
called ProjectPaths
with two files projectid_2_directory_adapr.csv
and adapr_options.csv
that are required to locate projects and configure adapr, respectively.
Projects can be created within the R console using the initProject
function initProject("MyProject","Path to project","Path to publish directory")
or with the adapr App (launched by adaprApp()
). When a new project is created, the project is created as a single folder within a file system (OS X or Windows) with the following structure:
The fundamental unit of the project is an R script that has two corresponding elements. The first is a Results folder with the same name as the R Script. This Results Directory for each R script resides within the Results directory. The second is an R markdown file with the same name (with file extension .Rmd) within the Markdown directory (See above). A read_data.R
program is created automatically upon creation of the project. This creates a results directory (Results/read_data.R/
) and an R markdown file (2.3 Markdown/read_data.Rmd
).
In addition to the main project directory, each project is associated with a “Publish” directory that can be used to share results with collaborators. The Publish directory could be a Dropbox, file server or web server directory. A list of all projects can be obtained using the listProjects()
command. The function removeProject(project.id)
will remove the project from the project listing file, which dissociates the project from adapr, but does not remove the project from the file system. The function relocateProject()
tree will edit the location for which adapr will associate with the project,
and relocate.project()
can be used to identify projects built by others.
The following functions operate on the project level:
listProjects() #Returns the project listing
openProjectList() # Browses the project listing file for editing
initProject("myProject",project.path,project.publication.path) #Creates myproject in the following location
removeProject("myProject") # Removes the project from the project listing
relocateProject("myProject",project.path,project.publication.path) #Identifies or Import new project location, but does not move the project
setProject("myProject") #Changes options() so that myproject is the default project. Most functions use this as default project ID.
getProject() # Retrieves the working/default project. Used by other functions avoid respecification of the project ID
reportProject("myProject") # Creates a html report of all programs and I/O with descriptions and run times.
graphProject("myProject") # Displays the project DAG indicating which programs are out of synch
synctestProject() #Tests project synchronization
syncProject("myProject") #Synchronizes the project DAG
listScripts("adaprHome") # lists the programs in the adaprHome project with descriptions
showProject() # Browses/opens project folder in file system
commitProject("myProject",commitMessage) # Perform a git commit for version control
.
publishResults("myProject") #Copies selected result files to publication directory
Programs can be created by using makeScript
that takes three main arguments 1) project ID, 2) program name,
3) program description. Programs can be created, executed, listed, and removed using the following 4 commands:
makeScript("MyProgram.R","Produces little","adaprHome") # Creates program
runScript("MyProgram.R","adaprHome",logRmd=TRUE) # Executes Program and creates and optionally creates an R markdown log
listScripts("adaprHome") # lists the programs in the adaprHome project with descriptions
.
removeScipt("adaprHome","MyProgram.R") # Removes Program and Results
The adapr R scripts have a header that creates a list object source_info that contains the read/write and project information, and a footer that writes the read/write dependencies into a file corresponding to the R script within the Program/Dependency directory. The read/write information is collected within the body of the R script by wrapping the read/write commands within three R commands. The first is the read command:
Read(file.name = "data.csv", description = "Data file", read.fcn = guess.read.fcn(file.name), ...)
This command returns the file object read by the function read.fcn and tracks the description and the file hash with “…” passed as additional arguments to the read function. Note that in the usage example below, it is transparent and parsimonious, especially given that the Read function defaults to reading from the project’s data directory:
Dataset <- Read(“datafile.csv”,”my first dataset”)
The second command is the Write function:
Write(obj = NULL, file.name = "data.csv", description = "Result file", write.fcn = guess.write.fcn(file.name), date = FALSE, ...)
A usage case is given below, which writes the cars
dataset to the program's Result directory.
Write(cars,"cars.csv","dataset about cars from R")
This command writes the R object obj using the function write.fcn and tracks the description and the file hash with “…” passed as additional arguments to write.fcn
. If the file.name argument has a .Rdata extension then the R object is written to the results directory as an R data object. The last function:
Graph(file.name = "ouput.png", description = "Result file", write.fcn = guess.write.fcn(file.name), date = FALSE, ...)
This command writes the opens a graphics device for plotting using the function write.fcn and tracks the description and the file hash with “…” passed as additional arguments to write.fcn. The date variable is a logical that determines whether a date is added to the filename. This function is followed by a graphics command and closed by a dev.off() or graphics.off() statement. Below is an example.
Graph("histogram.png","hist of random normals") hist(rnorm(100)) graphics.off()
This code snippet creates a file in the results directory of the project called histogram.png with a histogram of 100 standard normal variables. If the Read/Write
functions are not used for reading/writing then the input/output can be still be tracked using the funtions ReadTrack/WriteTrack
. Another useful function is the following
LoadedObject <- loadFlex("read_data.R/mydata.rda")
This function reads in an R data object with the .rda
file extension written by another program. This function searches through all results from the project to identify file, and returns the corresponding R object (assigned to in LoadedObject example). The function follows Write(mydata,"mydata.rda","some data")
within another R script (read_data.R) to avoid reading the same file twice and to pass preprocessed data from one R script to another. Each R script ends with a footer
dependency.out <- finalize_dependency()
that collects the file hashes of reads and writes and writes the dependency information to the Programs/Dependency directory. IMPORTANT: If finalize_dependency()
function does not execute, then the R markdown file will not be rendered and the dependencies will not be available for use by other R scripts.
An R markdown file is created automatically when an R script is created within the Programs/Markdown directory. This .Rmd file is executed when the R script is executed and rendered as a side-effect of the finalize_dependency()
function. The rendered markdown resides within the Results directory corresponding to the R script. Additional R markdown files can be created using the create_markdown
function and rendered within the Results directory with the Render_Rmd
function. Executing the R script will render these markdown files.
Package management is needed for reproducibility. R packages needed for each R script can be installed in the usual manner if using the default R library. However, the adapr facilitates installation, loading, and tracking of packages. For project that use a project-specific library, the Library()
function combines the install.packages
function with the library
fuction. Library("ggplot2","cran")
will install the package ggplot2
from the cran
archive into the project specific library (if necessary) and then load the pacakge for use. The package versions are recorded within the program results directories using the session_info()
function from the devtools
package.
If the project is imported by another user, the data regarding the packages needed will be used by
the installLibrary
function to install all neede R packages. Alternatively, the needed R packages will be installed and loaded by the Library
function if it is used throughout the project.
All programs and results are summarized within a single HTML document with descriptions, dependency information, and links to programs and results. This information can be obtained within the Report tab of the app. After clicking the “Report” button, the HTML project summary is written to the Results/Tree_controller.R/
directory. The Report tab also is capable of launching project R Shiny apps that access project data or Results. The Report tab can also check the file provenance within projects. See Figure 3. The function create_program_graph(project.id)
will generate
a graph of the project with R scripts and their dependencies.
Two critical parts of accountability and reproducibility are addressed within adapr and the Synchronize tab of the app. First, the app can check the synchronicity of the project, and the function synctestProject("project.id")
reports the synchronization status of the project. That is, if the file modification times or the file hash (data fingerprint) of the input, output, or script file changed from the corresponding modification times or hash within the dependency file, then the project is out of sync or not synchronized. The functions and the app detect lack of synchrony and can execute the corresponding R scripts that are needed to achieve synchrony. The function syncProject("project.id")
will run all scripts needed for synchronizing. See Figure 4. Next, adapr uses the git version control system, which can be operated within the Synchonize tab. While adapr uses calls to the gitr package, Git must be installed for the Git search features to work. There are platform independent and free Git clients available for download (https://git-scm.com/downloads/guis). The commit button within the app takes a project snapshot of all R scripts and adds the synchronization status (synchronized or not) to the commit message. Likewise, commitProject("myProject",commitMessage)
will perform a git commit from the R console. Because the file names and file hashes are stored within the version control system any data or result file can be matched within the version control history to determine its provenance.
It is important to note that adapr automatically creates a project specific repository and adds all R script files into the project's Git repository, but it does not automatically use formal version control for data or results files because these files may be too large for the version control system to efficiently track the changes. However, the file hash is computed and stored within the Dependency
directory for all files that are read or written within the project so that all relevant files and their provenances can be identified if a commit was made when the file was read/written. File histories can be identified within the Run Report tab that executes a file hash search and returns
the Git commit associated with the file including the date, author, and file description.
The project can be committed (a snanpshot taken) with the commitProject(project.id,message)
command.
The Git history of a file is also obtained with the gitProvenance(project.id, filepath)
function which requires Git to be installed.
There is an interactive project map feature within the synchronize tab. The "Examine Program Graph" button displays the DAG of all the project's R scripts colored by there synchronization status. These scripts can be executed by clicking on the colored point, and their file I/O can be examined by selecting or "brushing" over the colored point. This reveals the file I/O graphic in an adjacent panel. These files are labeled by their descriptions and can be opened by clicking on the file or their file directories may be browsed by selecting or "brushing" over the colored file labels. A limitation is that very large DAGs or very large input/output lists for an R script can overwhelm the visual space of these graphics. This will be addressed in future versions of the app.
The specific results from the project or the entire project can be sent within the Send tab of the app. This tab creates a file within the Programs/Support_functions directory that contains a list of files that are selected for publication (i.e., copied to the project’s publication directory, which may be a shared server folder, a web server, or cloud server such as Dropbox). A button automatically sends these files to the publication directory. Also, the full project can be copied to the publication directory. The data directory is not sent by default to protect potentially sensitive data elements. There is a selector option for sending the data (or not) and a button for sending the full project. See Figure 5. There are 4 functions related to publishing:
getPubResults(project.id) # Get data frame of files for publication
publishResults(project.id) # Copy selected files to publication directory
browsePubFiles(project.id) # Browses publication table for editing
The current configuration options can be retrieved with getAdaprOptions()
. This function reads the information from the ProjectPaths
folder in the root directory. This location can be changed setting the R option adaprHomeDir
within the R profile. The adapr package can be configured with the setAdaprOptions()
. The setAdaprOptions(option,value)
function the first argument is the option and the second argument is the option value as a character string. The "git" option takes TRUE
or FALSE
depending on whether
git is used. For example, set_adapr_options("git","TRUE")
will modify the file
"adapr_options.csv". The "path" option indicates the paths for finding resources such as pandoc. There are also "project.path" and "publish.path" options that indicate default publication directories. Selecting the git user can be done with the gitConfigure(user.name,user.email)
function.
There are some convenience functions that can be used within the R console. These functions can be executed without specifying the project, if setProject(project.id)
is used to indicate the working project using the "adaprProject" R option within options
.
showProject()
and showResults()
will open the project or results directory with the file browser.
listBranches()
will list branches, orgins, and description that are available for loading.
listDatafiles()
will list the files and descriptions in the Data
directory.
listScripts()
will list the R scripts in the project.
searchScripts()
will search for a string in the set of R scripts and markdown files.
openScript()
opens script directly or selecting from a list of project scripts.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.