library(knitcitations)

Being able to generate all figures and plots for an academic submission from raw data can be considered the "gold standard" of reproducible research.[^1] By automating the process, it is possible to easily verify reproducibility at any stage during the analysis. Automation also allows easy recreation of the entire analysis based on modified inputs or model assumptions. However, rerunning the entire analysis starting from raw data soon becomes too time-consuming for interactive use. Caching intermediate results alleviates this, but introduces the problem of cache invalidation[^2].

R packages offer everything necessary to conduct statistical analyses: they can store data, code, and written documentation (in the form of vignettes). Recent efforts have considerably simplified the packaging process in R, and there is first-class support in R and RStudio (and probably in many other environments). The rpkgweb package offers a framework in which a statistical analysis can be distributed over several interdependent packages, each serving a dedicated purpose (e.g., holding raw data, munging data, input validation, modelling, analysis, reporting, ...). Dependencies between packages are specified as usual in the DESCRIPTION file. The framework tracks which downstream packages^3 need to be rebuilt if a package changes, and allows updating mutually independent packages in parallel. Interactive work is possible within the context of each package, while maintaining full reproducibility[^4].
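
The dependency tracking described above can be illustrated with a minimal sketch in base R. This is not the rpkgweb implementation: the package names, the `deps` list, and the helper functions `downstream()` and `build_waves()` are hypothetical and chosen only for illustration. The sketch assumes the dependencies between the packages of the web have already been read from the DESCRIPTION files (for example with `read.dcf()`).

```r
# Hypothetical dependency web: each entry lists the upstream packages
# of one package, restricted to packages that belong to the web.
deps <- list(
  rawdata = character(0),
  munged  = "rawdata",
  model   = "munged",
  survey  = "rawdata",
  report  = c("model", "survey")
)

# All packages that depend, directly or indirectly, on `changed`.
downstream <- function(deps, changed) {
  affected <- character(0)
  frontier <- changed
  while (length(frontier) > 0) {
    # Packages with at least one upstream package in the current frontier
    hit <- names(deps)[vapply(deps, function(up) any(up %in% frontier),
                              logical(1))]
    frontier <- setdiff(hit, affected)
    affected <- c(affected, frontier)
  }
  affected
}

# Group the packages to rebuild into "waves": the packages within one
# wave do not depend on each other and could be rebuilt in parallel.
build_waves <- function(deps, pkgs) {
  waves <- list()
  done <- setdiff(names(deps), pkgs)  # everything else counts as up to date
  while (length(pkgs) > 0) {
    ready <- pkgs[vapply(deps[pkgs], function(up) all(up %in% done),
                         logical(1))]
    if (length(ready) == 0) stop("cyclic dependency between packages")
    waves[[length(waves) + 1]] <- ready
    done <- c(done, ready)
    pkgs <- setdiff(pkgs, ready)
  }
  waves
}

to_rebuild <- downstream(deps, "rawdata")
waves <- build_waves(deps, to_rebuild)
```

In this toy web, a change to `rawdata` affects every other package, and `munged` and `survey` end up in the same wave because neither depends on the other.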

[^1]: Numerous contributions to the R ecosystem are geared towards this aim (including but not limited to interfaces to other data formats, weaving of code and text, and of course the immense variety of statistical modelling tools).

[^2]: (which, according to Phil Karlton, is one of the two hard things in computer science)

[^4]: (which can be continuously monitored by using version control systems and continuous integration)

Related work

The ProjectTemplate package `r citep(citation("ProjectTemplate"))`

Design goals

Framework

Package organization

Workflow

Data preparation

Reiterate as necessary!

Data processing

Reiterate as necessary!

Results

How design goals are achieved

Usability: Simple concept, packaged in a tested and documented package

Automation: Generation of a Makefile

Reproducibility: Clean and rebuild

Integration with existing infrastructure: Package as building block

Interactive processing: Work on one package at a time

Scalability: Use packages

Large datasets: Use packages

Caching: Use packages

Parallelization: make -j ...
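
As a rough illustration of the automation and parallelization items in the list above, the following sketch turns a dependency list into Makefile rules. It is an assumption about how such a Makefile could look, not the one rpkgweb actually generates: the package names, the `.stamp` file convention, and the helper `make_rules()` are hypothetical. For brevity, the rules omit the source files of each package as prerequisites.

```r
# Hypothetical dependency web (package -> upstream packages in the web).
deps <- list(
  rawdata = character(0),
  munged  = "rawdata",
  report  = "munged"
)

# Emit one rule per package. Each package is represented by a stamp
# file that depends on the stamp files of its upstream packages and is
# refreshed after the package has been installed.
make_rules <- function(deps) {
  rules <- vapply(names(deps), function(pkg) {
    # sprintf() returns character(0) for packages without upstream packages
    prereqs <- sprintf("%s.stamp", deps[[pkg]])
    target <- paste(c(sprintf("%s.stamp:", pkg), prereqs), collapse = " ")
    recipe <- sprintf("\tR CMD INSTALL %s && touch %s.stamp", pkg, pkg)
    paste(target, recipe, sep = "\n")
  }, character(1))
  all_target <- paste(c("all:", sprintf("%s.stamp", names(deps))),
                      collapse = " ")
  paste(c(all_target, unname(rules)), collapse = "\n\n")
}

makefile_text <- make_rules(deps)
writeLines(makefile_text)
```

Because the prerequisites are explicit, make can work out which targets are independent of each other and build them concurrently when invoked with the `-j` option.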

Example: Calibrating a survey

Other uses

Open questions

Limitations

Summary and outlook



krlmlr/rpkgweb documentation built on May 20, 2019, 6:18 p.m.