library(knitcitations)
Being able to generate all figures and plots for an academic submission from raw data can be considered the "gold standard" of reproducible research.[^1] By automating the process, it is possible to easily verify reproducibility at any stage during the analysis. Automation also allows easy recreation of the entire analysis based on modified inputs or model assumptions. However, rerunning the entire analysis starting from raw data soon becomes too time-consuming for interactive use. Caching intermediate results alleviates this, but introduces the problem of cache invalidation[^2].

R packages offer everything necessary to conduct statistical analyses:
they can store data, code, and written documentation (in the form of
vignettes). Recent efforts have considerably simplified the packaging
process in R, and there is first-class support for it in R and RStudio
(and probably in many other environments).

The `rpkgweb` package offers a framework where a statistical analysis
can be distributed over several interdependent packages, each serving a
dedicated purpose (e.g., holding raw data, munging data, input
validation, modelling, analysis, reporting, ...). Dependencies between
packages are specified as usual in the `DESCRIPTION` file.
The framework tracks which downstream packages need to be rebuilt if a
package changes, and allows updating mutually independent packages in
parallel. Interactive work is possible within the context of each
package, while maintaining full reproducibility[^4].
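
As an illustration of the dependency tracking described above, the following minimal sketch reads the `Depends` and `Imports` fields of each `DESCRIPTION` file in a directory of packages and lists the packages downstream of a changed one. It is not the implementation used by `rpkgweb`; the package names (`rawdata`, `munged`, `report`) and the helpers `write_description()`, `read_deps()`, and `downstream()` are hypothetical and exist only for this example.

```r
# Hypothetical package web: a temporary directory with three packages,
# each consisting of nothing but a DESCRIPTION file.
web <- file.path(tempdir(), "web_sketch")
unlink(web, recursive = TRUE)

write_description <- function(pkg, imports = character()) {
  dir.create(file.path(web, pkg), recursive = TRUE)
  fields <- c(
    paste("Package:", pkg),
    "Version: 0.0.1",
    if (length(imports) > 0) paste("Imports:", paste(imports, collapse = ", "))
  )
  writeLines(fields, file.path(web, pkg, "DESCRIPTION"))
}

write_description("rawdata")
write_description("munged", imports = "rawdata")
write_description("report", imports = c("munged", "knitr"))

# Read the packages listed in Depends and Imports of one package.
read_deps <- function(pkg_dir) {
  desc <- read.dcf(file.path(pkg_dir, "DESCRIPTION"),
                   fields = c("Depends", "Imports"))
  deps <- unlist(strsplit(desc[!is.na(desc)], ","))
  # Drop version requirements such as "(>= 1.0)" and surrounding whitespace.
  deps <- trimws(sub("\\(.*\\)", "", deps))
  setdiff(deps, c("R", ""))
}

pkgs <- sort(list.dirs(web, full.names = FALSE, recursive = FALSE))
deps <- lapply(file.path(web, pkgs), read_deps)
names(deps) <- pkgs
# Keep only dependencies that live inside the web itself.
deps <- lapply(deps, intersect, pkgs)

# All packages that depend, directly or indirectly, on `changed`.
downstream <- function(changed, deps) {
  found <- character()
  queue <- changed
  while (length(queue) > 0) {
    current <- queue[1]
    queue <- queue[-1]
    is_direct <- vapply(deps, function(d) current %in% d, logical(1))
    new <- setdiff(names(deps)[is_direct], found)
    found <- c(found, new)
    queue <- c(queue, new)
  }
  found
}

downstream("rawdata", deps)
```

In this example a change to `rawdata` marks `munged` and `report` for rebuilding, while a change to `report` affects no other package. The external dependency `knitr` is ignored because it is not part of the web.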

[^1]: Numerous contributions to the R ecosystem are geared towards this aim, including but not limited to interfaces to other data formats, weaving of code and text, and of course the immense variety of statistical modelling tools.

[^2]: According to Phil Karlton, cache invalidation is one of the two hard things in computer science.

[^4]: Reproducibility can be continuously monitored by using version control systems and continuous integration.

The `ProjectTemplate` package `r citep(citation("ProjectTemplate"))`

- `ProjectTemplate`
- `mvbutils`
- `modules`
- `memoise`
- R blog

- Usability
- Automation
- Reproducibility of results
- Integration with existing infrastructure and workflows
- Interactive processing
- Scalability
- Large datasets
- Caching
- Parallelization
- Conflict resolution

- One package, one concern
- Package web: Directory with several packages
- Dependencies via `DESCRIPTION`
- Main verbs
    - `load_all`
    - `test`
    - `bump` (or `uninstall`)
    - `check_up` (= purge revdeps + `test` + `install`)
- Downstream dependencies: `Makefile`
- Additional verbs
    - `document`
    - `check`
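
One way to picture how downstream dependencies could be expressed in a `Makefile` is to generate one rule per package, with the upstream packages as prerequisites. The sketch below is an assumption for illustration and not necessarily the layout `rpkgweb` generates: the stamp file name `.installed`, the recipe, the dependency list `deps`, and the helper `make_rule()` are all hypothetical.

```r
# Hypothetical dependency list: names are packages in the web,
# values are their upstream packages within the same web.
deps <- list(
  rawdata = character(),
  munged  = "rawdata",
  report  = "munged"
)

# One rule per package. The target is a stamp file (assumed name ".installed")
# that is refreshed whenever the package has been installed successfully.
make_rule <- function(pkg, upstream) {
  target <- file.path(pkg, ".installed")
  prereqs <- c(
    file.path(pkg, "DESCRIPTION"),
    paste0("$(wildcard ", pkg, "/R/*.R)"),
    file.path(upstream, ".installed")  # empty if there is no upstream package
  )
  c(
    paste0(target, ": ", paste(prereqs, collapse = " ")),
    # Recipe lines in a Makefile must start with a tab.
    paste0("\tRscript -e 'devtools::install(\"", pkg, "\")'"),
    "\ttouch $@",
    ""
  )
}

makefile <- c(
  paste("all:", paste(file.path(names(deps), ".installed"), collapse = " ")),
  "",
  unlist(Map(make_rule, names(deps), deps), use.names = FALSE)
)

cat(makefile, sep = "\n")
```

With rules of this shape, `make` reinstalls a package only if its own files or one of its upstream stamp files are newer than its stamp, and `make -j` can process packages that do not depend on each other at the same time.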

- Defining functions: In `R` folder
- Executing code
    - `load_all`
    - `test`
    - `install`

- Raw data
    - `use_data()`
- Validation, checking assumptions
- Generating data (munging, intermediate or final results)
    - `build`
    - `use_data()` during `test()`, perhaps in a child package
    - `R_TESTS` -> custom "testing" framework that runs when started from `devtools` but not with `R CMD check`

- Configuration
- Plots and tables
    - `build`
    - `test`
- Documentation
    - `roxygen2`
    - `test`
    - `check` -- no `DESCRIPTION` found
    - `R_TESTS` environment variable, don't build if not empty
    - `build_vignettes`
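
The `R_TESTS` items above can be sketched as a small guard. The sketch assumes, as the notes suggest, that `R_TESTS` is non-empty while `R CMD check` runs the tests and empty when the tests are started from `devtools`. The names `should_build()`, `build_results()`, and `build_if_allowed()` are hypothetical and used only for this example.

```r
# TRUE if R_TESTS is unset or empty (assumed: not running under R CMD check).
should_build <- function() {
  !nzchar(Sys.getenv("R_TESTS"))
}

# Hypothetical stand-in for an expensive step that generates data or plots.
build_results <- function() {
  message("Building results ...")
  invisible(TRUE)
}

# Run the build step only when the guard allows it.
build_if_allowed <- function(build = build_results) {
  if (should_build()) build() else invisible(FALSE)
}

build_if_allowed()
```

Keeping the guard in one function means the build step itself stays unaware of how it was started, so it can also be called directly during interactive work.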

- Start with raw data packages
- Add dependent cleanup and munge data packages
- Add packages that check assertions
- Reiterate as necessary!

- Scripts in `R`
- Reiterate with `load_all()`
- Extract functions
    - `roxygen2`
- Extract tests
- If loading/testing a package becomes too slow for interactive work, split it
- Reiterate as necessary!
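
The step from script to function to test listed above might look as follows. This is only a sketch with made-up data: the data frame `raw` and the function `drop_missing()` are hypothetical, and the final check uses base R's `stopifnot()` where a package would more likely place an equivalent expectation under `tests/testthat`.

```r
# Before: a step that lives inline in a script (hypothetical example data).
raw <- data.frame(id = 1:4, value = c(2.5, NA, 4.0, 1.5))
clean <- raw[!is.na(raw$value), ]

# After: the same step extracted into a function with roxygen2 comments,
# ready to be saved in the package's R folder.

#' Drop rows with missing values in one column
#'
#' @param data A data frame.
#' @param column Name of the column to check, as a string.
#' @return `data` without the rows where `column` is `NA`.
#' @export
drop_missing <- function(data, column) {
  data[!is.na(data[[column]]), , drop = FALSE]
}

# Extracted test: the function must reproduce the result of the script.
stopifnot(identical(drop_missing(raw, "value"), clean))
```

Once the function and its test exist, the script line can be replaced by a call to `drop_missing()`, and the next iteration starts again with `load_all()`.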

- New package
- All results for presentation should be available in dependent packages
- Write documentation in vignettes
- Change upstream packages if further results are needed

- `Makefile`
- RStudio + `devtools`
- Git
- CI systems
- `packrat`
- `make -j ...`

- Package development
- Integration with other `Makefile`-s?
- `Makefile`
- Package generation must be cheap -- `mason`
- Multi-project IDE -- not yet (RStudio?)
- Constantly switching between `R` and `tests/testthat`
    - `use_testthat_symlink`