title: Using harvestr
for replicable simulations
author: Andrew Redd
vignette: >
%\VignetteIndexEntry{Using harvestr for replicable simulations}
%\VignetteEngine{knitr::rmarkdown}
%\VignetteEncoding{UTF-8}
...
library(harvestr) library(plyr) library(MCMCpack) library(dostats) print.data.frame <- function(x, ...){ print(ascii(df, include.rownames = FALSE), type = 'rest') }
The harvestr
package is a new approach to simulation studies
that facilitates parallel execution. It builds off the structures
available in the plyr
, foreach
and rsprng
packages. What harvestr
brings to the picture is abstractions
of the process of performing simulation.
The theme of harvestr
is that of gardening, stemming from the
idea that the pseudo-random numbers generated (RNG) in replicable
simulation come from initial states called seeds. Figure 1 shows the
basic process for harvestr
.
(workflow.png)[The basic harvestr
process]
The ideas are simple.
gather(n, {[
seed{]})} takes an integer for the number of
seeds to generate.\ Optionally, the seed can be set for replicable
simulations. This uses the rsprng
library to initialize
independent parallel random number streams.gather
are then fed into the
farm
function along with an expression to be generate data.
farm
returns a list of data frames each independently
generated under each of the rng streams.harvest
command, which takes the data from
farm
and applies an analysis function to the dataset. In the
case that the analysis is deterministic harvest
is equivalant
to llply
from the plyr
package. The difference is
with stochastic analysis, such as Markov Chain Monte Carlo (MCMC),
where harvest
resumes the RNG stream where farm
left
off when generating the data.The effect is the results can be taken in any order and independently, and the final results are the same as if each analysis was taken from start to end with setting a single seed for each stream.
Some learn best by example. Here I will show a simple example for the basic process. Here we will perform simple linear regression for 100 data sets. First off we gather the seeds. This step is separate to facilitate storing the seeds to be distributed along with research if necessary.
library(harvestr) library(plyr) seeds <- gather(100, seed=12345)
Second, we generate the data.
datasets <- farm(seeds, { x <- rnorm(100) y <- rnorm(100, mean=x) data.frame(y,x) })
Then we analyze the data.
analyses <- harvest(datasets, lm)
So what do we have in analyses
? We have whatever lm
returned. In this case we have a list of lm
objects containg
the results of a linear regression. Ussually we will want to do more to
summarize the results.
library(dostats) coefs <- t(sapply(analyses, coef)) adply(coefs,2, dostats, mean, sd)
That is very nice, but rather simple as far ananalyses go. What might be more interesting is to perform an analysis with a stochastic component such as Markov Chain Monte Carlo.
library(MCMCpack) library(plyr) posteriors <- harvest(datasets, MCMCregress, formula=y~x) dataframes <- harvest(posteriors, as.data.frame) X.samples <- harvest(dataframes, `[[`, "x") densities <- harvest(X.samples, density) ```r plot(densities[[1]]) l_ply(densities, lines)
To ease longer analyses with many steps caching is available.
unlink("harvestr-cache", recursive=TRUE) # reset cache system.time({ posteriors1 <- harvest(datasets, MCMCregress, formula=y~x, cache=TRUE) })
and when we run it again.
system.time({ posteriors2 <- harvest(datasets, MCMCregress, formula=y~x, cache=TRUE) })
To maintain integrity harvestr
functions take use
digest
to create hashes of the seed, data, and function so that
if any element changes, out of data cache results will not be used.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.