title: Using harvestr for replicable simulations author: Andrew Redd vignette: > %\VignetteIndexEntry{Using harvestr for replicable simulations} %\VignetteEngine{knitr::rmarkdown} %\VignetteEncoding{UTF-8} ...

library(harvestr)
library(plyr)
library(MCMCpack)
library(dostats)

Introduction

The harvestr package is a new approach to simulation studies that facilitates parallel execution. It builds off the structures available in the plyr, foreach and rsprng packages. What harvestr brings to the picture is abstractions of the process of performing simulation.

Process

The theme of harvestr is that of gardening, stemming from the idea that the pseudo-random numbers generated (RNG) in replicable simulation come from initial states called seeds. Figure 1 shows the basic process for harvestr.

(workflow.png)[The basic harvestr process]

The ideas are simple.

The effect is the results can be taken in any order and independently, and the final results are the same as if each analysis was taken from start to end with setting a single seed for each stream.

Example 1 - Basic Example

Some learn best by example. Here I will show a simple example for the basic process. Here we will perform simple linear regression for 100 data sets. First off we gather the seeds. This step is separate to facilitate storing the seeds to be distributed along with research if necessary.

seeds <- gather(25, seed=12345)

Second, we generate the data.

datasets <- farm(seeds, {
  x <- rnorm(100)
  y <- rnorm(100, mean=x)
  data.frame(y,x)
})

Then we analyze the data.

analyses <-  harvest(datasets, lm)

So what do we have in analyses? We have whatever lm returned. In this case we have a list of lm objects containg the results of a linear regression. Ussually we will want to do more to summarize the results.

coefs <- Reduce(rbind, lapply(analyses, coef))
apply(coefs, 2, dostats, mean, sd)

Example 2 - Stochastic Analysis

That is very nice, but rather simple as far ananalyses go. What might be more interesting is to perform an analysis with a stochastic component such as Markov Chain Monte Carlo.

library(MCMCpack)
library(plyr)
posteriors <- harvest(datasets, MCMCregress, formula=y~x)
dataframes <- harvest(posteriors, as.data.frame)
X.samples  <- harvest(dataframes, `[[`, "x")
densities  <- harvest(X.samples, density)
plot(densities[[1]])
l_ply(densities, lines)

Example 3 - Caching

To ease longer analyses with many steps caching is available.

unlink("harvestr-cache", recursive=TRUE)  # reset cache
system.time({
    posteriors1 <- harvest(datasets, MCMCregress, formula=y~x, cache=TRUE)
})

and when we run it again.

system.time({
    posteriors2 <- harvest(datasets, MCMCregress, formula=y~x, cache=TRUE)
})

To maintain integrity harvestr functions take use digest to create hashes of the seed, data, and function so that if any element changes, out of data cache results will not be used.



halpo/harvestr documentation built on April 2, 2020, 8:43 p.m.