knitr::opts_chunk$set(
  error = FALSE,
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5)
lang_output <- function(x, lang) {
  cat(c(sprintf("```%s", lang), x, "```"), sep = "\n")
}
cc_output <- function(x) lang_output(x, "cc")
r_output <- function(x) lang_output(x, "r")
plain_output <- function(x) lang_output(x, "plain")

Stochastic models can be used in statistical inference via methods such as particle filtering but in practice doing so requires that the models can be run over and over again very quickly. While R has excellent support for sampling from distributions, it is not necessarily well suited for this sort of problem because R is single threaded, so we are forced to evaluate realisations of the stochastic process in series (one after another) rather than in parallel.

The dust package provides tools to help write stochastic models that can be evaluated in parallel. It does not directly provide statistical methods; see the mcstate package for that. Instead, it focuses on providing:

A simple example - random walk

Consider a unbiased random walk; at each time step we move our position with a draw from a normal distribution with mean 0 and some standard deviation. We consider a single individual moving but will eventually simulate a family of these individuals, each independent. All simulations have a sense of time - a unitless measure of time "step" will be used but it's up to you how it is interpreted (time is a non-negative integer, implemented using size_t).

Model code

To implement this model in dust we write out a C++ file that looks like:

cc_output(readLines(dust:::dust_file("examples/walk.cpp")))

There are two main parts here; the class (walk) and an interface function (dust::dust_pars).

The class is what does most of the work. It is important that this does not use anything in the R API as it may be called in parallel. Therefore use only C++ standard library types. It has to provide every public type or method as the example above. First, the five types:

The constructor must take only one argument, being const dust::pars_type<model>&. You can do what you want in the constructor - here we just copy shared (containing parameters) into the object with the pars argument (we use the private member shared here but this can be anything you want, though the type is important). The dust::pars_type<> template is a wrapper that contains two fields shared (as dust::shared_ptr<walk>) and internal (as internal_type).

The method size() must return the "size" of the system - how many state variables it has. Here we're considering a single individual moving randomly so the size is always 1. Many models will have a size that depends on their parameters, though the size cannot be changed after construction.

There are other forms of the dust::pars_type constructor which would allow specifying the internal data; typically this is used to create scratch space (allocating vectors as needed) rather than compute values because the scratch space and model state can become separated from each other (by direct setting of state, or by shuffling model state between particles).

Constructing a model

The function dust::dust will "compile" a model for us to use:

path_walk <- system.file("examples/walk.cpp", package = "dust")
walk <- dust::dust(path_walk)

However, this is also bundled into the package and can be loaded with:

walk <- dust::dust_example("walk")

The object itself is an "R6ClassGenerator" object:

walk

Create an instance of the model using walk$new. There are three required arguments:

model <- walk$new(list(sd = 1), 0, 20)
model

This returns an R6 object that can be used to simulate from or interact with the model. For example, our initial model state is

model$state()

Here there is one row per model state variable (there is only one here) and one column per particle (there are 20)

and we can run the model for 100 time steps, which returns the state at the end of the walk (and not at any intermediate times):

model$run(100)

We can also directly retrieve the state from our object

model$state()

At this point our particles have been run for 100 time steps with standard deviation 1 at each step so they will be distributed following Normal(0, 10). This is easier to see if we simulate a lot of particles, here 20,000:

model <- walk$new(list(sd = 1), 0, 20000)
invisible(model$run(100))
hist(model$state(), freq = FALSE, las = 1, col = "steelblue2", main = "",
     ylim = c(0., 0.04), xlab = "State")
curve(dnorm(x, 0, 10), col = "orange", add = TRUE, lwd = 2)

Running a model in parallel

The approach above still runs everything in serial, one particle after another. We can configure this system to run in parallel by providing the extra argument n_threads to the constructor.

Provided that your system can compile with openmp the following code will execute in parallel using 2 threads

model <- walk$new(list(sd = 1), 0, 20, n_threads = 2)
model$run(100)

Running this same code in series will give the same results:

model <- walk$new(list(sd = 1), 0, 20, n_threads = 1)
model$run(100)

We use as many random number generators as there are particles, so if you run fewer particles or more, increase the threads or decrease, the results will be the same (see vignette("design") for more on this).

You should be careful when selecting the number of threads. dust will never use more than one thread at a time without it being requested, but avoid using parallel::detectCores() to work out how many threads you have available as it will often return an overestimate. This is particularly the case in a shared-use system such as a cluster or CRAN's servers. We provide a helper function dust::dust_openmp_threads which can be used to try and find a safe number of threads available to you while respecting various environment variables which seek to control this (MC_CORES, OMP_THREAD_LIMIT, etc).

For example,

dust::dust_openmp_threads(100, "fix")

A more interesting example

Consider now an SIR model (Susceptible - Infected - Recovered). This sort of model is common in epidemiology, and is often extended to add additional compartments (e.g., SEIR which adds an Exposed compartment) or by structuring each compartment based on properties such as age. Here, we show a simple example with just 3 compartments:

cc_output(readLines(dust:::dust_file("examples/sir.cpp")))

There a few changes here compared to the version that was shown before for the walk model, but the core parts are only slightly different:

The other difference is a new template specialisation of dust_info - this is optional but can be used to report back to R information about the model based on the input parameters. Here it returns information about variable order and core parameters.

We also have added C++ pseudo-attributes with [[dust::param]] to describe the parameters that this function takes. This is optional, but allows use of a coef method with the class (see below).

As before, we'll use the version bundled with the package

sir <- dust::dust_example("sir")

To get information about the parameters, use coef() on the generator

coef(sir)

The model is initialised the same way as before:

model <- sir$new(list(), 0, 20)

We can use the $info() method to retrieve information about our model:

model$info()

and get its state as before:

model$state()

Because we have 5 states per particle, this is a 5 x 20 matrix.

Suppose that as we run the model we mostly want information on the "I" compartment. So we'll run the model for a bit and retrieve the number of infected individuals, then continue, etc. To do this we use the $set_index() method to indicate which state elements should be returned after using $run():

model$set_index(2L)

(you can pass any integer vector here, of any length, provided all the indices lie between 1 and the length of your state vector)

Now, when using run, the number of infected individuals is returned

model$run(10)
model$run(20)

This is useful when you have many compartments or variables that you are not that interested in during running the model. For a particle filter you might be fitting to the sum of all infected individuals over a number of compartments, so rather than returning hundreds of values back (times hundreds of particles) you can return back much less data and keep things nice and fast.

If the index vector is named, then those names will be used as row names on the returned object:

model$set_index(c(I = 2L))
model$run(30)

We can always use $state() to get the whole state vector:

model$state()

or select only variables of interest:

model$state(c(S = 1L, R = 3L))

Again, this copies names from the index if they are present.

In order to run the simulation beginning-to-end, we use the $simulate method on a dust object, which runs over a set of time steps and records the state at each.

model <- sir$new(list(), 0, 200)
model$set_index(2L)
times <- seq(0, 600, by = 5)
state <- model$simulate(times)

The output here is a 1 x 200 x 121 matrix (n state x n particles x n times)

dim(state)

we need to drop the first dimension and transpose the others for ease of plotting:

traces <- t(drop(state))

Plotting this over time (with 4 time steps per day - see the sir code above)

day <- times / 4
matplot(day, traces, type = "l", lty = 1, col = "#00000022",
        xlab = "Day", ylab = "Number infected (I)")
lines(day, rowMeans(traces), col = "red", lwd = 2)

Other methods

There are a few other methods on the dust objects that may be useful.

Reordering particles

This method exists to support particle filtering, and allows resampling or reordering of particles.

model <- walk$new(list(sd = 1), 0, 20)
model$run(1)

Suppose that we wanted to reorder these particles so that they were in decreasing order:

index <- order(model$state())
index

We then pass this index to the reorder method:

model$reorder(index)
model$state()

We can then continue our random walk. There is no need to sample every particle and particles can appear multiple times in the sample, but the total number must be conserved. Suppose that we want to to sample particles based on how close they are to 0:

p <- dnorm(model$state())
index <- sample(length(p), replace = TRUE , prob = p)
index

We can then apply this sampling:

model$reorder(index)
model$state()

This is not terribly useful on its own but is a key part of a particle filter.

When this reordering happens, only the model state is copied around; the internal data and random number state are left behind.

Set particle state

A particle state is determined by three mutable things; pars, state and time; these can all be updated for a model after it has been created. We have found setting one or more of these at a time important;

The update_state method allows setting any or all of these components.

By default every particle starts from the initial condition specified by your model classes initial() method. However, you can specify a state directly using the $update_state() method. Here, we initialise our SIR model with only 1 infected individual rather than 10:

model <- sir$new(list(), 0, 20)
model$update_state(state = c(1000, 1, 0, 0, 0))
model$state()

Now, when we run the model, far more of the epidemics fail to take off as the infected individual goes disappears before infecting anyone.

times <- seq(0, 600, by = 5)
state <- model$simulate(times)
day <- times / 4
matplot(day, t(state[2, , ]), type = "l", lty = 1, col = "#00000022",
        xlab = "Day", ylab = "Number infected (I)")

You can optionally set the initial time along with the state. This is useful if your model depends on time (e.g., you use the time step in a calculation by transforming it into some more meaningful measure of time).

You can also set the initial state to a range of different values. Suppose we set the initial number of infections to be Poisson distributed with a mean of 10, we might write:

I0 <- rpois(20, 10)
state0 <- rbind(1010 - I0, I0, 0, 0, 0, deparse.level = 0)
model$update_state(state = state0, time = 0L)
model$time()
model$state()

Reset the model

One particularly common case is to "reset" the model in order to run it again. you should not simply recreate the model from its constructor as that will re-seed a new set of random state -- it would be preferable continue with the generator state in your old model.

Suppose we run our model with sd of 1 for 100 time steps

model <- walk$new(list(sd = 1), 0, 10, seed = 1L)
y1 <- model$run(100)

we then use reset to set new parameters into the model and set the time back to zero and can run again

model$update_state(pars = list(sd = 2), time = 0)
y2 <- model$run(100)

The state created in y2 will have started from our new starting point and time zero, but have used the same random number state through both simulations, which is generally what we want.

Use within a package

You should not use dust::dust() within a package, because that would cause the model to compile each time you use it, rather than when the package builds. It may also cause issues when trying to use the model in parallel (e.g., with the parallel package). Instead you should use dust::dust_package() which will generate appropriate code for you.

To use dust in a package, put your dust models in inst/dust and run dust::dust_package() on the package's root directory.

desc <- c(
  "Package: example",
  "Title: Example Dust in a Package",
  "Version: 0.0.1",
  "LinkingTo: cpp11, dust",
  "Authors@R: c(person('A', 'Person', role = c('aut', 'cre')",
  "                     email = 'person@example.com'))",
  "License: CC0")
ns <- "useDynLib('example', .registration = TRUE)"

path <- tempfile()
dir.create(path)
dir.create(file.path(path, "inst/dust"), FALSE, TRUE)
writeLines(desc, file.path(path, "DESCRIPTION"))
writeLines(ns, file.path(path, "NAMESPACE"))
path_ex <- system.file("examples", package = "dust", mustWork = TRUE)
file.copy(file.path(path_ex, "walk.cpp"), file.path(path, "inst/dust"))

A skeleton package might contain:

withr::with_dir(path, fs::dir_tree())

This is the normal R package skeleton, though missing R and src directories (for now). The DESCRIPTION file contains

plain_output(readLines(file.path(path, "DESCRIPTION")))

The important things here are:

Our NAMESPACE file contains:

plain_output(readLines(file.path(path, "NAMESPACE")))

The files in inst/dust are the same files as seen above, with walk.cpp containing

cc_output(readLines(file.path(path, "inst/dust/walk.cpp")))

There can be as many of these files as you want within the directory inst/dust.

To prepare the package, run dust::dust_package():

dust::dust_package(path)

The directory structure now has more files:

withr::with_dir(path, fs::dir_tree())

The file src/walk.cpp are generated by dust and should not be edited. They include your model, but also a bit of helper code:

cc_output(readLines(file.path(path, "src/walk.cpp")))

The file R/dust.R contains the R interface generated by dust with the constructor objects (all models' constructors will be collected into this file, which also should not be edited).

r_output(readLines(file.path(path, "R/dust.R")))

Finally, R/cpp11.R and src/cpp11.cpp are files created by cpp11 that should not be edited.

Your package can include as much R code as you want, and can be developed like any other R package. But any time you change the code in inst/dust you should rerun dust::dust_package().



mrc-ide/dust documentation built on May 11, 2024, 1:08 p.m.