In mrc-ide/dust: Iterate Multiple Realisations of Stochastic Models

knitr::opts_chunk$set(
  error = FALSE,
  collapse = TRUE,
  comment = "#>",
  fig.width = 7,
  fig.height = 5)
lang_output <- function(x, lang) {
  cat(c(sprintf("```%s", lang), x, "```"), sep = "\n")
}
cc_output <- function(x) lang_output(x, "cc")
r_output <- function(x) lang_output(x, "r")
plain_output <- function(x) lang_output(x, "plain")

Stochastic models can be used in statistical inference via methods such as particle filtering but in practice doing so requires that the models can be run over and over again very quickly. While R has excellent support for sampling from distributions, it is not necessarily well suited for this sort of problem because R is single threaded, so we are forced to evaluate realisations of the stochastic process in series (one after another) rather than in parallel.

The dust package provides tools to help write stochastic models that can be evaluated in parallel. It does not directly provide statistical methods; see the mcstate package for that. Instead, it focuses on providing:

a fast random number generator that can be evaluated in parallel (see vignette("rng") for details)
a way of wrapping a user-provided model, itself written as a C++ class
a lightweight interface that drives this model from R
a set of useful primitives for developing sequential Monte Carlo methods.

A simple example - random walk

Consider a unbiased random walk; at each time step we move our position with a draw from a normal distribution with mean 0 and some standard deviation. We consider a single individual moving but will eventually simulate a family of these individuals, each independent. All simulations have a sense of time - a unitless measure of time "step" will be used but it's up to you how it is interpreted (time is a non-negative integer, implemented using size_t).

Model code

To implement this model in dust we write out a C++ file that looks like:

cc_output(readLines(dust:::dust_file("examples/walk.cpp")))

There are two main parts here; the class (walk) and an interface function (dust::dust_pars).

The class is what does most of the work. It is important that this does not use anything in the R API as it may be called in parallel. Therefore use only C++ standard library types. It has to provide every public type or method as the example above. First, the five types:

the type real_type has to exist and indicate what sort of numeric type you use for "real numbers". Typically this will be double, but if you need a different type (typically float) you can use it here.
the type data_type describes data that the model may be compared with as it runs. Here we use dust::no_data as the model has no such data but see vignette("data") for the interface here.
the type shared_type is whatever read-only data your model needs that is shared across particles. Typically these reflect parameters of the model (see dust::dust_pars below), and while these will typically be shared across all particles that is is not always the case (see vignette("multi"))
the type internal_type is whatever internal data or space your model needs to run. This is read-write and per-particle (in contrast with shared_type above)
the type rng_state_type is one of the valid RNG types. Typically using rng_state_type = dust::random::generator<real_type>; will select a reasonable choice, but you could also force a specific generator such as dust::random::xoshiro256starstar here (see vignette("rng"))

The constructor must take only one argument, being const dust::pars_type<model>&. You can do what you want in the constructor - here we just copy shared (containing parameters) into the object with the pars argument (we use the private member shared here but this can be anything you want, though the type is important). The dust::pars_type<> template is a wrapper that contains two fields shared (as dust::shared_ptr<walk>) and internal (as internal_type).

The method size() must return the "size" of the system - how many state variables it has. Here we're considering a single individual moving randomly so the size is always 1. Many models will have a size that depends on their parameters, though the size cannot be changed after construction.

There are other forms of the dust::pars_type constructor which would allow specifying the internal data; typically this is used to create scratch space (allocating vectors as needed) rather than compute values because the scratch space and model state can become separated from each other (by direct setting of state, or by shuffling model state between particles).

Constructing a model

The function dust::dust will "compile" a model for us to use:

path_walk <- system.file("examples/walk.cpp", package = "dust")
walk <- dust::dust(path_walk)

However, this is also bundled into the package and can be loaded with:

walk <- dust::dust_example("walk")

The object itself is an "R6ClassGenerator" object:

walk

Create an instance of the model using walk$new. There are three required arguments:

pars: passed to dust_pars to initialise the model
time: the initial time (0 seems like a sensible choice here given our model has no internal sense of time)
n_particles: the number of particles to create

model <- walk$new(list(sd = 1), 0, 20)
model

This returns an R6 object that can be used to simulate from or interact with the model. For example, our initial model state is

model$state()

Here there is one row per model state variable (there is only one here) and one column per particle (there are 20)

and we can run the model for 100 time steps, which returns the state at the end of the walk (and not at any intermediate times):

model$run(100)

We can also directly retrieve the state from our object

model$state()

At this point our particles have been run for 100 time steps with standard deviation 1 at each step so they will be distributed following Normal(0, 10). This is easier to see if we simulate a lot of particles, here 20,000:

model <- walk$new(list(sd = 1), 0, 20000)
invisible(model$run(100))
hist(model$state(), freq = FALSE, las = 1, col = "steelblue2", main = "",
     ylim = c(0., 0.04), xlab = "State")
curve(dnorm(x, 0, 10), col = "orange", add = TRUE, lwd = 2)

Running a model in parallel

The approach above still runs everything in serial, one particle after another. We can configure this system to run in parallel by providing the extra argument n_threads to the constructor.

Provided that your system can compile with openmp the following code will execute in parallel using 2 threads

model <- walk$new(list(sd = 1), 0, 20, n_threads = 2)
model$run(100)

Running this same code in series will give the same results:

model <- walk$new(list(sd = 1), 0, 20, n_threads = 1)
model$run(100)

We use as many random number generators as there are particles, so if you run fewer particles or more, increase the threads or decrease, the results will be the same (see vignette("design") for more on this).

You should be careful when selecting the number of threads. dust will never use more than one thread at a time without it being requested, but avoid using parallel::detectCores() to work out how many threads you have available as it will often return an overestimate. This is particularly the case in a shared-use system such as a cluster or CRAN's servers. We provide a helper function dust::dust_openmp_threads which can be used to try and find a safe number of threads available to you while respecting various environment variables which seek to control this (MC_CORES, OMP_THREAD_LIMIT, etc).

For example,

dust::dust_openmp_threads(100, "fix")

A more interesting example

Consider now an SIR model (Susceptible - Infected - Recovered). This sort of model is common in epidemiology, and is often extended to add additional compartments (e.g., SEIR which adds an Exposed compartment) or by structuring each compartment based on properties such as age. Here, we show a simple example with just 3 compartments:

cc_output(readLines(dust:::dust_file("examples/sir.cpp")))

There a few changes here compared to the version that was shown before for the walk model, but the core parts are only slightly different:

the initial method is slightly more complicated as we initialise a 3-element vector
our shared_type object has significantly more elements
the dust_pars function is much more complicated
there is a compare_data method and a nontrivial data_type definition, but ignore these for now (see vignette("data"))

The other difference is a new template specialisation of dust_info - this is optional but can be used to report back to R information about the model based on the input parameters. Here it returns information about variable order and core parameters.

We also have added C++ pseudo-attributes with [[dust::param]] to describe the parameters that this function takes. This is optional, but allows use of a coef method with the class (see below).

As before, we'll use the version bundled with the package

sir <- dust::dust_example("sir")

To get information about the parameters, use coef() on the generator

coef(sir)

The model is initialised the same way as before:

model <- sir$new(list(), 0, 20)

We can use the $info() method to retrieve information about our model:

model$info()

and get its state as before:

model$state()

Because we have 5 states per particle, this is a 5 x 20 matrix.

Suppose that as we run the model we mostly want information on the "I" compartment. So we'll run the model for a bit and retrieve the number of infected individuals, then continue, etc. To do this we use the $set_index() method to indicate which state elements should be returned after using $run():

model$set_index(2L)

(you can pass any integer vector here, of any length, provided all the indices lie between 1 and the length of your state vector)

Now, when using run, the number of infected individuals is returned

model$run(10)
model$run(20)

This is useful when you have many compartments or variables that you are not that interested in during running the model. For a particle filter you might be fitting to the sum of all infected individuals over a number of compartments, so rather than returning hundreds of values back (times hundreds of particles) you can return back much less data and keep things nice and fast.

If the index vector is named, then those names will be used as row names on the returned object:

model$set_index(c(I = 2L))
model$run(30)

We can always use $state() to get the whole state vector:

model$state()

or select only variables of interest:

model$state(c(S = 1L, R = 3L))

Again, this copies names from the index if they are present.

In order to run the simulation beginning-to-end, we use the $simulate method on a dust object, which runs over a set of time steps and records the state at each.

model <- sir$new(list(), 0, 200)
model$set_index(2L)
times <- seq(0, 600, by = 5)
state <- model$simulate(times)

The output here is a 1 x 200 x 121 matrix (n state x n particles x n times)

dim(state)

we need to drop the first dimension and transpose the others for ease of plotting:

traces <- t(drop(state))

Plotting this over time (with 4 time steps per day - see the sir code above)

day <- times / 4
matplot(day, traces, type = "l", lty = 1, col = "#00000022",
        xlab = "Day", ylab = "Number infected (I)")
lines(day, rowMeans(traces), col = "red", lwd = 2)

Other methods

There are a few other methods on the dust objects that may be useful.

Reordering particles

This method exists to support particle filtering, and allows resampling or reordering of particles.

model <- walk$new(list(sd = 1), 0, 20)
model$run(1)

Suppose that we wanted to reorder these particles so that they were in decreasing order:

index <- order(model$state())
index

We then pass this index to the reorder method:

model$reorder(index)
model$state()

We can then continue our random walk. There is no need to sample every particle and particles can appear multiple times in the sample, but the total number must be conserved. Suppose that we want to to sample particles based on how close they are to 0:

p <- dnorm(model$state())
index <- sample(length(p), replace = TRUE , prob = p)
index

We can then apply this sampling:

model$reorder(index)
model$state()

This is not terribly useful on its own but is a key part of a particle filter.

When this reordering happens, only the model state is copied around; the internal data and random number state are left behind.

Set particle state

A particle state is determined by three mutable things; pars, state and time; these can all be updated for a model after it has been created. We have found setting one or more of these at a time important;

Resetting the model with a new set of parameters (pars), initial conditions (state) and times (time)
Changing pars at some point in the simulation to introduce some new aspect of the model
Changing state to manually move around some individuals within a model
Setting time along with state when initialising the model from a previously saved state

The update_state method allows setting any or all of these components.

By default every particle starts from the initial condition specified by your model classes initial() method. However, you can specify a state directly using the $update_state() method. Here, we initialise our SIR model with only 1 infected individual rather than 10:

model <- sir$new(list(), 0, 20)
model$update_state(state = c(1000, 1, 0, 0, 0))
model$state()

Now, when we run the model, far more of the epidemics fail to take off as the infected individual goes disappears before infecting anyone.

times <- seq(0, 600, by = 5)
state <- model$simulate(times)
day <- times / 4
matplot(day, t(state[2, , ]), type = "l", lty = 1, col = "#00000022",
        xlab = "Day", ylab = "Number infected (I)")

You can optionally set the initial time along with the state. This is useful if your model depends on time (e.g., you use the time step in a calculation by transforming it into some more meaningful measure of time).

You can also set the initial state to a range of different values. Suppose we set the initial number of infections to be Poisson distributed with a mean of 10, we might write:

I0 <- rpois(20, 10)
state0 <- rbind(1010 - I0, I0, 0, 0, 0, deparse.level = 0)
model$update_state(state = state0, time = 0L)
model$time()
model$state()

Reset the model

One particularly common case is to "reset" the model in order to run it again. you should not simply recreate the model from its constructor as that will re-seed a new set of random state -- it would be preferable continue with the generator state in your old model.

Suppose we run our model with sd of 1 for 100 time steps

model <- walk$new(list(sd = 1), 0, 10, seed = 1L)
y1 <- model$run(100)

we then use reset to set new parameters into the model and set the time back to zero and can run again

model$update_state(pars = list(sd = 2), time = 0)
y2 <- model$run(100)

The state created in y2 will have started from our new starting point and time zero, but have used the same random number state through both simulations, which is generally what we want.

Use within a package

You should not use dust::dust() within a package, because that would cause the model to compile each time you use it, rather than when the package builds. It may also cause issues when trying to use the model in parallel (e.g., with the parallel package). Instead you should use dust::dust_package() which will generate appropriate code for you.

To use dust in a package, put your dust models in inst/dust and run dust::dust_package() on the package's root directory.

desc <- c(
  "Package: example",
  "Title: Example Dust in a Package",
  "Version: 0.0.1",
  "LinkingTo: cpp11, dust",
  "Authors@R: c(person('A', 'Person', role = c('aut', 'cre')",
  "                     email = 'person@example.com'))",
  "License: CC0")
ns <- "useDynLib('example', .registration = TRUE)"

path <- tempfile()
dir.create(path)
dir.create(file.path(path, "inst/dust"), FALSE, TRUE)
writeLines(desc, file.path(path, "DESCRIPTION"))
writeLines(ns, file.path(path, "NAMESPACE"))
path_ex <- system.file("examples", package = "dust", mustWork = TRUE)
file.copy(file.path(path_ex, "walk.cpp"), file.path(path, "inst/dust"))

A skeleton package might contain:

withr::with_dir(path, fs::dir_tree())

This is the normal R package skeleton, though missing R and src directories (for now). The DESCRIPTION file contains

plain_output(readLines(file.path(path, "DESCRIPTION")))

The important things here are:

the package name (Package). We're using example, and names with a dot may not work as expected
including cpp11 and dust in LinkingTo, which allows package compilation to find their respective header files
a useDynLib call to your package in the NAMESPACE file

Our NAMESPACE file contains:

plain_output(readLines(file.path(path, "NAMESPACE")))

The files in inst/dust are the same files as seen above, with walk.cpp containing

cc_output(readLines(file.path(path, "inst/dust/walk.cpp")))

There can be as many of these files as you want within the directory inst/dust.

To prepare the package, run dust::dust_package():

dust::dust_package(path)

The directory structure now has more files:

withr::with_dir(path, fs::dir_tree())

The file src/walk.cpp are generated by dust and should not be edited. They include your model, but also a bit of helper code:

cc_output(readLines(file.path(path, "src/walk.cpp")))

The file R/dust.R contains the R interface generated by dust with the constructor objects (all models' constructors will be collected into this file, which also should not be edited).

r_output(readLines(file.path(path, "R/dust.R")))

Finally, R/cpp11.R and src/cpp11.cpp are files created by cpp11 that should not be edited.

Your package can include as much R code as you want, and can be developed like any other R package. But any time you change the code in inst/dust you should rerun dust::dust_package().

mrc-ide/dust documentation built on April 13, 2025, 12:06 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com