knitr::opts_chunk$set(
  error = FALSE,
  collapse = TRUE,
  comment = "#>"
)
lang_output <- function(x, lang) {
  cat(c(sprintf("```%s", lang), x, "```"), sep = "\n")
}
cc_output <- function(x) lang_output(x, "cc")
r_output <- function(x) lang_output(x, "r")
plain_output <- function(x) lang_output(x, "plain")

With the core approach in dust, you can run models efficiently in parallel up to the number of cores your workstation has available. Getting more than 32 or 64 cores is hard though, and dust provides no multi-node parallelism (e.g., MPI). Instead, we have developed a system for running dust models on GPUs (graphics processing units), specifically NVIDIA GPUs via CUDA (Compute Unified Device Architecture).

This vignette is written in reverse order, starting with how to run a model on a GPU before covering how to write a dust model that can be run on one. The reason for this is that we do not expect most people will write these interfaces directly; instead we expect they will use the odin.dust package to generate them automatically, without having to write a single line of C++ or CUDA code.

Principles

The overall approach, expanded on below, follows a few principles: the same model source supports both CPU and GPU execution; a model compiled with GPU support presents exactly the same interface from R as one without; and the GPU code can always be run on the CPU, which makes testing practical. We also expect most users to generate GPU-capable models via odin.dust rather than writing CUDA by hand.

Running a model with GPU support

The sirs model includes GPU support, and will be the focus of this vignette. However, the installed version cannot be run directly on the GPU for a couple of reasons:

- it is compiled for the CPU only (compiling the GPU code requires nvcc, as below)
- it is compiled in double precision, whereas most GPUs are far faster in single precision (real_type = "float")

In addition, you will also need:

- a CUDA-capable NVIDIA GPU
- working NVIDIA drivers
- the CUDA toolkit, which provides the nvcc compiler and build tools

You can check with the command-line tool nvidia-smi if you have suitable hardware and drivers and with dust::dust_cuda_configuration(quiet = FALSE, forget = TRUE) if you have suitable build tools.
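For example, a rough pre-flight check from R might look like the sketch below; nvidia-smi exits with a non-zero status when no usable device or driver is present, and dust_cuda_configuration reports what it finds of the build tools.

# Sketch: check for drivers/hardware, then for build tools
has_driver <- nzchar(Sys.which("nvidia-smi")) &&
  system2("nvidia-smi", stdout = FALSE, stderr = FALSE) == 0
has_driver
dust::dust_cuda_configuration(quiet = FALSE, forget = TRUE)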

So here, rather than using dust::dust_example, we compile the model and pass in arguments:

path <- system.file("examples/sirs.cpp", package = "dust")
sirs <- dust::dust(path, gpu = TRUE, real_type = "float")

Notice that nvcc is used to compile the model, rather than g++ or clang++. The additional option -gencode=arch=compute_XX,code=sm_XX is added by dust and encodes the CUDA compute version supported by the graphics cards found on the current system. You can use dust::dust_cuda_options to set additional options, passing its return value as the gpu argument above.
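For example, you might turn on CUDA's fast-math mode, which trades floating-point accuracy for speed; this is a sketch assuming the fast_math argument described in ?dust::dust_cuda_options.

# Sketch: custom CUDA options used in place of gpu = TRUE
opts <- dust::dust_cuda_options(fast_math = TRUE)
sirs_fast <- dust::dust(path, gpu = opts, real_type = "float")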

Once compiled with GPU support, the static method has_gpu_support will report TRUE:

sirs$public_methods$has_gpu_support()

and the static method gpu_info will report on the GPU devices available:

sirs$public_methods$gpu_info()

If you have more than one GPU, the id in the devices section will be useful for targeting the correct device.
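For example, something along the lines of the sketch below could select a device programmatically; we assume here that the devices element is a data.frame with an id column, as described above.

# Sketch: inspect the available devices, then target one explicitly
info <- sirs$public_methods$gpu_info()
info$devices
model <- sirs$new(list(), 0, 8192, gpu_config = info$devices$id[[1]], seed = 1L)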

The object is initialised as usual, the only difference being the additional gpu_config argument:

pars <- list()
n_particles <- 8192
model_gpu <- sirs$new(pars, 0, n_particles, gpu_config = 0L, seed = 1L)

Once initialised, a model can only be run on either the GPU or CPU, so we'll create a CPU version here for comparison:

model_cpu <- sirs$new(pars, 0, n_particles, seed = 1L)

By leaving gpu_config as NULL we indicate that the model should run on the CPU.

Once created, the uses_gpu method indicates if the model is set up to run on the GPU (rather than CPU):

model_gpu$uses_gpu()
model_cpu$uses_gpu()

Running the model on the CPU is fairly slow, as this is a lot of particles and we're not running in parallel:

(t_cpu <- system.time(model_cpu$run(400)))

Running the model on the GPU, however:

(t_gpu <- system.time(model_gpu$run(400)))

This is much faster! However, ~8,000 particles is unlikely to saturate a modern GPU, and (overhead aside) this will run about as quickly for potentially a hundred thousand particles. For example, running 2^17 (131,072) particles only takes a little longer:

model_large <- sirs$new(list(), 0, 2^17, gpu_config = 0L, seed = 1L)
(t_large <- system.time(model_large$run(400)))
ratio <- t_large[["elapsed"]] / t_cpu[["elapsed"]]

This is heaps faster; the GPU model ran in `r round(ratio * 100, 1)`% of the time of the CPU model but simulated `r 2^17 / n_particles` times as many particles (i.e., `r round(2^17 / n_particles / ratio)` times as fast per particle). With the relatively low times here, much of this is just moving data around, and with over a hundred thousand particles that is nontrivial. Of course, doing anything quickly with all these particles is its own problem.
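The same comparison can be expressed as per-particle throughput, computed directly from the timings above:

# Particles simulated per second of elapsed time, for each model
rate_cpu <- n_particles / t_cpu[["elapsed"]]
rate_gpu <- 2^17 / t_large[["elapsed"]]
rate_gpu / rate_cpu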

All methods will automatically run on the GPU; this includes run, simulate, compare_data and filter. The last two are typically used from the mcstate interface.
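For example, simulate dispatches to the GPU in exactly the same way; a sketch, continuing the GPU model from its current time (400) and recording the state at each requested time:

# Simulate onwards from the model's current time, on the GPU
model_gpu$simulate(seq(400, 800, by = 100))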

Writing a GPU-capable model

The sirs model from above:

cc_output(readLines(path))

This is somewhat more complicated than the models described in vignette("dust.Rmd"). There are several important components required to run on the GPU.

Within the dust::gpu namespace, we declare the sizes of the shared parameters for the model: the number of ints (here 1, for freq) and the number of reals (here 5: alpha, beta, gamma, dt and exp_noise). These are parameters that will be the same across all particles within a parameter set, as opposed to quantities that change between particles.

namespace dust {
namespace gpu {
template <>
size_t shared_int_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 1;
}

template <>
size_t shared_real_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 5;
}
}
}

These can be omitted if your model has no such parameters.

The next definition makes the above a bit clearer: it defines how these parameters are copied into the int and real arrays:

namespace dust {
namespace gpu {
template <>
void shared_copy<sirs>(dust::shared_ptr<sirs> shared,
                       int * shared_int,
                       sirs::real_type * shared_real) {
  using real_type = sirs::real_type;
  using dust::gpu::shared_copy_data;
  shared_real = shared_copy_data<real_type>(shared_real, shared->alpha);
  shared_real = shared_copy_data<real_type>(shared_real, shared->beta);
  shared_real = shared_copy_data<real_type>(shared_real, shared->gamma);
  shared_real = shared_copy_data<real_type>(shared_real, shared->dt);
  shared_real = shared_copy_data<real_type>(shared_real, shared->exp_noise);
  shared_int = shared_copy_data<int>(shared_int, shared->freq);
}
}
}

In the CPU version of the model we have a nice smart pointer to a struct (dust::shared_ptr<sirs>) from which we can access parameters by name (e.g., shared->alpha). There are no such niceties in CUDA, where we need a single contiguous block of memory. The dust::gpu::shared_copy_data function is a small utility that makes the bookkeeping here a bit easier (each call copies one value and returns the advanced destination pointer), but this could have been written out as:

namespace dust {
namespace gpu {
template <>
void shared_copy<sirs>(dust::shared_ptr<sirs> shared,
                       int * shared_int,
                       sirs::real_type * shared_real) {
  using real_type = sirs::real_type;
  shared_real[0] = shared->alpha;
  shared_real[1] = shared->beta;
  shared_real[2] = shared->gamma;
  shared_real[3] = shared->dt;
  shared_real[4] = shared->exp_noise;
  shared_int[0] = shared->freq;
}
}
}

The dust::gpu::shared_copy_data template has specialisations for the case where the object being copied is a vector, in which case every element is copied and the returned pointer is advanced past them all.
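For example, if a hypothetical model (not the sirs model above) carried a vector-valued shared parameter m, the same pattern would handle it; the names mymodel and m here are illustrative only:

// Hypothetical model with one scalar and one vector shared parameter:
// the size must count every element of the vector ...
template <>
size_t shared_real_size<mymodel>(dust::shared_ptr<mymodel> shared) {
  return 1 + shared->m.size();
}

// ... and a single shared_copy_data call copies the whole vector,
// returning the pointer advanced past it:
template <>
void shared_copy<mymodel>(dust::shared_ptr<mymodel> shared,
                          int * shared_int,
                          mymodel::real_type * shared_real) {
  using real_type = mymodel::real_type;
  using dust::gpu::shared_copy_data;
  shared_real = shared_copy_data<real_type>(shared_real, shared->alpha);
  shared_real = shared_copy_data<real_type>(shared_real, shared->m);
}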

There are two further functions, not used here, that define the size of per-particle internal storage space. These are required if your model needs to store intermediate calculations during the update step that will not fit on the stack.

namespace dust {
namespace gpu {
template <>
size_t internal_real_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 0;
}
template <>
size_t internal_int_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 0;
}
}
}
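If, say, a hypothetical model needed a scratch vector of reals of length n_age per particle during its update, it would report that length here instead (mymodel and n_age are illustrative names):

// Hypothetical: reserve n_age reals of per-particle scratch space
template <>
size_t internal_real_size<mymodel>(dust::shared_ptr<mymodel> shared) {
  return shared->n_age;
}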

Most interestingly, we have the update_gpu function, which actually does the update on the GPU:

template <>
__device__
void update_gpu<sirs>(size_t time,
                      const dust::gpu::interleaved<sirs::real_type> state,
                      dust::gpu::interleaved<int> internal_int,
                      dust::gpu::interleaved<sirs::real_type> internal_real,
                      const int * shared_int,
                      const sirs::real_type * shared_real,
                      sirs::rng_state_type& rng_state,
                      dust::gpu::interleaved<sirs::real_type> state_next) {
  using real_type = sirs::real_type;
  const real_type alpha = shared_real[0];
  const real_type beta = shared_real[1];
  const real_type gamma = shared_real[2];
  const real_type dt = shared_real[3];
  const int freq = shared_int[0];
  const real_type S = state[0];
  const real_type I = state[1];
  const real_type R = state[2];
  const real_type N = S + I + R;
  const real_type p_SI = 1 - exp(- beta * I / N);
  const real_type p_IR = 1 - exp(- gamma);
  const real_type p_RS = 1 - exp(- alpha);
  const real_type n_SI = dust::random::binomial<real_type>(rng_state, S, p_SI * dt);
  const real_type n_IR = dust::random::binomial<real_type>(rng_state, I, p_IR * dt);
  const real_type n_RS = dust::random::binomial<real_type>(rng_state, R, p_RS * dt);
  state_next[0] = S - n_SI + n_RS;
  state_next[1] = I + n_SI - n_IR;
  state_next[2] = R + n_IR - n_RS;
  state_next[3] = (time % freq == 0) ? n_SI : state[3] + n_SI;
}

Note that this entirely duplicates the logic from the CPU version (which was a method of the sirs class):

  void update(size_t time, const real_type * state, rng_state_type& rng_state,
              real_type * state_next) {
    real_type S = state[0];
    real_type I = state[1];
    real_type R = state[2];
    real_type N = S + I + R;

    real_type p_SI = 1 - exp(- shared->beta * I / N);
    real_type p_IR = 1 - exp(-(shared->gamma));
    real_type p_RS = 1 - exp(- shared->alpha);

    real_type dt = shared->dt;
    real_type n_SI = dust::random::binomial<real_type>(rng_state, S, p_SI * dt);
    real_type n_IR = dust::random::binomial<real_type>(rng_state, I, p_IR * dt);
    real_type n_RS = dust::random::binomial<real_type>(rng_state, R, p_RS * dt);

    state_next[0] = S - n_SI + n_RS;
    state_next[1] = I + n_SI - n_IR;
    state_next[2] = R + n_IR - n_RS;
    state_next[3] = (time % shared->freq == 0) ? n_SI : state[3] + n_SI;
  }

Differences include:

- the __device__ attribute, which compiles the function for execution on the GPU
- per-particle state is accessed through dust::gpu::interleaved<> rather than through plain pointers, so that adjacent threads touch adjacent memory locations (see the sketch below)
- parameters are read out of the shared_int and shared_real arrays by position, rather than by name via the shared pointer
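To give a feel for what interleaved access means, here is a conceptual sketch; it is not dust's actual implementation, but it shows the kind of layout that makes GPU reads and writes coalesced:

// Conceptual sketch only: with n_particles particles, state element j of
// particle i lives at data[j * n_particles + i], so the threads for
// particles 0, 1, 2, ... read element j from adjacent addresses.
template <typename T>
struct interleaved_sketch {
  T* data;
  size_t particle;     // index of this thread's particle
  size_t n_particles;  // stride between successive state elements
  T& operator[](size_t j) {
    return data[j * n_particles + particle];
  }
};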

Data comparison functions

Finally, if running a particle filter on the GPU, a version of the compare_data function is required that can run on the GPU:

template <>
__device__
sirs::real_type compare_gpu<sirs>(const dust::gpu::interleaved<sirs::real_type> state,
                                  const sirs::data_type& data,
                                  dust::gpu::interleaved<int> internal_int,
                                  dust::gpu::interleaved<sirs::real_type> internal_real,
                                  const int * shared_int,
                                  const sirs::real_type * shared_real,
                                  sirs::rng_state_type& rng_state) {
  using real_type = sirs::real_type;
  const real_type exp_noise = shared_real[4];
  const real_type incidence_modelled = state[3];
  const real_type incidence_observed = data.incidence;
  const real_type lambda = incidence_modelled +
    dust::random::exponential<real_type>(rng_state, exp_noise);
  return dust::density::poisson(incidence_observed, lambda, true);
}

This is very similar to the CPU version:

  real_type compare_data(const real_type * state, const data_type& data,
                         rng_state_type& rng_state) {
    const real_type incidence_modelled = state[3];
    const real_type incidence_observed = data.incidence;
    const real_type lambda = incidence_modelled +
      dust::random::exponential<real_type>(rng_state, shared->exp_noise);
    return dust::density::poisson(incidence_observed, lambda, true);
  }

with the same sorts of differences as the update function: the __device__ attribute, interleaved access to per-particle state, and shared parameters read by position rather than by name.
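As a rough illustration of how the comparison is then driven from R, the sketch below uses made-up incidence values; set_data, compare_data and dust::dust_data are the same interface as for CPU models:

# Illustration only, with made-up data: dust::dust_data converts a
# data.frame with a 'time' column into the format set_data expects
dat <- data.frame(time = c(100, 200), incidence = c(120, 75))
mod <- sirs$new(list(), 0, n_particles, gpu_config = 0L, seed = 1L)
mod$set_data(dust::dust_data(dat))
mod$run(100)
mod$compare_data()  # one log-likelihood per particle, computed on the GPU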

Developing a GPU model

Debugging on a GPU is a pain, especially because there are typically many particles and error recovery is not straightforward. In addition, most continuous integration systems do not provide GPUs, so testing GPU code is difficult. To make this easier, dust allows the GPU code to be run on the CPU; this will typically be slower than the plain CPU code, but it allows easier debugging and verification that the model is behaving as expected. We use this extensively in dust's own tests, and also in models built on dust that will run on the GPU.

To do this, compile the model with your preferred real type, but set the gpu argument to FALSE:

sirs_cpu <- dust::dust(path, gpu = FALSE, real_type = "float")

Note that above the model is compiled with g++, not nvcc; however, the "GPU" code is still compiled into the model. We can then initialise it with and without a gpu_config argument:

model1 <- sirs_cpu$new(pars, 0, 10, seed = 1L)
model2 <- sirs_cpu$new(pars, 0, 10, gpu_config = 0L, seed = 1L)

And run the models using the "GPU" code and the normal CPU code, but this time both on the CPU:

model1$run(10)
model2$run(10)

These should be exactly the same, and this can form the basis of tests. Note that using float rather than double can throw up a few issues with algorithms, so it is worth running your tests in single precision too.
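Wrapped up as a unit test, that check might look like this sketch, using testthat:

# Sketch of a test that the "GPU" code agrees with the CPU code
library(testthat)
test_that("GPU code agrees with CPU code", {
  model1 <- sirs_cpu$new(list(), 0, 10, seed = 1L)
  model2 <- sirs_cpu$new(list(), 0, 10, gpu_config = 0L, seed = 1L)
  expect_identical(model1$run(10), model2$run(10))
})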

If you hit problems, then R -d cuda-gdb can be used in place of the usual R -d gdb to attach a debugger, though because of the huge numbers of particles you will typically work with, debugging remains a challenge. Our usual strategy has been to try to recreate any issue purely in CPU code and debug as normal (see this blog post for hints on doing this effectively).


