knitr::opts_chunk$set(
  error = FALSE,
  collapse = TRUE,
  comment = "#>"
)
lang_output <- function(x, lang) {
  cat(c(sprintf("```%s", lang), x, "```"), sep = "\n")
}
cc_output <- function(x) lang_output(x, "cc")
r_output <- function(x) lang_output(x, "r")
plain_output <- function(x) lang_output(x, "plain")

With the core approach in dust, you can run models efficiently in parallel up to the number of cores your workstation has available. Getting more than 32 or 64 cores is hard though, and dust provides no multi-node parallelism (e.g., MPI). Instead, we have developed a system for running dust models on GPUs (graphics processing units), specifically NVIDIA GPUs via CUDA (Compute Unified Device Architecture).

This vignette is written in reverse order, starting with how to run a model on a GPU before covering how to write a dust model that can be run on one. The reason for this is that we do not expect most people will write these interfaces directly; instead we expect they will use the odin.dust package to generate them automatically, without having to write a single line of C++ or CUDA code.

Principles

The overall approach, expanded on below, follows a few principles: the same model source supports both CPU and GPU execution; a model compiled with GPU support presents exactly the same interface from R as one without; and the GPU code can always be run on the CPU, which makes testing practical. We also expect most users to generate GPU-capable models via odin.dust rather than writing CUDA by hand.

Running a model with GPU support

The sirs model includes GPU support, and will be the focus of this vignette. However, the installed version cannot be run directly on the GPU for a couple of reasons:

- it is compiled for the CPU only (compiling the GPU code requires nvcc, as below)
- it is compiled in double precision, whereas most GPUs are far faster in single precision (real_type = "float")

In addition, you will also need:

- a CUDA-capable NVIDIA GPU
- working NVIDIA drivers
- the CUDA toolkit, which provides the nvcc compiler and build tools

You can check with the command-line tool nvidia-smi if you have suitable hardware and drivers and with dust::dust_cuda_configuration(quiet = FALSE, forget = TRUE) if you have suitable build tools.
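For example, a rough pre-flight check from R might look like the sketch below; nvidia-smi exits with a non-zero status when no usable device or driver is present, and dust_cuda_configuration reports what it finds of the build tools.

# Sketch: check for drivers/hardware, then for build tools
has_driver <- nzchar(Sys.which("nvidia-smi")) &&
  system2("nvidia-smi", stdout = FALSE, stderr = FALSE) == 0
has_driver
dust::dust_cuda_configuration(quiet = FALSE, forget = TRUE)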

So here, rather than using dust::dust_example, we compile the model and pass in arguments:

path <- system.file("examples/sirs.cpp", package = "dust")
sirs <- dust::dust(path, gpu = TRUE, real_type = "float")

Notice that nvcc is used to compile the model, rather than g++ or clang++. The additional option -gencode=arch=compute_XX,code=sm_XX is added by dust and encodes the CUDA compute version supported by the graphics cards found on the current system. You can use dust::dust_cuda_options to set additional options, passing its return value as the gpu argument above.
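For example, you might turn on CUDA's fast-math mode, which trades floating-point accuracy for speed; this is a sketch assuming the fast_math argument described in ?dust::dust_cuda_options.

# Sketch: custom CUDA options used in place of gpu = TRUE
opts <- dust::dust_cuda_options(fast_math = TRUE)
sirs_fast <- dust::dust(path, gpu = opts, real_type = "float")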

Once compiled with GPU support, the static method has_gpu_support will report TRUE:

sirs$public_methods$has_gpu_support()

and the static method gpu_info will report on the GPU devices available:

sirs$public_methods$gpu_info()

If you have more than one GPU, the id in the devices section will be useful for targeting the correct device.
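For example, something along the lines of the sketch below could select a device programmatically; we assume here that the devices element is a data.frame with an id column, as described above.

# Sketch: inspect the available devices, then target one explicitly
info <- sirs$public_methods$gpu_info()
info$devices
model <- sirs$new(list(), 0, 8192, gpu_config = info$devices$id[[1]], seed = 1L)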

The object is initialised as usual, the only difference being the additional gpu_config argument:

pars <- list()
n_particles <- 8192
model_gpu <- sirs$new(pars, 0, n_particles, gpu_config = 0L, seed = 1L)

Once initialised, a model can only be run on either the GPU or CPU, so we'll create a CPU version here for comparison:

model_cpu <- sirs$new(pars, 0, n_particles, seed = 1L)

By leaving gpu_config as NULL we indicate that the model should run on the CPU.

Once created, the uses_gpu method indicates if the model is set up to run on the GPU (rather than CPU):

model_gpu$uses_gpu()
model_cpu$uses_gpu()

Running the model on the CPU is fairly slow, as this is a lot of particles and we're not running in parallel:

(t_cpu <- system.time(model_cpu$run(400)))

Running the model on the GPU, however:

(t_gpu <- system.time(model_gpu$run(400)))

This is much faster! However, ~8,000 particles is unlikely to saturate a modern GPU, and (overhead aside) this will run about as quickly for potentially a hundred thousand particles. For example, running 2^17 (131,072) particles only takes a little longer:

model_large <- sirs$new(list(), 0, 2^17, gpu_config = 0L, seed = 1L)
(t_large <- system.time(model_large$run(400)))
ratio <- t_large[["elapsed"]] / t_cpu[["elapsed"]]

This is heaps faster; the GPU model ran in `r round(ratio * 100, 1)`% of the time of the CPU model but simulated `r 2^17 / n_particles` times as many particles (i.e., `r round(2^17 / n_particles / ratio)` times as fast per particle). With the relatively low times here, much of this is just moving data around, and with over a hundred thousand particles that is nontrivial. Of course, doing anything quickly with all these particles is its own problem.
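The same comparison can be expressed as per-particle throughput, computed directly from the timings above:

# Particles simulated per second of elapsed time, for each model
rate_cpu <- n_particles / t_cpu[["elapsed"]]
rate_gpu <- 2^17 / t_large[["elapsed"]]
rate_gpu / rate_cpu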

All methods will automatically run on the GPU; this includes run, simulate, compare_data and filter. The last two are typically used from the mcstate interface.
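For example, simulate dispatches to the GPU in exactly the same way; a sketch, continuing the GPU model from its current time (400) and recording the state at each requested time:

# Simulate onwards from the model's current time, on the GPU
model_gpu$simulate(seq(400, 800, by = 100))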

Writing a GPU-capable model

The sirs model from above:

cc_output(readLines(path))

This is somewhat more complicated than the models described in vignette("dust.Rmd"). There are several important components required to run on the GPU.

Within the dust::gpu namespace, we declare the sizes of the shared parameters for the model: the number of ints (here 1, for freq) and the number of reals (here 5: alpha, beta, gamma, dt and exp_noise). These are parameters that will be the same across all particles within a parameter set, as opposed to quantities that change between particles.

namespace dust {
namespace gpu {
template <>
size_t shared_int_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 1;
}

template <>
size_t shared_real_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 5;
}
}
}

These can be omitted if your model has no such parameters.

The next definition makes the above a bit clearer: it defines how these parameters are copied into the int and real arrays:

namespace dust {
namespace gpu {
template <>
void shared_copy<sirs>(dust::shared_ptr<sirs> shared,
                       int * shared_int,
                       sirs::real_type * shared_real) {
  using real_type = sirs::real_type;
  using dust::gpu::shared_copy_data;
  shared_real = shared_copy_data<real_type>(shared_real, shared->alpha);
  shared_real = shared_copy_data<real_type>(shared_real, shared->beta);
  shared_real = shared_copy_data<real_type>(shared_real, shared->gamma);
  shared_real = shared_copy_data<real_type>(shared_real, shared->dt);
  shared_real = shared_copy_data<real_type>(shared_real, shared->exp_noise);
  shared_int = shared_copy_data<int>(shared_int, shared->freq);
}
}
}

In the CPU version of the model we have a nice smart pointer to a struct (dust::shared_ptr<sirs>) from which we can access parameters by name (e.g., shared->alpha). There are no such niceties in CUDA, where we need a single contiguous block of memory. The dust::gpu::shared_copy_data function is a small utility that makes the bookkeeping here a bit easier (each call copies one value and returns the advanced destination pointer), but this could have been written out as:

namespace dust {
namespace gpu {
template <>
void shared_copy<sirs>(dust::shared_ptr<sirs> shared,
                       int * shared_int,
                       sirs::real_type * shared_real) {
  using real_type = sirs::real_type;
  shared_real[0] = shared->alpha;
  shared_real[1] = shared->beta;
  shared_real[2] = shared->gamma;
  shared_real[3] = shared->dt;
  shared_real[4] = shared->exp_noise;
  shared_int[0] = shared->freq;
}
}
}

The dust::gpu::shared_copy_data template has specialisations for the case where the object being copied is a vector, in which case every element is copied and the returned pointer is advanced past them all.
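For example, if a hypothetical model (not the sirs model above) carried a vector-valued shared parameter m, the same pattern would handle it; the names mymodel and m here are illustrative only:

// Hypothetical model with one scalar and one vector shared parameter:
// the size must count every element of the vector ...
template <>
size_t shared_real_size<mymodel>(dust::shared_ptr<mymodel> shared) {
  return 1 + shared->m.size();
}

// ... and a single shared_copy_data call copies the whole vector,
// returning the pointer advanced past it:
template <>
void shared_copy<mymodel>(dust::shared_ptr<mymodel> shared,
                          int * shared_int,
                          mymodel::real_type * shared_real) {
  using real_type = mymodel::real_type;
  using dust::gpu::shared_copy_data;
  shared_real = shared_copy_data<real_type>(shared_real, shared->alpha);
  shared_real = shared_copy_data<real_type>(shared_real, shared->m);
}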

There are two further functions, not used here, that define the size of per-particle internal storage space. These are required if your model needs to store intermediate calculations during the update step that will not fit on the stack.

namespace dust {
namespace gpu {
template <>
size_t internal_real_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 0;
}
template <>
size_t internal_int_size<sirs>(dust::shared_ptr<sirs> shared) {
  return 0;
}
}
}
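If, say, a hypothetical model needed a scratch vector of reals of length n_age per particle during its update, it would report that length here instead (mymodel and n_age are illustrative names):

// Hypothetical: reserve n_age reals of per-particle scratch space
template <>
size_t internal_real_size<mymodel>(dust::shared_ptr<mymodel> shared) {
  return shared->n_age;
}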

Most interestingly, we have the update_gpu function, which actually does the update on the GPU:

template <>
__device__
void update_gpu<sirs>(size_t time,
                      const dust::gpu::interleaved<sirs::real_type> state,
                      dust::gpu::interleaved<int> internal_int,
                      dust::gpu::interleaved<sirs::real_type> internal_real,
                      const int * shared_int,
                      const sirs::real_type * shared_real,
                      sirs::rng_state_type& rng_state,
                      dust::gpu::interleaved<sirs::real_type> state_next) {
  using real_type = sirs::real_type;
  const real_type alpha = shared_real[0];
  const real_type beta = shared_real[1];
  const real_type gamma = shared_real[2];
  const real_type dt = shared_real[3];
  const int freq = shared_int[0];
  const real_type S = state[0];
  const real_type I = state[1];
  const real_type R = state[2];
  const real_type N = S + I + R;
  const real_type p_SI = 1 - exp(- beta * I / N);
  const real_type p_IR = 1 - exp(- gamma);
  const real_type p_RS = 1 - exp(- alpha);
  const real_type n_SI = dust::random::binomial<real_type>(rng_state, S, p_SI * dt);
  const real_type n_IR = dust::random::binomial<real_type>(rng_state, I, p_IR * dt);
  const real_type n_RS = dust::random::binomial<real_type>(rng_state, R, p_RS * dt);
  state_next[0] = S - n_SI + n_RS;
  state_next[1] = I + n_SI - n_IR;
  state_next[2] = R + n_IR - n_RS;
  state_next[3] = (time % freq == 0) ? n_SI : state[3] + n_SI;
}

Note that this entirely duplicates the logic from the CPU version (which was a method of the sirs class):

  void update(size_t time, const real_type * state, rng_state_type& rng_state,
              real_type * state_next) {
    real_type S = state[0];
    real_type I = state[1];
    real_type R = state[2];
    real_type N = S + I + R;

    real_type p_SI = 1 - exp(- shared->beta * I / N);
    real_type p_IR = 1 - exp(-(shared->gamma));
    real_type p_RS = 1 - exp(- shared->alpha);

    real_type dt = shared->dt;
    real_type n_SI = dust::random::binomial<real_type>(rng_state, S, p_SI * dt);
    real_type n_IR = dust::random::binomial<real_type>(rng_state, I, p_IR * dt);
    real_type n_RS = dust::random::binomial<real_type>(rng_state, R, p_RS * dt);

    state_next[0] = S - n_SI + n_RS;
    state_next[1] = I + n_SI - n_IR;
    state_next[2] = R + n_IR - n_RS;
    state_next[3] = (time % shared->freq == 0) ? n_SI : state[3] + n_SI;
  }

Differences include:

- the __device__ attribute, which compiles the function for execution on the GPU
- per-particle state is accessed through dust::gpu::interleaved<> rather than through plain pointers, so that adjacent threads touch adjacent memory locations (see the sketch below)
- parameters are read out of the shared_int and shared_real arrays by position, rather than by name via the shared pointer
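To give a feel for what interleaved access means, here is a conceptual sketch; it is not dust's actual implementation, but it shows the kind of layout that makes GPU reads and writes coalesced:

// Conceptual sketch only: with n_particles particles, state element j of
// particle i lives at data[j * n_particles + i], so the threads for
// particles 0, 1, 2, ... read element j from adjacent addresses.
template <typename T>
struct interleaved_sketch {
  T* data;
  size_t particle;     // index of this thread's particle
  size_t n_particles;  // stride between successive state elements
  T& operator[](size_t j) {
    return data[j * n_particles + particle];
  }
};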

Data comparison functions

Finally, if running a particle filter on the GPU, a version of the compare_data function is required that can run on the GPU:

template <>
__device__
sirs::real_type compare_gpu<sirs>(const dust::gpu::interleaved<sirs::real_type> state,
                                  const sirs::data_type& data,
                                  dust::gpu::interleaved<int> internal_int,
                                  dust::gpu::interleaved<sirs::real_type> internal_real,
                                  const int * shared_int,
                                  const sirs::real_type * shared_real,
                                  sirs::rng_state_type& rng_state) {
  using real_type = sirs::real_type;
  const real_type exp_noise = shared_real[4];
  const real_type incidence_modelled = state[3];
  const real_type incidence_observed = data.incidence;
  const real_type lambda = incidence_modelled +
    dust::random::exponential<real_type>(rng_state, exp_noise);
  return dust::density::poisson(incidence_observed, lambda, true);
}

This is very similar to the CPU version:

  real_type compare_data(const real_type * state, const data_type& data,
                         rng_state_type& rng_state) {
    const real_type incidence_modelled = state[3];
    const real_type incidence_observed = data.incidence;
    const real_type lambda = incidence_modelled +
      dust::random::exponential<real_type>(rng_state, shared->exp_noise);
    return dust::density::poisson(incidence_observed, lambda, true);
  }

with the same sorts of differences as the update function: the __device__ attribute, interleaved access to per-particle state, and shared parameters read by position rather than by name.
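As a rough illustration of how the comparison is then driven from R, the sketch below uses made-up incidence values; set_data, compare_data and dust::dust_data are the same interface as for CPU models:

# Illustration only, with made-up data: dust::dust_data converts a
# data.frame with a 'time' column into the format set_data expects
dat <- data.frame(time = c(100, 200), incidence = c(120, 75))
mod <- sirs$new(list(), 0, n_particles, gpu_config = 0L, seed = 1L)
mod$set_data(dust::dust_data(dat))
mod$run(100)
mod$compare_data()  # one log-likelihood per particle, computed on the GPU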

Developing a GPU model

Debugging on a GPU is a pain, especially because there are typically many particles and error recovery is not straightforward. In addition, most continuous integration systems do not provide GPUs, so testing GPU code is difficult. To make this easier, dust allows the GPU code to be run on the CPU; this will typically be slower than the plain CPU code, but it allows easier debugging and verification that the model is behaving as expected. We use this extensively in dust's own tests, and also in models built on dust that will run on the GPU.

To do this, compile the model with your preferred real type, but set the gpu argument to FALSE:

sirs_cpu <- dust::dust(path, gpu = FALSE, real_type = "float")

Note that above the model is compiled with g++, not nvcc; however, the "GPU" code is still compiled into the model. We can then initialise it with and without a gpu_config argument:

model1 <- sirs_cpu$new(pars, 0, 10, seed = 1L)
model2 <- sirs_cpu$new(pars, 0, 10, gpu_config = 0L, seed = 1L)

And run the models using the "GPU" code and the normal CPU code, but this time both on the CPU:

model1$run(10)
model2$run(10)

These should be exactly the same, and this can form the basis of tests. Note that using float rather than double can throw up a few issues with algorithms, so it is worth running your tests in single precision too.
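Wrapped up as a unit test, that check might look like this sketch, using testthat:

# Sketch of a test that the "GPU" code agrees with the CPU code
library(testthat)
test_that("GPU code agrees with CPU code", {
  model1 <- sirs_cpu$new(list(), 0, 10, seed = 1L)
  model2 <- sirs_cpu$new(list(), 0, 10, gpu_config = 0L, seed = 1L)
  expect_identical(model1$run(10), model2$run(10))
})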

If you hit problems, then R -d cuda-gdb can be used in place of the usual R -d gdb to attach a debugger, though because of the huge numbers of particles you will typically work with, debugging remains a challenge. Our usual strategy has been to try to recreate any issue purely in CPU code and debug as normal (see this blog post for hints on doing this effectively).


