CmdStanR (Command Stan R) is a lightweight interface to Stan for R users that provides an alternative to the traditional RStan interface. See the Comparison with RStan section later in this vignette for more details on how the two interfaces differ.
Using CmdStanR requires installing the cmdstanr R package and also CmdStan, the command line interface to Stan. First we install the R package by running the following command in R.
# we recommend running this is a fresh R session or restarting your current session install.packages("cmdstanr", repos = c("https://mc-stan.org/r-packages/", getOption("repos")))
We can now load the package like any other R package. We'll also load the bayesplot and posterior packages to use later in examples.
library(cmdstanr) library(posterior) library(bayesplot) color_scheme_set("brightblue")
CmdStanR requires a working installation of CmdStan, the shell interface to Stan. If you don't have CmdStan installed then CmdStanR can install it for you, assuming you have a suitable C++ toolchain. The requirements are described in the CmdStan Guide:
To double check that your toolchain is set up properly you can call
the check_cmdstan_toolchain()
function:
check_cmdstan_toolchain()
If your toolchain is configured correctly then CmdStan can be installed by
calling the
install_cmdstan()
function:
if (!dir.exists(cmdstan_default_path())) { install_cmdstan() }
install_cmdstan(cores = 2)
Before CmdStanR can be used it needs to know where the CmdStan installation is located. When the package is loaded it tries to help automate this to avoid having to manually set the path every session:
If the environment variable "CMDSTAN"
exists at load time then its value
will be automatically set as the default path to CmdStan for the R session. This
is useful if your CmdStan installation is not located in the default directory
that would have been used by install_cmdstan()
(see #2).
If no environment variable is found when loaded but any directory in the form
".cmdstan/cmdstan-[version]"
, for example ".cmdstan/cmdstan-2.23.0"
,
exists in the user's home directory (Sys.getenv("HOME")
,
not the current working directory) then the path to the CmdStan with the
largest version number will be set as the path to CmdStan for the R session.
This is the same as the default directory that install_cmdstan()
uses to
install the latest version of CmdStan, so if that's how you installed CmdStan
you shouldn't need to manually set the path to CmdStan when loading CmdStanR.
If neither of these applies (or you want to subsequently change the path) you
can use the set_cmdstan_path()
function:
set_cmdstan_path(PATH_TO_CMDSTAN)
To check the path to the CmdStan installation and the CmdStan version number
you can use cmdstan_path()
and cmdstan_version()
:
cmdstan_path() cmdstan_version()
The cmdstan_model()
function creates a new
CmdStanModel
object from a file containing a Stan program. Under the hood, CmdStan is called
to translate a Stan program to C++ and create a compiled executable. Here we'll
use the example Stan program that comes with the CmdStan installation:
file <- file.path(cmdstan_path(), "examples", "bernoulli", "bernoulli.stan") mod <- cmdstan_model(file)
The object mod
is an R6 reference object of class
CmdStanModel
and
behaves similarly to R's reference class objects and those in object oriented
programming languages. Methods are accessed using the $
operator. This design
choice allows for CmdStanR and
CmdStanPy to provide a similar user
experience and share many implementation details.
The Stan program can be printed using the $print()
method:
mod$print()
The path to the compiled executable is returned by the $exe_file()
method:
mod$exe_file()
The
$sample()
method for
CmdStanModel
objects runs Stan's default MCMC algorithm. The data
argument accepts a named
list of R objects (like for RStan) or a path to a data file compatible with
CmdStan (JSON or R dump).
# names correspond to the data block in the Stan program data_list <- list(N = 10, y = c(0,1,0,0,0,0,0,0,0,1)) fit <- mod$sample( data = data_list, seed = 123, chains = 4, parallel_chains = 4, refresh = 500 # print update every 500 iters )
There are many more arguments that can be passed to the $sample()
method.
For details follow this link to its separate documentation page:
The $sample()
method creates R6 CmdStanMCMC
objects, which have many associated methods. Below we will demonstrate some of
the most important methods. For a full list, follow this link to the
CmdStanMCMC
documentation:
The
$summary()
method calls summarise_draws()
from the posterior package. The
first argument specifies the variables to summarize and any arguments
after that are passed on to posterior::summarise_draws()
to specify
which summaries to compute, whether to use multiple cores, etc.
fit$summary() fit$summary(variables = c("theta", "lp__"), "mean", "sd") # use a formula to summarize arbitrary functions, e.g. Pr(theta <= 0.5) fit$summary("theta", pr_lt_half = ~ mean(. <= 0.5)) # summarise all variables with default and additional summary measures fit$summary( variables = NULL, posterior::default_summary_measures(), extra_quantiles = ~posterior::quantile2(., probs = c(.0275, .975)) )
# NOTE: the hack of using print.data.frame in chunks with echo=FALSE # is used because the pillar formatting of posterior draws_summary objects # isn't playing nicely with pkgdown::build_articles(). options(digits = 2) print.data.frame(fit$summary()) print.data.frame(fit$summary(variables = c("theta", "lp__"), "mean", "sd")) print.data.frame(fit$summary("theta", pr_lt_half = ~ mean(. <= 0.5))) print.data.frame(fit$summary( variables = NULL, posterior::default_summary_measures(), extra_quantiles = ~posterior::quantile2(., probs = c(.0275, .975)) ))
CmdStan itself provides a stansummary
utility that can be called using the
$cmdstan_summary()
method. This method will print summaries but won't return
anything.
The $draws()
method can be used to extract the posterior draws in formats provided by the
posterior package. Here we demonstrate
only the draws_array
and draws_df
formats, but the posterior package
supports other useful formats as well.
# default is a 3-D draws_array object from the posterior package # iterations x chains x variables draws_arr <- fit$draws() # or format="array" str(draws_arr) # draws x variables data frame draws_df <- fit$draws(format = "df") str(draws_df) print(draws_df)
To convert an existing draws object to a different format use the
posterior::as_draws_*()
functions.
# this should be identical to draws_df created via draws(format = "df") draws_df_2 <- as_draws_df(draws_arr) identical(draws_df, draws_df_2)
In general, converting to a different draws format in this way will be slower
than just setting the appropriate format initially in the call to the $draws()
method, but in most cases the speed difference will be minor.
The vignette
Working with Posteriors
has more details on posterior draws, including how to reproduce the structured
output RStan users are accustomed to getting from rstan::extract()
.
Plotting posterior distributions is as easy as passing the object returned by
the $draws()
method directly to plotting functions in our
bayesplot package.
mcmc_hist(fit$draws("theta"))
The
$sampler_diagnostics()
method extracts the values of the sampler parameters (treedepth__
,
divergent__
, etc.) in formats supported by the posterior package. The
default is as a 3-D array (iteration x chain x variable).
# this is a draws_array object from the posterior package str(fit$sampler_diagnostics()) # this is a draws_df object from the posterior package str(fit$sampler_diagnostics(format = "df"))
The $diagnostic_summary()
method will display any sampler diagnostic warnings and return a summary of diagnostics for each chain.
fit$diagnostic_summary()
We see the number of divergences for each of the four chains, the number of times the maximum treedepth was hit for each chain, and the E-BFMI for each chain.
In this case there were no warnings, so in order to demonstrate the warning messages we'll use one of the CmdStanR example models that suffers from divergences.
fit_with_warning <- cmdstanr_example("schools")
After fitting there is a warning about divergences. We can also regenerate this warning message later using fit$diagnostic_summary()
.
diagnostics <- fit_with_warning$diagnostic_summary() print(diagnostics) # number of divergences reported in warning is the sum of the per chain values sum(diagnostics$num_divergent)
CmdStan itself provides a diagnose
utility that can be called using
the $cmdstan_diagnose()
method. This method will print warnings but won't return anything.
stanfit
objectIf you have RStan installed then it is also possible to create a stanfit
object from the csv output files written by CmdStan. This can be done by using
rstan::read_stan_csv()
in combination with the $output_files()
method of the
CmdStanMCMC
object. This is only needed if you want to fit a model with
CmdStanR but already have a lot of post-processing code that assumes a stanfit
object. Otherwise we recommend using the post-processing functionality provided
by CmdStanR itself.
stanfit <- rstan::read_stan_csv(fit$output_files())
CmdStanR also supports running Stan's optimization algorithms and its algorithms
for variational approximation of full Bayesian inference. These are run via the
$optimize()
, $laplace()
, $variational()
, and $pathfinder()
methods, which
are called in a similar way to the $sample()
method demonstrated above.
We can find the (penalized) maximum likelihood estimate (MLE) using $optimize()
.
fit_mle <- mod$optimize(data = data_list, seed = 123) fit_mle$print() # includes lp__ (log prob calculated by Stan program) fit_mle$mle("theta")
Here's a plot comparing the penalized MLE to the posterior distribution of
theta
.
mcmc_hist(fit$draws("theta")) + vline_at(fit_mle$mle("theta"), size = 1.5)
For optimization, by default the mode is calculated without the Jacobian
adjustment for constrained variables, which shifts the mode due to the change of
variables. To include the Jacobian adjustment and obtain a maximum a posteriori
(MAP) estimate set jacobian=TRUE
. See the
Maximum Likelihood Estimation
section of the CmdStan User's Guide for more details.
fit_map <- mod$optimize( data = data_list, jacobian = TRUE, seed = 123 )
The $laplace()
method produces a sample from a normal approximation centered at the mode of a
distribution in the unconstrained space. If the mode is a MAP estimate, the
samples provide an estimate of the mean and standard deviation of the posterior
distribution. If the mode is the MLE, the sample provides an estimate of the
standard error of the likelihood. Whether the mode is the MAP or MLE depends on
the value of the jacobian
argument when running optimization. See the
Laplace Sampling
chapter of the CmdStan User's Guide for more details.
Here we pass in the fit_map
object from above as the mode
argument. If
mode
is omitted then optimization will be run internally before taking draws
from the normal approximation.
fit_laplace <- mod$laplace( mode = fit_map, draws = 4000, data = data_list, seed = 123, refresh = 1000 ) fit_laplace$print("theta") mcmc_hist(fit_laplace$draws("theta"), binwidth = 0.025)
We can run Stan's experimental Automatic Differentiation Variational Inference
(ADVI) using the $variational()
method. For details on the ADVI algorithm see the
CmdStan User's Guide.
fit_vb <- mod$variational( data = data_list, seed = 123, draws = 4000 ) fit_vb$print("theta") mcmc_hist(fit_vb$draws("theta"), binwidth = 0.025)
Stan version 2.33 introduced a new variational method called Pathfinder,
which is intended to be faster and more stable than ADVI. For details on how
Pathfinder works see the section in the
CmdStan User's Guide.
Pathfinder is run using the $pathfinder()
method.
fit_pf <- mod$pathfinder( data = data_list, seed = 123, draws = 4000 ) fit_pf$print("theta")
Let's extract the draws, make the same plot we made after running the other algorithms, and compare them all. approximation, and compare them all. In this simple example the distributions are quite similar, but this will not always be the case for more challenging problems.
mcmc_hist(fit_pf$draws("theta"), binwidth = 0.025) + ggplot2::labs(subtitle = "Approximate posterior from pathfinder") + ggplot2::xlim(0, 1)
mcmc_hist(fit_vb$draws("theta"), binwidth = 0.025) + ggplot2::labs(subtitle = "Approximate posterior from variational") + ggplot2::xlim(0, 1)
mcmc_hist(fit_laplace$draws("theta"), binwidth = 0.025) + ggplot2::labs(subtitle = "Approximate posterior from Laplace") + ggplot2::xlim(0, 1)
mcmc_hist(fit$draws("theta"), binwidth = 0.025) + ggplot2::labs(subtitle = "Posterior from MCMC") + ggplot2::xlim(0, 1)
For more details on the $optimize()
, $laplace()
, $variational()
, and
pathfinder()
methods, follow these links to their documentation pages.
The $save_object()
method provided by CmdStanR is the most convenient way to save a fitted model object
to disk and ensure that all of the contents are available when reading the object back into R.
fit$save_object(file = "fit.RDS") # can be read back in using readRDS fit2 <- readRDS("fit.RDS")
But if your model object is large, then
$save_object()
could take a long time.
$save_object()
reads the CmdStan results files into memory, stores them in the model object,
and saves the object with saveRDS()
. To speed up the process, you can emulate
$save_object()
and replace saveRDS
with the much faster qsave()
function from the
qs
package.
# Load CmdStan output files into the fitted model object. fit$draws() # Load posterior draws into the object. try(fit$sampler_diagnostics(), silent = TRUE) # Load sampler diagnostics. try(fit$init(), silent = TRUE) # Load user-defined initial values. try(fit$profiles(), silent = TRUE) # Load profiling samples. # Save the object to a file. qs::qsave(x = fit, file = "fit.qs") # Read the object. fit2 <- qs::qread("fit.qs")
Storage is even faster if you discard results you do not need to save. The following example saves only posterior draws and discards sampler diagnostics, user-specified initial values, and profiling data.
# Load posterior draws into the fitted model object and omit other output. fit$draws() # Save the object to a file. qs::qsave(x = fit, file = "fit.qs") # Read the object. fit2 <- qs::qread("fit.qs")
See the vignette How does CmdStanR work? for more information about the composition of CmdStanR objects.
There are additional vignettes available that discuss other aspects of using CmdStanR. These can be found online at the CmdStanR website:
To ask a question please post on the Stan forums:
To report a bug, suggest a feature (including additions to these vignettes), or to start contributing to CmdStanR development (new contributors welcome!) please open an issue on GitHub:
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.