compute_SBC: Fit datasets and evaluate diagnostics and SBC metrics.
In hyunjimoon/SBC: Simulation Based Calibration for Bayesian models

compute_SBC

R Documentation

Description

Performs the main SBC routine given datasets and a backend.

Usage

compute_SBC(
  datasets,
  backend,
  cores_per_fit = default_cores_per_fit(length(datasets)),
  keep_fits = TRUE,
  thin_ranks = SBC_backend_default_thin_ranks(backend),
  ensure_num_ranks_divisor = 2,
  chunk_size = default_chunk_size(length(datasets)),
  dquants = NULL,
  cache_mode = "none",
  cache_location = NULL,
  globals = list(),
  gen_quants = NULL
)

Arguments

`datasets`	an object of class `SBC_datasets`
`backend`	the model + sampling algorithm. The built-in backends can be constructed using `SBC_backend_cmdstan_sample()`, `SBC_backend_cmdstan_variational()`, `SBC_backend_rstan_sample()`, `SBC_backend_rstan_optimizing()` and `SBC_backend_brms()` (more to come: issue 31, 38, 39). The backend is an S3 class supporting at least the `SBC_fit()` and `SBC_fit_to_draws_matrix()` methods.
`cores_per_fit`	how many cores should the backend be allowed to use for a single fit? Defaults to the maximum number that does not produce more parallel chains than you have cores. See `default_cores_per_fit()`.
`keep_fits`	boolean. When `FALSE`, full fits are discarded from memory. This reduces memory consumption and increases speed (when processing in parallel), but prevents you from inspecting the fits and using `recompute_SBC_statistics()`. We recommend setting it to `TRUE` in the early phases of a workflow, when you run just a few fits. Once the model is stable and you want to run a lot of iterations, we recommend setting it to `FALSE` (even for quite a simple model, 1000 fits can easily exhaust 32GB of RAM).
`thin_ranks`	how much thinning should be applied to posterior draws before computing ranks for SBC. Should be large enough to avoid any noticeable autocorrelation of the thinned draws. See details below.
`ensure_num_ranks_divisor`	Potentially drop some posterior samples to ensure that this number divides the total number of SBC ranks (see Details).
`chunk_size`	How many simulations within the `datasets` shall be processed in one batch by the same worker. Relevant only when using parallel processing. The larger the value, the less overhead there will be for parallel processing, but the work may be distributed less equally across workers. We recommend setting this high enough that a single batch takes at least several seconds, i.e. for small models, you can often reduce computation time noticeably by increasing this value. You can use `options(SBC.min_chunk_size = value)` to set a minimum chunk size globally. See the documentation of the `future.chunk.size` argument for `future.apply::future_lapply()` for more details.
`dquants`	Derived quantities to include in SBC. Use `derived_quantities()` to construct them.
`cache_mode`	Type of caching of results. Currently the only supported modes are `"none"` (do not cache) and `"results"`, where the whole results object is stored and recomputed only when the hash of the backend or dataset changes.
`cache_location`	The filesystem location of cache. For `cache_mode = "results"` this should be a name of a single file. If the file name does not end with `.rds`, this extension is appended.
`globals`	A list of names of objects that are defined in the global environment and need to be present for the backend to work (if they are not already available in a package). It is added to the `globals` argument to `future::future()`, to make those objects available on all workers.
`gen_quants`	Deprecated, use `dquants` instead.
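
As a hedged illustration of how several of these arguments combine for a large run, the sketch below wraps a call to compute_SBC() that discards the fits and caches the results in a file. The wrapper name run_large_sbc and the cache file name are hypothetical; it assumes the SBC package is installed and that datasets and backend were created beforehand.

```r
# Hypothetical helper: run many fits while keeping memory use low.
# Assumes `datasets` is an SBC_datasets object and `backend` is a valid
# SBC backend, e.g. built with one of the constructors listed above.
run_large_sbc <- function(datasets, backend,
                          cache_file = "sbc_results_cache.rds") {
  SBC::compute_SBC(
    datasets,
    backend,
    keep_fits = FALSE,           # discard full fits to save memory
    cache_mode = "results",      # store the whole results object
    cache_location = cache_file  # a single file name for the cache
  )
}
```

With keep_fits = FALSE the individual fits cannot be inspected afterwards, so a setup like this is probably best reserved for the stage where the model is already stable.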

Value

An object of class SBC_results().

Parallelization

Parallel processing is supported via the future package. For most uses, it is most sensible to call plan(multisession) once in your R session, after which all cores of your computer will be used. For more details, refer to the documentation of the future package.
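
A minimal sketch of setting up and later restoring a parallel plan, assuming the future package is installed. Limiting the plan to two workers and the variable names are illustrative choices only, not recommendations from the package.

```r
# Only runs when the future package is available.
if (requireNamespace("future", quietly = TRUE)) {
  # Switch to background R sessions; plan() returns the previous plan.
  old_plan <- future::plan(future::multisession, workers = 2)

  # Calls to compute_SBC() made while this plan is active would
  # distribute the fits across the workers.
  n_workers <- future::nbrOfWorkers()

  # Restore whatever plan was active before.
  future::plan(old_plan)
}
```

Calling plan(multisession) without the workers argument lets future choose the number of workers itself.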

Thinning

When using backends based on MCMC, there are two possible moments when draws may need to be thinned. They can be thinned directly within the backend, and they may be thinned only to compute the ranks for SBC, as specified by the thin_ranks argument. The main reason those are separate is that computing the ranks requires no or negligible autocorrelation, while some autocorrelation may be easily tolerated for summarising the fit results or assessing convergence. In fact, thinning too aggressively in the backend may lead to overly noisy estimates of posterior means, quantiles and the posterior::rhat() and posterior::ess_tail() diagnostics. So for well-adapted Hamiltonian Monte Carlo chains (e.g. Stan-based backends), we recommend no thinning in the backend, and even a value of thin_ranks between 6 and 10 is usually sufficient to remove the residual autocorrelation. For a backend based on Metropolis-Hastings, it might be sensible to thin quite aggressively already in the backend and then have some additional thinning via thin_ranks.

Backends that don't require thinning should implement SBC_backend_iid_draws() or SBC_backend_default_thin_ranks() to avoid thinning by default.
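
To make the effect of thinning before computing ranks concrete, here is a small base R sketch. It uses a simulated AR(1) series as a stand-in for autocorrelated MCMC draws. The helpers thin_draws and lag1_autocor, the AR coefficient and the "true" value are all hypothetical, and this is not how the package computes ranks internally.

```r
set.seed(1)

# Simulated autocorrelated chain (AR(1), coefficient 0.8) standing in
# for MCMC draws of a single parameter.
n_draws <- 4000
draws <- as.numeric(arima.sim(model = list(ar = 0.8), n = n_draws))

# Hypothetical helper: keep every `thin`-th draw.
thin_draws <- function(x, thin) x[seq(1, length(x), by = thin)]

# Hypothetical helper: lag-1 autocorrelation estimate.
lag1_autocor <- function(x) acf(x, lag.max = 1, plot = FALSE)$acf[2]

thinned <- thin_draws(draws, thin = 10)

lag1_autocor(draws)    # strongly autocorrelated, near the AR coefficient
lag1_autocor(thinned)  # much closer to zero after thinning

# Rank of a simulated "true" value among the thinned draws.
true_value <- 0.3
sbc_rank <- sum(thinned < true_value)
```

The unthinned draws would still be the better input for posterior summaries and convergence diagnostics; only the rank computation benefits from the thinned set.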

Rank divisors

Some of the visualizations and post-processing steps we use in the SBC package (e.g. plot_rank_hist(), empirical_coverage()) work best if the total number of possible SBC ranks is a "nice" number (lots of divisors). However, the number of ranks is one plus the number of posterior samples after thinning. Therefore, as long as the number of samples is a "nice" number, the number of ranks usually will not be. To remedy this, you can specify ensure_num_ranks_divisor: the method will drop at most ensure_num_ranks_divisor - 1 samples to make the number of ranks divisible by ensure_num_ranks_divisor. The default 2 prevents the most annoying pathologies while discarding at most a single sample.
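
The arithmetic can be sketched in base R as follows. The helper draws_to_drop is hypothetical and only illustrates the counting argument above; it is not the package's own implementation.

```r
# Hypothetical helper: how many draws must be dropped so that the number
# of ranks (draws + 1) becomes divisible by `divisor`.
draws_to_drop <- function(n_draws, divisor) {
  n_ranks <- n_draws + 1
  n_ranks %% divisor  # always between 0 and divisor - 1
}

# 1000 posterior draws thinned by 10 leave 100 draws, i.e. 101 ranks.
n_draws <- 1000 / 10
drop <- draws_to_drop(n_draws, divisor = 2)

# After dropping, the number of ranks is divisible by the divisor.
n_ranks_after <- (n_draws - drop) + 1
```

Here a single draw is dropped, leaving 99 draws and 100 possible ranks.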

hyunjimoon/SBC documentation built on Feb. 17, 2025, 3:25 a.m.