cmdstan_glm: Bayesian generalized linear models via CmdStan

View source: R/cmdstan_glm.R

cmdstan_glm    R Documentation

Bayesian generalized linear models via CmdStan

Description

Generalized linear modeling for Gaussian and gamma responses, with optional prior distributions for the coefficients, intercept, and auxiliary parameters.

Usage

cmdstan_glm(
  formula,
  family = gaussian(),
  data,
  weights,
  subset,
  na.action = NULL,
  offset = NULL,
  model = TRUE,
  algorithm = c("sampling", "meanfield", "fullrank"),
  x = FALSE,
  y = TRUE,
  contrasts = NULL,
  out_dir = NULL,
  ...,
  prior = default_prior_coef(family),
  prior_intercept = default_prior_intercept(family),
  prior_aux = exponential(autoscale = TRUE),
  prior_PD = FALSE,
  mean_PPD = !prior_PD,
  sparse = FALSE
)

Arguments

formula, data, subset

Same as glm, but we strongly advise against omitting the data argument.

family

Same as glm. Only continuous data can be handled by this function, so the families are gaussian(), Gamma(), inverse.gaussian(), and Beta regression (via mgcv::betar), with any (allowed) link function.

na.action, contrasts

Same as glm, but rarely specified.

model, offset, weights

Same as glm.

algorithm

The estimation algorithm. "sampling" (the default) uses MCMC, while "meanfield" and "fullrank" are variational algorithms ("meanfield" is the CmdStan default for variational inference).

x

Logical scalar indicating whether to return the design matrix.

y

Logical scalar indicating whether to return the response vector.

out_dir

Output directory for model fit environment.

...

Further arguments passed to cmdstanr::sample (e.g., refresh, iter_warmup, iter_sampling, chains).

prior

The prior distribution for the (non-hierarchical) regression coefficients.

The default priors are described in the vignette Prior Distributions for rstanarm Models. If not using the default, prior should be a call to one of the various functions provided by rstanarm for specifying priors. The subset of these functions that can be used for the prior on the coefficients can be grouped into several "families":

Family                          Functions
Student t family                normal, student_t, cauchy
Hierarchical shrinkage family   hs, hs_plus
Laplace family                  laplace, lasso
Product normal family           product_normal

See the rstanarm priors help page (http://mc-stan.org/rstanarm/reference/priors.html) for details on the families and how to specify the arguments for all of the functions in the table above. To omit a prior, i.e., to use a flat (improper) uniform prior, set prior to NULL, although this is rarely a good idea.

Note: Unless QR=TRUE, if prior is from the Student t family or Laplace family, and if the autoscale argument to the function used to specify the prior is left at its default and recommended value of TRUE, then the default or user-specified prior scale(s) may be adjusted internally based on the scales of the predictors.

prior_intercept

The prior distribution for the intercept (after centering all predictors, see note below).

The default prior is described in the vignette Prior Distributions for rstanarm Models. If not using the default, prior_intercept can be a call to normal, student_t or cauchy. To omit a prior on the intercept, i.e., to use a flat (improper) uniform prior, set prior_intercept to NULL.

Note: If using a dense representation of the design matrix (i.e., if the sparse argument is left at its default value of FALSE), then the prior distribution for the intercept is set so it applies to the value when all predictors are centered (you don't need to manually center them). This is explained further in the vignette Prior Distributions for rstanarm Models. If you prefer to specify a prior on the intercept without the predictors being auto-centered, then you have to omit the intercept from the formula and include a column of ones as a predictor, in which case some element of prior specifies the prior on it, rather than prior_intercept. Regardless of how prior_intercept is specified, the reported estimates of the intercept always correspond to a parameterization without centered predictors (i.e., same as in glm).
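
The relationship between the centered and uncentered intercepts described in the note above can be checked with ordinary least squares in base R. This is only a sketch of the algebra: it does not call cmdstan_glm, and all variable names are illustrative.

```r
# Same linear model, fitted with raw and with mean-centered predictors.
X  <- as.matrix(mtcars[, c("wt", "cyl", "am")])
Xc <- scale(X, center = TRUE, scale = FALSE)
y  <- mtcars$mpg / 10

fit_raw      <- lm(y ~ X)
fit_centered <- lm(y ~ Xc)

# Centering leaves the slopes unchanged.
slopes <- unname(coef(fit_centered)[-1])

# With centered predictors the intercept is the expected outcome at the
# predictor means, which for least squares equals mean(y).
alpha_centered <- unname(coef(fit_centered)[1])

# Back-transform to the intercept of the uncentered parameterization,
# which is the one reported in the output (as in glm).
alpha_raw <- alpha_centered - sum(slopes * colMeans(X))

c(centered = alpha_centered, raw = alpha_raw)
```

A prior placed on the centered intercept is therefore a prior on the expected outcome for an "average" observation, which is usually easier to reason about than the outcome when every predictor equals zero.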

prior_aux

The prior distribution for the "auxiliary" parameter (if applicable). The "auxiliary" parameter refers to a different parameter depending on the family. For Gaussian models prior_aux controls "sigma", the error standard deviation. For gamma models prior_aux sets the prior on the "shape" parameter (see e.g., rgamma), and for inverse-Gaussian models it is the so-called "lambda" parameter (which is essentially the reciprocal of a scale parameter).

The default prior is described in the vignette Prior Distributions for rstanarm Models. If not using the default, prior_aux can be a call to exponential to use an exponential distribution, or normal, student_t or cauchy, which results in a half-normal, half-t, or half-Cauchy prior. See the rstanarm priors help page (http://mc-stan.org/rstanarm/reference/priors.html) for details on these functions. To omit a prior, i.e., to use a flat (improper) uniform prior, set prior_aux to NULL.
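
To make the role of the gamma "shape" parameter concrete, the sketch below simulates gamma responses with base R's rgamma. It assumes the mean/shape parameterization used by rstanarm (rate = shape / mu); whether cmdstan_glm uses exactly the same parameterization is an assumption, and the values of mu and shape are arbitrary.

```r
set.seed(123)

mu    <- 2   # mean of the response (illustrative value)
shape <- 5   # the auxiliary parameter that prior_aux would act on

# Assumed parameterization: mean mu, variance mu^2 / shape.
y <- rgamma(1e5, shape = shape, rate = shape / mu)

c(sample_mean = mean(y),
  sample_var  = var(y),
  theory_var  = mu^2 / shape)
```

Under this parameterization a larger shape means less dispersion around the mean, so a prior that favours large shape values favours responses that sit close to the fitted mean.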

prior_PD

A logical scalar (defaulting to FALSE) indicating whether to draw from the prior predictive distribution instead of conditioning on the outcome.

mean_PPD

A logical value indicating whether the sample mean of the posterior predictive distribution of the outcome should be calculated in the generated quantities block. If TRUE then mean_PPD is computed and displayed as a diagnostic in the printed output. A useful heuristic is to check if mean_PPD is plausible when compared to mean(y). If it is plausible then this does not mean that the model is good in general (only that it can reproduce the sample mean), but if mean_PPD is implausible then there may be something wrong, e.g., severe model misspecification, problems with the data and/or priors, computational issues, etc.
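
The heuristic can be illustrated without fitting a Stan model. The sketch below builds a stand-in for posterior predictive draws from plug-in least-squares estimates (so it ignores parameter uncertainty and is not what cmdstan_glm computes internally), then compares the per-draw means with mean(y). All object names are illustrative.

```r
set.seed(1)

y   <- mtcars$mpg / 10
fit <- lm(y ~ wt + cyl + am, data = mtcars)

mu      <- fitted(fit)          # plug-in conditional means
sigma   <- summary(fit)$sigma   # plug-in residual standard deviation
n_draws <- 1000

# Rows are draws, columns are observations.
yrep <- matrix(
  rnorm(n_draws * length(y), mean = rep(mu, each = n_draws), sd = sigma),
  nrow = n_draws
)

# One predictive sample mean per draw, analogous to mean_PPD.
mean_ppd <- rowMeans(yrep)

quantile(mean_ppd, c(0.05, 0.5, 0.95))
mean(y)
```

If mean(y) falls well outside the bulk of the mean_ppd distribution, that is the kind of implausibility the argument description warns about.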

sparse

A logical scalar (defaulting to FALSE) indicating whether to use a sparse representation of the design (X) matrix. If TRUE, the design matrix is not centered (since that would destroy the sparsity) and likewise it is not possible to specify both QR = TRUE and sparse = TRUE. Depending on how many zeros there are in the design matrix, setting sparse = TRUE may make the code run faster and can consume much less RAM.
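
A small base R sketch shows why centering is skipped for sparse design matrices: subtracting the column means turns almost every zero into a nonzero value. The simulated matrix and its dimensions are purely illustrative.

```r
set.seed(42)

n <- 1000
p <- 20

# A mostly-zero design matrix of indicator columns (about 5% ones).
X <- matrix(rbinom(n * p, size = 1, prob = 0.05), nrow = n)
frac_nonzero_raw <- mean(X != 0)

# Centering subtracts each column mean, so zeros become -mean(column).
Xc <- scale(X, center = TRUE, scale = FALSE)
frac_nonzero_centered <- mean(Xc != 0)

c(raw = frac_nonzero_raw, centered = frac_nonzero_centered)
```

After centering, a sparse storage format would have to store every entry, which removes the memory and speed advantage.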

Details

The cmdstan_glm function is similar in syntax to glm but rather than performing maximum likelihood estimation of generalized linear models, full Bayesian estimation is performed (if algorithm is "sampling") via MCMC. The Bayesian model adds priors (independent by default) on the coefficients of the GLM.
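
For comparison, the maximum likelihood counterpart of the model used in the Examples section can be fitted with glm from base R. This sketch only shows the glm side; per the Usage section, cmdstan_glm accepts the same formula, family, and data arguments. The copy dat is introduced here to avoid modifying mtcars.

```r
dat <- mtcars
dat$mpg10 <- dat$mpg / 10

# Maximum likelihood fit: point estimates only, no priors.
ml_fit <- glm(mpg10 ~ wt + cyl + am, family = gaussian(), data = dat)

coef(ml_fit)
```

The Bayesian fit returns draws for each of these coefficients rather than a single point estimate, so the values above are a convenient sanity check on posterior means when the priors are weak.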

Value

A cmdstanr::CmdStanMCMC() object.

Examples

## Not run: 
# Linear regression
mtcars$mpg10 <- mtcars$mpg / 10
fit <- cmdstan_glm(
  mpg10 ~ wt + cyl + am,
  data = mtcars,
  # for speed of example only (default is "sampling")
  algorithm = "fullrank",
  refresh = 0
)

## End(Not run)


qdercon/pstpipeline documentation built on June 1, 2025, 1:11 p.m.