stan_glm: Generalized linear models
In geostan: Bayesian Spatial Analysis

stan_glm

R Documentation

Generalized linear models

Description

Fit a generalized linear model.

Usage

stan_glm(
  formula,
  slx,
  re,
  data,
  C,
  family = gaussian(),
  prior = NULL,
  ME = NULL,
  centerx = FALSE,
  prior_only = FALSE,
  censor_point,
  chains = 4,
  iter = 2000,
  refresh = 1000,
  keep_all = FALSE,
  slim = FALSE,
  drop = NULL,
  pars = NULL,
  control = NULL,
  quiet = FALSE,
  ...
)

Arguments

`formula`	A model formula, following the R formula syntax. Binomial models are specified by setting the left hand side of the equation to a two-column matrix of successes and failures, as in `cbind(successes, failures) ~ x`.
`slx`	Formula to specify any spatially-lagged covariates. As in, `~ x1 + x2` (the intercept term will be removed internally). When setting priors for `beta`, remember to include priors for any SLX terms.
`re`	To include a varying intercept (or "random effects") term, `alpha_re`, specify the grouping variable here using formula syntax, as in `~ ID`. Then `alpha_re` is a vector of parameters added to the linear predictor of the model, with `alpha_re ~ N(0, alpha_tau)` and `alpha_tau ~ Student_t(d.f., location, scale)`.
`data`	A `data.frame` or an object coercible to a data frame by `as.data.frame` containing the model data.
`C`	Spatial connectivity matrix which will be used to calculate residual spatial autocorrelation as well as any user specified `slx` terms. See `shape2mat`.
`family`	The likelihood function for the outcome variable. Current options are `poisson(link = "log")`, `binomial(link = "logit")`, `student_t()`, and the default `gaussian()`.
`prior`	A named list of parameters for prior distributions (see `priors`):
intercept: The intercept is assigned a Gaussian prior distribution (see `normal`).
beta: Regression coefficients are assigned Gaussian prior distributions. Variables must follow their order of appearance in the model `formula`. Note that if you also use `slx` terms (spatially lagged covariates), and you use custom priors for `beta`, then you have to provide priors for the slx terms. Since slx terms are prepended to the design matrix, the prior for the slx term will be listed first.
sigma: For `family = gaussian()` and `family = student_t()` models, the scale parameter, `sigma`, is assigned a (half-) Student's t prior distribution. The half-Student's t prior for `sigma` is constrained to be positive.
nu: `nu` is the degrees of freedom parameter in the Student's t likelihood (only used when `family = student_t()`). `nu` is assigned a gamma prior distribution. The default prior is `prior = list(nu = gamma2(alpha = 3, beta = 0.2))`.
tau: The scale parameter for random effects, or varying intercepts, terms. This scale parameter, `tau`, is assigned a half-Student's t prior. To set this, use, e.g., `prior = list(tau = student_t(df = 20, location = 0, scale = 20))`.
`ME`	To model observational uncertainty (i.e. measurement or sampling error) in any or all of the covariates, provide a list of data as constructed by the `prep_me_data` function.
`centerx`	To center predictors on their mean values, use `centerx = TRUE`. If the ME argument is used, the modeled covariate (i.e., latent variable), rather than the raw observations, will be centered. When using the ME argument, this is the recommended method for centering the covariates.
`prior_only`	Draw samples from the prior distributions of parameters only.
`censor_point`	Integer value indicating the maximum censored value; this argument is for modeling censored (suppressed) outcome data, typically disease case counts or deaths. For example, the US Centers for Disease Control and Prevention censors (does not report) death counts that are nine or fewer, so if you're using CDC WONDER mortality data you could provide `censor_point = 9`.
`chains`	Number of MCMC chains to estimate.
`iter`	Number of samples per chain.
`refresh`	Stan will print the progress of the sampler every `refresh` number of samples; set `refresh=0` to silence this.
`keep_all`	If `keep_all = TRUE` then samples for all parameters in the Stan model will be kept; this is required if you want to do model comparison with Bayes factors and the `bridgesampling` package.
`slim`	If `slim = TRUE`, then the Stan model will not collect the most memory-intensive parameters (including n-length vectors of fitted values, log-likelihoods, and ME-modeled covariate values). This will disable many convenience functions that are otherwise available for fitted `geostan` models, such as the extraction of residuals, fitted values, and spatial trends, as well as WAIC, spatial diagnostics, and ME diagnostics. Many quantities of interest, such as fitted values and spatial trends, can still be calculated manually using the given parameter estimates. The "slim" option is designed for data-intensive routines, such as regression with raster data, Monte Carlo studies, and measurement error models. For more control over which parameters are kept or dropped, use the `drop` argument instead of `slim`.
`drop`	Provide a vector of character strings to specify the names of any parameters that you do not want MCMC samples for. Dropping parameters in this way can improve sampling speed and reduce memory usage. The following parameter vectors can potentially be dropped from GLM models:
'fitted': The N-length vector of fitted values.
'alpha_re': Vector of 'random effects'/varying intercepts.
'x_true': N-length vector of 'latent'/modeled covariate values created for measurement error (ME) models.
Using `drop = c('fitted', 'alpha_re', 'x_true')` is equivalent to `slim = TRUE`. If `slim = TRUE`, then `drop` will be ignored.
`pars`	Specify any additional parameters you'd like stored from the Stan model.
`control`	A named list of parameters to control the sampler's behavior. See stan for details.
`quiet`	Controls (most) automatic printing to the console. By default, any prior distributions that have not been assigned by the user are printed to the console. If `quiet = TRUE`, these will not be printed. Using `quiet = TRUE` will also force `refresh = 0`.
`...`	Other arguments passed to sampling.

Details

Fit a generalized linear model using the R formula interface. Default prior distributions are designed to be weakly informative relative to the data. Much of the functionality intended for spatial models, such as the ability to add spatially lagged covariates and observational error models, is also available in stan_glm. All of geostan's spatial models build on top of the same Stan code used in stan_glm.

Spatially lagged covariates (SLX)

The slx argument is a convenience for including SLX terms. For example,

y = W X \gamma + X \beta + \epsilon

where W is a row-standardized spatial weights matrix (see shape2mat), WX is the mean neighboring value of X, and \gamma is a coefficient vector. This specifies a regression with spatially lagged covariates. SLX terms can be specified by providing a formula to the slx argument:

stan_glm(y ~ x1 + x2, slx = ~ x1 + x2, ...),

which is a shortcut for

stan_glm(y ~ I(W %*% x1) + I(W %*% x2) + x1 + x2, ...)

SLX terms will always be prepended to the design matrix, as above, which is important to know when setting prior distributions for regression coefficients.

For measurement error (ME) models, the slx argument is the only way to include spatially lagged covariates, since the SLX term needs to be re-calculated on each iteration of the MCMC algorithm.
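The spatial lag WX described above can be illustrated with a minimal base-R sketch. The four-area adjacency matrix, the covariate values, and the column name W_x are hypothetical and chosen only for illustration; geostan builds these terms internally when slx is used.

```r
# Binary adjacency matrix for four areas arranged in a line: 1 - 2 - 3 - 4
# (hypothetical example data)
A <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0),
            nrow = 4, byrow = TRUE)

# Row-standardize so that each row sums to one
W <- A / rowSums(A)

# A covariate observed in each area (hypothetical values)
x <- c(10, 20, 30, 40)

# Spatial lag: the mean of x over each area's neighbors
Wx <- as.numeric(W %*% x)
print(Wx)

# Design matrix with the lagged term first, mirroring the prepended SLX columns
# (the column name W_x is an assumption of this sketch)
X_slx <- cbind(W_x = Wx, x = x)
print(X_slx)
```

Because the lagged column comes first, a custom prior for beta in this two-column case would list the SLX coefficient before the coefficient on x.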

Measurement error (ME) models

The ME models are designed for data from surveys with spatial sampling designs, such as American Community Survey (ACS) estimates. For a tutorial, see vignette("spatial-me-models", package = "geostan").

Given estimates x, their standard errors s, and the target quantity of interest (i.e., the unknown true value) z, the ME models have one of the following two specifications, depending on the user input. If a spatial CAR model is specified, then:

x \sim Gauss(z, s^2)

z \sim Gauss(\mu_z, \Sigma_z)

\Sigma_z = (I - \rho_z C)^{-1} M

\mu_z \sim Gauss(0, 100)

\tau_z \sim Student(10, 0, 40), \tau_z > 0

\rho_z \sim uniform(l, u)

where \Sigma_z specifies the covariance matrix of a spatial conditional autoregressive (CAR) model with scale parameter \tau_z (on the diagonal of M) and autocorrelation parameter \rho_z, and l and u are the lower and upper bounds that \rho_z is permitted to take (which are determined by the extreme eigenvalues of the spatial connectivity matrix C). M contains the inverse of the row sums of C on its diagonal multiplied by \tau_z (following the "WCAR" specification).
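As a rough numerical illustration of this covariance structure, the base-R sketch below builds a CAR covariance matrix for a hypothetical four-area map. Several choices here are assumptions of the sketch rather than statements about geostan's internals: the adjacency matrix, the parameter values, taking the bounds as the reciprocals of the extreme eigenvalues, and placing \tau^2 (rather than \tau) on the diagonal of M so that M is on the variance scale.

```r
# Binary adjacency matrix for four areas in a line: 1 - 2 - 3 - 4
# (hypothetical example data)
A <- matrix(c(0, 1, 0, 0,
              1, 0, 1, 0,
              0, 1, 0, 1,
              0, 0, 1, 0),
            nrow = 4, byrow = TRUE)
d <- rowSums(A)   # number of neighbors of each area
C <- A / d        # row-standardized connectivity matrix

# Bounds for rho, taken as reciprocals of the extreme eigenvalues of C.
# C shares its eigenvalues with the symmetric matrix D^{-1/2} A D^{-1/2},
# which is used here so the eigenvalues are computed as real numbers.
S <- diag(1 / sqrt(d)) %*% A %*% diag(1 / sqrt(d))
ev <- eigen(S, symmetric = TRUE, only.values = TRUE)$values
l <- 1 / min(ev)
u <- 1 / max(ev)

# Illustrative parameter values (assumptions, not estimates)
rho <- 0.7
tau <- 2
stopifnot(rho > l, rho < u)

# WCAR-style M: diagonal entries tau^2 / d_i (assumed variance scale)
M <- diag(tau^2 / d)
Sigma_z <- solve(diag(4) - rho * C) %*% M

# A valid covariance matrix is symmetric and positive definite
is_sym <- isSymmetric(Sigma_z, tol = 1e-8)
is_pd <- all(eigen(Sigma_z, symmetric = TRUE, only.values = TRUE)$values > 0)
print(c(symmetric = is_sym, positive_definite = is_pd))
```

Dividing by the number of neighbors in M is what keeps the product symmetric even though the row-standardized C is not.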

For non-spatial ME models, the following is used instead:

x \sim Gauss(z, s^2)

z \sim Student(\nu_z, \mu_z, \sigma_z)

\nu_z \sim gamma(3, 0.2)

\mu_z \sim Gauss(0, 100)

\sigma_z \sim Student(10, 0, 40)

For strongly skewed variables, such as census tract poverty rates, it can be advantageous to apply a logit transformation to z before applying the CAR or Student-t prior model. When the logit argument is used, the first two lines of the model specification become:

x \sim Gauss(z, s^2)

logit(z) \sim Gauss(\mu_z, \Sigma_z)

and similarly for the Student t model:

x \sim Gauss(z, s^2)

logit(z) \sim Student(\nu_z, \mu_z, \sigma_z)

Missing data

For most geostan models, missing (NA) observations are allowed in the outcome variable. However, there cannot be any missing covariate data. Models that can handle missing outcome data are any Poisson or binomial model (GLM, SAR, CAR, ESF, ICAR), as well as all GLMs and ESF models. The only models that cannot handle missing outcome data are the CAR and SAR models when the outcome is a continuous variable (auto-normal/Gaussian models).

When observations are missing, they will simply be ignored when calculating the likelihood in the MCMC sampling process (reflecting the absence of information). The estimated model parameters (including any covariates and spatial trend) will then be used to produce estimates or fitted values for the missing observations. The fitted and posterior_predict functions will work as normal in this case, and return values for all rows in your data.
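The idea that missing outcomes drop out of the likelihood while still receiving fitted values can be sketched in base R. The counts and fitted means below are hypothetical, and this illustrates the principle only; it is not geostan's Stan code.

```r
# Hypothetical counts with two missing outcomes, and fitted means for all rows
y  <- c(3, NA, 7, 1, NA, 4)
mu <- c(2.5, 4.0, 6.0, 1.5, 3.0, 4.5)

# Only the observed outcomes enter the log-likelihood
obs <- which(!is.na(y))
log_lik <- sum(dpois(y[obs], lambda = mu[obs], log = TRUE))
print(log_lik)

# Fitted values remain available for every row, including the missing ones
fitted_all <- mu
print(data.frame(y = y, fitted = fitted_all))
```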

Censored counts

Vital statistics systems and disease surveillance programs typically suppress case counts when they are smaller than a specific threshold value. In such cases, the observation of a censored count is not the same as a missing value; instead, you are informed that the value is an integer somewhere between zero and the threshold value. For Poisson models (family = poisson()), you can use the censor_point argument to encode this information into your model.

Internally, geostan will keep the index values of each censored observation, and the index value of each of the fully observed outcome values. For all observed counts, the likelihood statement will be:

p(y_i | data, model) = Poisson(y_i | \mu_i),

as usual, where \mu_i may include whatever spatial terms are present in the model.

For each censored count, the likelihood statement will equal the cumulative Poisson distribution function for values zero through the censor point:

p(y_i | data, model) = \sum_{m=0}^{M} Poisson( m | \mu_i),

where M is the censor point and \mu_i again is the fitted value for the i^{th} observation.

For example, the US Centers for Disease Control and Prevention's CDC WONDER database censors all death counts between 0 and 9. To model CDC WONDER mortality data, you could provide censor_point = 9 and then the likelihood statement for censored counts would equal the summation of the Poisson probability mass function over each integer ranging from zero through 9 (inclusive), conditional on the fitted values (i.e., all model parameters). See Donegan (2021) for additional discussion, references, and Stan code.
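The two likelihood statements can be checked numerically with a short base-R sketch. The counts and fitted means are hypothetical, and marking censored counts with NA is an assumption of this sketch; the point is only that the censored contribution is the Poisson cumulative distribution function evaluated at the censor point.

```r
censor_point <- 9

# Hypothetical outcomes; NA marks a censored (suppressed) count in this sketch
y  <- c(12, NA, 30, NA, 15)
mu <- c(10, 4, 28, 7, 16)   # hypothetical fitted values

cens <- is.na(y)

# Fully observed counts contribute the Poisson probability mass function
ll_obs <- dpois(y[!cens], lambda = mu[!cens], log = TRUE)

# Censored counts contribute the Poisson CDF at the censor point
ll_cens <- ppois(censor_point, lambda = mu[cens], log.p = TRUE)

# The CDF equals the explicit sum of the pmf over 0, 1, ..., censor_point
explicit <- sapply(mu[cens], function(m) log(sum(dpois(0:censor_point, m))))
print(all.equal(ll_cens, explicit))

# Total log-likelihood across observed and censored counts
log_lik <- sum(ll_obs) + sum(ll_cens)
print(log_lik)
```

A censored area with a small fitted mean contributes a probability near one, while a large fitted mean is penalized because counts of nine or fewer would then be unlikely.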

Value

An object of class geostan_fit (a list) containing:

summary: Summaries of the main parameters of interest; a data frame
diagnostic: Residual spatial autocorrelation as measured by the Moran coefficient.
stanfit: an object of class stanfit returned by rstan::stan
data: a data frame containing the model data
family: the user-provided or default family argument used to fit the model
formula: The model formula provided by the user (not including ESF component)
slx: The slx formula
C: The spatial weights matrix, if one was provided by the user.
re: A list containing re, the random effects (varying intercepts) formula if provided, and Data, a data frame with columns id, the grouping variable, and idx, the index values assigned to each group.
priors: Prior specifications.
x_center: If covariates are centered internally (centerx = TRUE), then x_center is a numeric vector of the values on which covariates were centered.
ME: The ME data list, if one was provided by the user for measurement error models.
spatial: NA, slot is maintained for use in geostan_fit methods.

Author(s)

Connor Donegan, connor.donegan@gmail.com

Source

Donegan, Connor and Chun, Yongwan and Griffith, Daniel A. (2021). Modeling community health with areal data: Bayesian inference with survey standard errors and spatial structure. Int. J. Env. Res. and Public Health 18 (13): 6856. DOI: 10.3390/ijerph18136856 Data and code: https://github.com/ConnorDonegan/survey-HBM.

Donegan, Connor (2021). Building spatial conditional autoregressive (CAR) models in the Stan programming language. OSF Preprints. DOI: 10.31219/osf.io/3ey65

Examples

##
## Linear regression model
##

N = 100
x <- rnorm(N)
y <- .5 * x + rnorm(N)
dat <- cbind(y, x)

# no. of MCMC samples
iter = 600

# fit model
fit <- stan_glm(y ~ x, data = dat, iter = iter, quiet = TRUE)

# see results with MCMC diagnostics
print(fit)

##
## Custom prior distributions
##

PL <- list(
      intercept = normal(0, 1),
      beta = normal(0, 1),
      sigma = student_t(10, 0, 2)
)

fit2 <- stan_glm(y ~ x, data = dat, prior = PL, iter = iter,
                quiet = TRUE)

print(fit2)

# example prior for two covariates
pl <- list(beta = normal(c(0, 0),
                         c(1, 1))
           )

##
## Poisson model for count data
## with county 'random effects' 
##

data(sentencing)

# note: 'name' is county identifier
head(sentencing)

# denominator in standardized rate Y/E
# (observed count Y over expected count E)
# (use the log-denominator as the offset term)
sentencing$log_e <- log(sentencing$expected_sents)

# fit model
fit.pois <- stan_glm(sents ~ offset(log_e),
                     re = ~ name,
                     family = poisson(),
                     data = sentencing,
                     iter = iter, quiet = TRUE)

# Spatial autocorrelation/residual diagnostics
sp_diag(fit.pois, sentencing)

# summary of results with MCMC diagnostics
print(fit.pois)


# MCMC diagnostics plot: Rhat values should all be very near 1
rstan::stan_rhat(fit.pois$stanfit)


# effective sample size for all parameters and generated quantities
# (including residuals, predicted values, etc.)
rstan::stan_ess(fit.pois$stanfit)

# or for a particular parameter
rstan::stan_ess(fit.pois$stanfit, "alpha_re")


##
## Visualize the posterior predictive distribution
##

# plot observed values and model replicate values
yrep <- posterior_predict(fit.pois, S = 65)
y <- sentencing$sents
ltgray <- rgb(0.3, 0.3, 0.3, 0.5)

plot(density(yrep[1,]), col = ltgray,
     ylim = c(0, 0.014), xlim = c(0, 700),
     bty = 'L', xlab = NA, main = NA)

for (i in 2:nrow(yrep)) lines(density(yrep[i,]), col = ltgray)

lines(density(sentencing$sents), col = "darkred", lwd = 2)

legend("topright", legend = c('Y-observed', 'Y-replicate'),
       col = c('darkred', ltgray), lwd = c(1.5, 1.5))

# plot replicates of Y/E
E <- sentencing$expected_sents

# set plot margins
old_pars <- par(mar=c(2.5, 3.5, 1, 1))

# plot yrep
plot(density(yrep[1,] / E), col = ltgray,
    ylim = c(0, 0.9), xlim = c(0, 7),
    bty = 'L', xlab = NA, ylab = NA, main = NA)

for (i in 2:nrow(yrep)) lines(density(yrep[i,] / E), col = ltgray)

# overlay y
lines(density(sentencing$sents / E), col = "darkred", lwd = 2)

# legend, y-axis label
legend("topright", legend = c('Y-observed', 'Y-replicate'),
      col = c('darkred', ltgray), lwd = c(1.5, 1.5))

mtext(side = 2, text = "Density", line = 2.5)

# return margins to previous settings
par(old_pars)

geostan documentation built on April 3, 2025, 10:04 p.m.
