There are two ways to fit a GLM using Stan.

  1. Use rstanarm::stan_glm, which fits your data using a precompiled Stan model (under the hood, via rstan).
  2. Write the Stan code for the model yourself and fit it using rstan::stan.

We will work through both examples here.

Using rstanarm

First let's select a dataset to which we want to fit our model.

data("mtcars")

Let's decide on the model we want to estimate. For now we will fit a fairly generic model, predicting mpg (scaled by 10) from every other variable in the dataset.

library(rstanarm)
fit1 <- rstanarm::stan_glm(mpg / 10 ~ .,
                           data = mtcars)
print(fit1)

Here we fit the model with rstanarm's default priors, which place weakly informative normal priors on the covariate coefficients, rescaled by default according to the scale of each predictor.
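We can inspect exactly which priors were used with prior_summary:

rstanarm::prior_summary(fit1)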

In the rstanarm framework, we can modify the priors on the covariates, either to use a particular prior distribution (e.g. student_t or cauchy) or to encourage greater sparsity (e.g. hs() or hs_plus()). See help('priors', package = 'rstanarm') for a description of the available prior distributions.

fit2 <- rstanarm::stan_glm(mpg / 10 ~ .,
                           data = mtcars,
                           prior = cauchy(),
                           adapt_delta = 0.999,
                           iter = 4000
                           )
print(fit2)

For linear models, we also have the option of placing a more global prior on the coefficient parameters, through a neat math trick which allows us to put a prior on R2. This reflects our overall confidence in the ability of the model to explain our outcome.

fit3 <- rstanarm::stan_lm(mpg / 10 ~ .,
                          data = mtcars,
                          prior = R2(location = 0.3, what = 'mode'),
                          adapt_delta = 0.999
                          )
print(fit3)

At the end of the day, the decision of how to select priors can be based on substantive knowledge (from previous experiments or domain experience) or on model performance (e.g. cross-validation performance).

Evaluating posterior fits

Now, let's assume all three fits sampled well and that we have good posterior estimates for our parameters.

The first thing we might want to do is compare the posterior estimates across the three fits.

library(bayesplot)
plot1 <- bayesplot::mcmc_dens_overlay(as.array(fit1),
                                      facet_args = list(ncol = 1),
                                      pars = c('cyl', 'wt'))
plot2 <- bayesplot::mcmc_dens_overlay(as.array(fit2),
                                      facet_args = list(ncol = 1),
                                      pars = c('cyl', 'wt'))
plot3 <- bayesplot::mcmc_dens_overlay(as.array(fit3),
                                      facet_args = list(ncol = 1),
                                      pars = c('cyl', 'wt'))
gridExtra::grid.arrange(plot1, plot2, plot3, ncol = 3)

These priors have very little impact on our estimated coefficient values.
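We can also check this numerically by comparing point estimates (posterior medians) across the three fits; a quick sketch:

rbind(fit1 = coef(fit1)[c('cyl', 'wt')],
      fit2 = coef(fit2)[c('cyl', 'wt')],
      fit3 = coef(fit3)[c('cyl', 'wt')])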

Model comparison using loo

We can compare these models using approximate leave-one-out cross-validation (LOO).

library(loo)
loo::compare(loo(fit1, k_threshold = 0.7),
             loo(fit2, k_threshold = 0.7),
             loo(fit3, k_threshold = 0.7))

When comparing 3 models, we get back a ranking of the models in terms of their estimated predictive performance on our data (expected log predictive density; higher is better). When comparing 2 models, we get the difference in expected log predictive density along with its standard error; a positive result indicates that the second model is a better fit than the first.

Note that we can also plot the loo object.
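For example, plotting the loo object shows the Pareto k diagnostic for each observation, which flags points where the approximation may be unreliable:

loo1 <- loo(fit1, k_threshold = 0.7)
plot(loo1)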

Fitting a logistic GLM

Let's try fitting a logistic regression model. This example is borrowed from the rstanarm binomial vignette (see vignette("binomial", package = 'rstanarm')).

data(wells)
wells$dist100 <- wells$dist / 100
bfit <- stan_glm(
   switch ~ dist100 + arsenic, 
   data = wells,
   family = binomial(link = "logit"),
   prior_intercept = normal(0, 10),
   QR = TRUE,
   chains = 2,
   iter = 2000
)
print(bfit)
prior_summary(bfit)
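As a quick check of the fit (not part of the original vignette), we can compare the observed outcomes with replicated data drawn from the posterior predictive distribution:

rstanarm::pp_check(bfit)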

Using Stan code to fit the GLM model

file_path <- file.path('vignettes', 'stan_glm.stan')
cat(readLines(file_path), sep = '\n')

Stan code is broken into several blocks --

  1. Data block: defines the inputs expected by the model.
  2. Parameters block: defines the parameters to be sampled.
  3. Transformed parameters block: defines deterministic transformations of the parameters. (Note that Stan does not apply Jacobian adjustments automatically; if you put a prior on a nonlinear transform of a parameter, you must add the adjustment to the target density yourself.)
  4. Model block: defines the priors and the likelihood, which together determine the log posterior density (up to a constant).

There are other blocks (functions & generated quantities) but we are not using them here.
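Since the file's contents are not reproduced here, the following is a hypothetical sketch of a linear-regression Stan program consistent with the data list we prepare below (the names N, M, x, and y match that list; the priors and the program itself are assumptions, not the repository's actual stan_glm.stan):

# Hypothetical sketch -- not the actual vignettes/stan_glm.stan file.
stan_code <- "
data {
  int<lower=0> N;              // number of observations
  int<lower=0> M;              // number of columns in the design matrix
  matrix[N, M] x;              // design matrix (includes the intercept column)
  vector[N] y;                 // outcome
}
parameters {
  vector[M] beta;              // regression coefficients
  real<lower=0> sigma;         // residual standard deviation
}
model {
  beta ~ normal(0, 5);         // weakly informative prior (an assumption)
  sigma ~ cauchy(0, 2.5);      // half-Cauchy, via the lower bound on sigma
  y ~ normal(x * beta, sigma); // linear regression likelihood
}
"
# A program like this could also be fit by passing the string directly,
# via rstan::stan(model_code = stan_code, data = stan_data).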

Fit the model to our data

In order to fit this model to our data using rstan::stan, we first need to prepare a named list providing the data to the Stan model.

# fit with glm so we can conveniently extract the design matrix (x = TRUE) and response
ols_estimate <- glm(mpg / 10 ~ ., data = mtcars, x = TRUE)
x <- ols_estimate$x
y <- ols_estimate$y

stan_data <- list(
  N = length(y),
  M = ncol(x),
  x = x,
  y = as.array(y)
)
str(stan_data)

Now we are ready to fit our model to our data.

stan_fit <- rstan::stan(file = file_path, data = stan_data)

Summarizing the fit object

The final step is to summarize the fit object.

print(stan_fit, pars = c('beta', 'sigma'))

You will notice that this summary is not "pretty" -- i.e., our beta coefficients are not labelled with the covariate names, as our results from rstanarm were.

We do, however, have access to the x-coefficient names.

ols_estimate$coefficients

And, we can still use the bayesplot & loo package functionality, although to use loo we will need to compute the log-likelihood contribution of each observed value.
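The usual approach (a sketch, assuming the Stan program is extended with a generated quantities block that stores a log_lik vector -- the file shown above may not have one) is:

# Assumes the Stan program gains a generated quantities block such as:
#   generated quantities {
#     vector[N] log_lik;
#     for (n in 1:N)
#       log_lik[n] = normal_lpdf(y[n] | x[n] * beta, sigma);
#   }
log_lik <- loo::extract_log_lik(stan_fit, parameter_name = 'log_lik')
loo::loo(log_lik)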

Here's an example of plotting our beta-estimates from this model fit.

bayesplot::mcmc_areas(as.array(stan_fit), regex_pars = 'beta.*')
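To attach the covariate names to such a plot, one option (a sketch, assuming the default dimnames layout of the array returned for a stanfit object) is to relabel the parameters dimension using the names recovered above:

post <- as.array(stan_fit, pars = 'beta')
dimnames(post)$parameters <- names(ols_estimate$coefficients)
bayesplot::mcmc_areas(post)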

It's also worth calling out the excellent shinystan package, which provides an interactive GUI for exploring a fitted model.

if (interactive())
  shinystan::launch_shinystan(fit1)

