knitr::opts_chunk$set(collapse = T)

This document gives an end to end example of using the rstansim package to carry out a small simulation study using stan. Itâ€™s recommended as a first point of call to understand how the package works, but does not provided detailed documentation of all functionality.

After reading through this example, the below documents can provide further detailed documentation on specific elements:

Beyond that function documentation can be found here.

Any simulation study needs to start with a question about the behaviour of certain models/estimators under specific conditions. In this case, our question comes straight from the stan manual.

In the section on regression models, the first example of a multiple regression model in stan is a straightforward vectorised implementation, shown below.

```{R, engine = 'cat', engine.opts = list(file = "simple_reg_model.stan", lang = "stan")}
// saved as simple_reg_model.stan
data {
int

However immediately after this is introduced another way to format a multiple regression, named the QR Reparameterization, is introduced. Again this is shown below. ```{R, engine = 'cat', engine.opts = list(file = "qr_reg_model.stan", lang = "stan")} // saved as qr_reg_model.stan data { int<lower=0> N; // number of data items int<lower=0> K; // number of predictors matrix[N, K] x; // predictor matrix vector[N] y; // outcome vector } transformed data { matrix[N, K] Q_ast; matrix[K, K] R_ast; matrix[K, K] R_ast_inverse; // thin and scale the QR decomposition Q_ast = qr_Q(x)[, 1:K] * sqrt(N - 1); R_ast = qr_R(x)[1:K, ] / sqrt(N - 1); R_ast_inverse = inverse(R_ast); } parameters { real alpha; // intercept vector[K] theta; // coefficients on Q_ast real<lower=0> sigma; // error scale } model { y ~ normal(Q_ast * theta + alpha, sigma); // likelihood } generated quantities { vector[K] beta; beta = R_ast_inverse * theta; // coefficients on x }

The stan manual provides more detail on the derivation of the QR parameterization, and the reasons for it's efficiency. In short the reasons given to use the QR model are that it:

- Gives equivalent posterior parameter estimates to the original model
- Can decrease the total elapsed time to fit a model.
- Can improve the effective sample size (ESS) for sampled parameter chains.

This small simulation study will explore the above three advantages to see if they hold in the case of small to medium predictor correlations, and to assess the magnitude of any observed differences between the two models.

In order to test our above hypotheses we'll be simulating data two generative models. The models only differ in their parameter values and so can both be simulaetd using a single stan model (if there were structural differences then they would require different models).

One of the advantages of using rstansim to generate data for a simulation study is that it lets you use a stan model to specify the data, this gives you a huge degree of flexibility in how the simulated data will look, and can even let you draw your posterior samples using the same code that generated your data. For clarity here though we've generated a small stan model purely for the purposes of generating our simulated data, it's shown below.

```{R, engine = 'cat', engine.opts = list(file = "sim_data_model.stan", lang = "stan")}
// saved as sim_data_model.stan
data {
int

cov_omega = quad_form_diag(omega, rep_vector(1, K)); } generated quantities { matrix[N, K] sim_x; // simulated predictor data vector[N] sim_y; // simulated outcome vector

for(i in 1:N) sim_x[i,] = multi_normal_rng(rep_vector(0, K), cov_omega)'; for(i in 1:N) sim_y[i] = normal_rng(alpha + sim_x[i] * beta, sigma)'; }

This model is cut down to only be used for simulating the desired data. The important points to note are: 1. Any data or parameters that you wish to vary in order to produce different simulated datasets are entered in the `data` or `parameter` blocks, as with a typical stan model. 2. All variables to be simulated are specified in the `generated quantities` block, using stan's built in random number generators. 3. Note that the simulated variables have their names prepended with `sim_`, this is good practice to differentiate simulated quantities in the model from user provided values. By default this `sim_` will be dropped from the saved dataset so that the data can easily be used as model input (e.g. `sim_x` becomes just `x`). The `simulate_data()` function can take this model, along with data and parameter values to use, and use them to simulate multiple datasets containing the relevant data. When called `simulate_data()` has two effects; it saves all simulated datasets to a specified directory, and it returns an object of class `stansim_data` which acts as a 'ticket' for the saved datasets; providing summary information and able to be fed directly in to the `fit_models()` function below, simplifying the workflow. For the purposes of our simulation we want to simulate 100 datasets, 50 of which will have uncorrelated `x` columns, and 50 of which will have correlations varying between 0 and 0.4. It's expected that the benefits of the QR model will be greater for the correlated predictors datasets. The input data and parameters for these are specified below. ```{R} # set up independent covariance matrix omega_uncorr <- diag(5) omega_uncorr # set up correlated covariance matrix omega_corr <- matrix(rbind(c(1, 0, .2, .3, .4), c(0, 1, .3, .4, .2), c(.2, .3, 1, 0, .4), c(.3, .4, 0, 1, .2), c(.4, .2, .4, .2, 1)), 5, 5) omega_corr regression_input_data <- list("N" = 1000, "K" = 5) regression_uncorr_params <- list("alpha" = 5, "beta" = c(0, 0, .8, .6, .4), "sigma" = 1, "omega" = omega_uncorr) regression_corr_params <-list("alpha" = 5, "beta" = c(0, 0, .8, .6, .4), "sigma" = 1, "omega" = omega_corr)

With our input values set up we can now simulate data easily.

```{R, results = "hide", message = F, warning = F} library(rstansim)

uncorrelated_data <- simulate_data( file = "sim_data_model.stan", data_name = "uncorrelated multi-regression", input_data = regression_input_data, param_values = regression_uncorr_params, vars = c("N", "K", "sim_x", "sim_y"), nsim = 50, path = "reg_data/uncorrelated", seed = 1234 )

correlated_data <- simulate_data( file = "sim_data_model.stan", data_name = "correlated multi-regression", input_data = regression_input_data, param_values = regression_corr_params, vars = c("N", "K", "sim_x", "sim_y"), nsim = 50, path = "reg_data/correlated", seed = 1234 )

Looking in our working directory shows that the specified folders have been created and filled with our simulated datasets. ```{R} # see folders inside reg_data/ dir("reg_data") # see first five files in reg_data/correlated/ dir("reg_data/correlated")[1:5]

For a quick check that things have run as expected we'll take a look at a given simuldated data file.

str(readRDS("reg_data/correlated/correlated multi-regression_1.rds"))

It's good practice to check that the simulated data returns sensible parameter estimates when fitted to a single model before investing the time and resources into fitting models to every dataset.

```{R, results = "hide", message = F, warning = FALSE}

test_data <- readRDS("reg_data/uncorrelated/uncorrelated multi-regression_1.rds")

test_fit <- rstan::stan(file = "simple_reg_model.stan", data = test_data)

```{R} rstan::summary(test_fit)$summary

These values are all in the region we'd expect. The `fit_models()`

function is used to fit a given stan model to multiple datasets, we can use the `stansim_data`

object produced above as input pointing to our simulated datasets. Our study questions has a 2x2 design so fit_models needs called 4 times.

```{R, echo=FALSE, results='asis'} stansim_output1 <- readRDS("stansim_output1.rds") stansim_output2 <- readRDS("stansim_output2.rds") stansim_output3 <- readRDS("stansim_output3.rds") stansim_output4 <- readRDS("stansim_output4.rds")

```{R, eval = F} # fit simple model to uncorrelated data stansim_output1 <- fit_models(sim_name = "simple - uncorrelated simulation", sim_data = uncorrelated_data, stan_args = list(file = "simple_reg_model.stan"), seed = 1234) # fit simple model to correlated data stansim_output2 <- fit_models(sim_name = "simple - correlated simulation", sim_data = correlated_data, stan_args = list(file = "simple_reg_model.stan"), seed = 1234) # fit QR model to uncorrelated data stansim_output3 <- fit_models(sim_name = "QR - uncorrelated simulation", sim_data = uncorrelated_data, stan_args = list(file = "qr_reg_model.stan"), seed = 1234) # fit QR model to correlated data stansim_output4 <- fit_models(sim_name = "QR - correlated simulation", sim_data = correlated_data, stan_args = list(file = "qr_reg_model.stan"), seed = 1234)

Given that a realistic simulation study may have many combinations of model and data to fit, it's likely that a large number of simulation objects will be created. To ease the analysis of results (and to make storage simpler) it's recommended that you collect these objects into one. All data held by the individual objects is retained.

reg_study_col <- collect_simulations( collection_name = "regression study collection", stansim_output1, stansim_output2, stansim_output3, stansim_output4)

Now that our simulations have been ran and collected we can easily explore the data across all models and datasets. A good first check is that no datasets have an R-hat value greater than 1.1 (a necessary but not sufficient condition for convergence).

extract_data( reg_study_col, estimates = "Rhat", values = function(x) x > 1.1 )

This is a good first sign and suggests that our models have converged.

The first of the hypotheses to be tested is that the simple regression model and the QR reparamatization produce equivalent parameter estimates. This is checked for both datasets below.

```{R, message = F, warning = F, fig.width=7}

correlated_beta_estimates <- extract_data( reg_study_col, sim_names = c("QR - correlated simulation", "simple - correlated simulation"), parameters = c("beta"), estimates = "mean")

uncorrelated_beta_estimates <- extract_data( reg_study_col, sim_names = c("QR - uncorrelated simulation", "simple - uncorrelated simulation"), parameters = c("beta"), estimates = "mean")

library(ggplot2) library(ggjoy)

ggplot(correlated_beta_estimates, aes(x=value, y = parameter)) + geom_joy() + facet_wrap(~sim_name)

ggplot(uncorrelated_beta_estimates, aes(x=value, y = parameter)) + geom_joy() + facet_wrap(~sim_name)

We can see that the mean beta estimates are essentially identical over the two models, for both correlated values predictor variables and uncorrelated. In practice, you would want to run similar analysis across all parameters and estimates of relevance. Tail quantiles for example are less robust than mean estimates and so might show differences in chain behaviour that the above charts do not. The second hypothesis concerned the total time taken to fit a model, at a very high level we can check this by extracting the total time taken to carry out each `fit_models()` call. Elapsed run times can be extracted from simulation or collection objects with `extract_time_elapsed()`. ```{R, message = F, warning = F, fig.width=7} # extract total times (warmup and sampling) for correlated datasets correlated_time <- extract_time_elapsed( reg_study_col, sim_names = c("QR - correlated simulation", "simple - correlated simulation"), stages = "total" ) # extract total times (warmup and sampling) for correlated datasets uncorrelated_time <- extract_time_elapsed( reg_study_col, sim_names = c("QR - uncorrelated simulation", "simple - uncorrelated simulation"), stages = "total" ) # plot total elapsed time for correlated datasets ggplot(correlated_time, aes(x=elapsed, y = sim_name)) + geom_joy() # plot total elapsed time for uncorrelated datasets ggplot(uncorrelated_time, aes(x=elapsed, y = sim_name)) + geom_joy()

When concerned with total run time, QR models tended to perform slightly quicker than the simple model when predictor variables were correlated, and slightly slower in cases where the predictors were uncorrelated. In both cases the QR model had a longer right tail, showing slightly less stability in the total time taken to fit the models.

Total time taken can be misleading when dealing with MCMC samples as it doesn't capture the amount of sampling information captured, which can vary a lot from model to model. It's easy to extract the number of effective samples for all four fitted models and compare them by parameter.

```{R, message = F, warning = F, fig.width=7, fig.height = 7}

n_eff_correlated <- extract_data(reg_study_col, sim_names = c("QR - correlated simulation", "simple - correlated simulation"), parameters = c("beta", "sigma", "lp__"), estimates = "n_eff")

n_eff_uncorrelated <- extract_data(reg_study_col, sim_names = c("QR - uncorrelated simulation", "simple - uncorrelated simulation"), parameters = c("beta", "sigma", "lp__"), estimates = "n_eff")

ggplot(n_eff_correlated, aes(x = value, y = sim_name)) + geom_jitter() + facet_wrap(~parameter)

ggplot(n_eff_uncorrelated, aes(x = value, y = sim_name)) + geom_jitter() + facet_wrap(~parameter)

The results are close to identical for both models, with all shared parameters showing a near perfect 4000 effective samples per model (4 chains * 1000 samples) with the exception of the log posterior. The only noticable difference between models is a very slightly higher number of effective samples of the log posterior for the QR model when fitting to datasets with correlated predictor variables. ## Conclusions The data and models used here are intended primarily as an example of rstansim functionality, not as a rigourous exploration of the QR decomposition approach to Bayesian regression. Such a study would explore a wider range of conditions, especially higher correlations between predictor variable [where the difference between the straightforward and QR models becomes very apparent](http://mc-stan.org/users/documentation/case-studies/qr_regression.html). But at a minimum these results show that the QR regression is not a requirement for good sampling behaviour when the sample size is high (N = 1000) and the predictor correlations small (< .4). ```r ## clean up all created files from vignetes/ # remove stan models if(file.exists("simple_reg_model.stan")) unlink("simple_reg_model.stan") if(file.exists("qr_reg_model.stan")) unlink("qr_reg_model.stan") if(file.exists("sim_data_model.stan")) unlink("sim_data_model.stan") # if simulated data dir has been created, delete it if(dir.exists("reg_data")) unlink("reg_data", recursive = T)

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.