This vignette provides the example used as Exercise 3 of the SBC tutorial presented at SBC StanConnect. Feel free to head to the tutorial website to get an interactive version and solve the problems yourself.
Let's setup the environment.
library(SBC); library(ggplot2) use_cmdstanr <- getOption("SBC.vignettes_cmdstanr", TRUE) # Set to false to use rstan instead if(use_cmdstanr) { library(cmdstanr) } else { library(rstan) rstan_options(auto_write = TRUE) } # Setup caching of results if(use_cmdstanr) { cache_dir <- "./_indexing_SBC_cache" } else { cache_dir <- "./_indexing_rstan_SBC_cache" } if(!dir.exists(cache_dir)) { dir.create(cache_dir) }
Below are three different Stan codes for implementing a simple linear regression. Not all of them are correct - can you see which are wrong?
cat(readLines("stan/regression1.stan"), sep = "\n")
cat(readLines("stan/regression2.stan"), sep = "\n")
cat(readLines("stan/regression3.stan"), sep = "\n")
If you can, good for you! If not, don't worry, SBC can help you spot coding problems, so let's simulate data and test all three models against simulated data.
First we'll build backends using the individual models.
if(use_cmdstanr) { model_regression_1 <- cmdstan_model("stan/regression1.stan") model_regression_2 <- cmdstan_model("stan/regression2.stan") model_regression_3 <- cmdstan_model("stan/regression3.stan") backend_regression_1 <- SBC_backend_cmdstan_sample(model_regression_1, iter_warmup = 400, iter_sampling = 500) backend_regression_2 <- SBC_backend_cmdstan_sample(model_regression_2, iter_warmup = 400, iter_sampling = 500) backend_regression_3 <- SBC_backend_cmdstan_sample(model_regression_3, iter_warmup = 400, iter_sampling = 500) } else { model_regression_1 <- stan_model("stan/regression1.stan") model_regression_2 <- stan_model("stan/regression2.stan") model_regression_3 <- stan_model("stan/regression3.stan") backend_regression_1 <- SBC_backend_rstan_sample(model_regression_1, iter = 900, warmup = 400) backend_regression_2 <- SBC_backend_rstan_sample(model_regression_2, iter = 900, warmup = 400) backend_regression_3 <- SBC_backend_rstan_sample(model_regression_3, iter = 900, warmup = 400) }
Then we'll write a function that generates data. We write it in the most simple way to reduce the possibility that we make an error. We also don't really need to worry about performance here.
single_sim_regression <- function(N, K) { x <- matrix(rnorm(n = N * K, mean = 0, sd = 1), nrow = N, ncol = K) alpha <- rnorm(n = 1, mean = 0, sd = 1) beta <- rnorm(n = K, mean = 0, sd = 1) sigma <- abs(rnorm(n = 1, mean = 0, sd = 2)) y <- array(NA_real_, N) for(n in 1:N) { mu <- alpha for(k in 1:K) { mu <- mu + x[n,k] * beta[k] } y[n] <- rnorm(n = 1, mean = mu, sd = sigma) } list( variables = list( alpha = alpha, beta = beta, sigma = sigma), generated = list( N = N, K = K, x = x, y = y ) ) }
We'll start with just 10 simulations to get a quick computation - this will still let us see big problems (but not subtle issues)
set.seed(5666024) datasets_regression <- generate_datasets( SBC_generator_function(single_sim_regression, N = 100, K = 2), 10)
Now we can use all of the backends to fit the generated datasets.
results_regression_1 <- compute_SBC(datasets_regression, backend_regression_1, cache_mode = "results", cache_location = file.path(cache_dir, "regression1")) results_regression_2 <- compute_SBC(datasets_regression, backend_regression_2, cache_mode = "results", cache_location = file.path(cache_dir, "regression2")) results_regression_3 <- compute_SBC(datasets_regression, backend_regression_3, cache_mode = "results", cache_location = file.path(cache_dir, "regression3"))
Here we also use the caching feature to avoid recomputing the fits when recompiling this vignette. In practice, caching is not necessary but is often useful.
plot_ecdf_diff(results_regression_1) plot_rank_hist(results_regression_1)
As far as a quick SBC can see the first code is OK. You could verify further with more iterations but we tested the model for you and it is OK (although the implementation is not the best one).
plot_ecdf_diff(results_regression_2) plot_rank_hist(results_regression_2)
But the second model is actually not looking good. In fact there is an indexing bug. The problem is the line
mu[i] += beta[j] * x[j, j];
which should have x[i, j]
instead. We see that this
propagates most strongly to the sigma
variable (reusing the same x
element leads to more similar predictions for each row, so sigma
needs to be inflated to accommodate this)
plot_ecdf_diff(results_regression_3) plot_rank_hist(results_regression_3)
And the third model looks OK once again - and in fact we are pretty sure it is also completely correct.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.