This vignette provides the example used as Exercise 2 of the SBC tutorial presented at SBC StanConnect. Feel free to head to the tutorial website to get an interactive version and solve the problems yourself.
Let's setup the environment:
library(SBC); use_cmdstanr <- getOption("SBC.vignettes_cmdstanr", TRUE) # Set to false to use rstan instead if(use_cmdstanr) { library(cmdstanr) } else { library(rstan) rstan_options(auto_write = TRUE) } options(mc.cores = parallel::detectCores()) # Uncomment below to have the fits evaluated in parallel # However, as this example evaluates just a few fits, it # is usually not worth the overhead. # library(future) # plan(multisession) # Setup caching of results if(use_cmdstanr) { cache_dir <- "./_bad_parametrization_SBC_cache" } else { cache_dir <- "./_bad_parametrization_rstan_SBC_cache" } if(!dir.exists(cache_dir)) { dir.create(cache_dir) }
Premise: we mistakenly assume that Stan does parametrize the Gamma distribution via shape and scale, when in fact Stan uses the shape and rate parametrization (see Gamma distribution on Wikipedia for details on the parametrizations.
cat(readLines("stan/bad_parametrization1.stan"), sep = "\n")
iter_warmup <- 1000 iter_sampling <- 1000 if(use_cmdstanr) { model_gamma <- cmdstan_model("stan/bad_parametrization1.stan") backend_gamma <- SBC_backend_cmdstan_sample( model_gamma, iter_warmup = iter_warmup, iter_sampling = iter_sampling, chains = 2) } else { model_gamma <- stan_model("stan/bad_parametrization1.stan") backend_gamma <- SBC_backend_rstan_sample( model_gamma, iter = iter_sampling + iter_warmup, warmup = iter_warmup, chains = 2) }
Build a generator to create simulated datasets.
set.seed(21448857) n_sims <- 10 single_sim_gamma <- function(N) { shape <- rlnorm(n = 1, meanlog = 0, sdlog = 1) scale <- rlnorm(n = 1, meanlog = 0, sdlog = 1.5) y <- rgamma(N, shape = shape, scale = scale) list( variables = list( shape = shape, scale = scale), generated = list( N = N, y = y) ) } generator_gamma <- SBC_generator_function(single_sim_gamma, N = 40) datasets_gamma <- generate_datasets( generator_gamma, n_sims)
results_gamma <- compute_SBC(datasets_gamma, backend_gamma, cache_mode = "results", cache_location = file.path(cache_dir, "model1"))
Here we also use the caching feature to avoid recomputing the fits when recompiling this vignette. In practice, caching is not necessary but is often useful.
10 simulations are enough to see something is wrong with the model. The problem is best seen on an ecdf_diff
plot - we even see the issue is primarily with the scale
variable!
plot_ecdf_diff(results_gamma)
However, we can use the rank histogram (with suitable number of bins) and the ecdf
plot to show different visualisations of the same problem. The rank histogram often tends to be intuitively more understandable than the ecdf
plots, but tweaking the number of bins is often necessary and the confidence interval is only approximate and has decreased sensitivity.
plot_rank_hist(results_gamma, bins = 10) plot_ecdf(results_gamma)
So we see that the simulation does not match the model. In practice, the problem may lie with the simulation, with the model or both. Here, we'll assume that the simulation is correct - we really wanted to work with scale and fix the model to match. I.e. we still represent scale in our model, but invert it to get rate before using Stan's gamma
distribution:
cat(readLines("stan/bad_parametrization2.stan"), sep = "\n")
if(use_cmdstanr) { model_gamma_2 <- cmdstan_model("stan/bad_parametrization2.stan") backend_gamma_2 <- SBC_backend_cmdstan_sample( model_gamma_2, iter_warmup = iter_warmup, iter_sampling = iter_sampling, chains = 2) } else { model_gamma_2 <- stan_model("stan/bad_parametrization2.stan") backend_gamma_2 <- SBC_backend_rstan_sample( model_gamma_2, iter = iter_sampling + iter_warmup, warmup = iter_warmup, chains = 2) }
results_gamma2 <- compute_SBC(datasets_gamma, backend_gamma_2, cache_mode = "results", cache_location = file.path(cache_dir, "model2"))
No obvious problems here, but if we wanted to be sure, we should have ran a lot more simulations.
plot_ecdf_diff(results_gamma2) plot_rank_hist(results_gamma2, bins = 10) plot_ecdf(results_gamma2)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.