\renewcommand{\vec}[1]{\boldsymbol{#1}}
knitr::opts_chunk$set( collapse = TRUE, comment = "#>", cache = TRUE, fig.width = 7 ) paste0_vec <- function(...){ paste(..., collapse = ',') }
This vignette introduces the package.
Let's first create some data.
We will use the package rsamplestudy
to generate a population and to extract the reference/questioned samples.
Here, we will be generating data from a "Dirichlet-Dirichlet" model.
Note that this has nothing to do with the models implemented in this package.
library(rstanBF) library(dplyr) library(rsamplestudy) # set.seed(123) p <- 5 n <- 100 m <- 10 alpha <- rep(p, p) list_pop <- rsamplestudy::fun_rdirichlet_population(n, m, p, alpha = alpha) list_pop
The package compares two sets of items (reference and questioned) and evaluates their support to a pair of hypotheses.
We simulate a situation using the function make_dataset_splits
from the package rsamplestudy
:
k_ref <- 10 k_quest <- 10 # We sample from all sources sources <- unique(list_pop$df_pop$source) list_samples_diff <- list_pop$df_pop %>% rsamplestudy::make_dataset_splits(k_ref, k_quest) source_same <- unique(list_samples_diff$df_reference$source) list_samples_same <- list_pop$df_pop %>% rsamplestudy::make_dataset_splits(k_ref, k_quest, source_ref = source_same, source_quest = source_same) list_samples_H <- list(H1 = list_samples_same, H2 = list_samples_diff)
These need to be converted to a friendly format (matrices and lists) from rsamplestudy
output:
list_data_diff <- stanBF_prepare_rsamplestudy_data(list_pop, list_samples_diff) list_data_same <- stanBF_prepare_rsamplestudy_data(list_pop, list_samples_same)
We see that each list contains the full set of observations, the set of indexes to reference items, and the set of indexes to the questioned items:
str(list_data_diff)
From now on, we are ready to use the rstanBF
package.
The package supports several models, identified by their short names:
available_models()
More details are also available in data frame form:
df_models <- available_models(verbose = TRUE) df_models
and also in text form:
available_models(verbose = TRUE, do_print = TRUE)
We choose the Dirichlet-Dirichlet model.
model <- 'DirDir'
The only hyperparameter which must be provided is $\alpha$, the Dirichlet parameter at the hyperprior level.
It is known since the data has been generated: let's use it, for now.
list_hyperpriors <- list(alpha = as.numeric(list_pop$alpha))
If one wants to elicit a value for the hyperparameter(s) required by the model, the package supplies the function stanBF_elicit_hyperpriors
.
See the documentation for details:
stanBF_elicit_hyperpriors(list_samples_same$df_background, model, mode_hyperparameter = 'ML', mode_ML = 'naive') stanBF_elicit_hyperpriors(list_samples_same$df_background, model, mode_hyperparameter = 'ML', mode_ML = 'ML')
Compare the estimate with the generating value:
$\alpha = \left( r paste0_vec(list_pop$alpha)
\right)$
We need to set up the HMC parameters, first:
n.iter <- 2000 n.burnin <- 200 # n.iter <- 10000 # n.burnin <- 1000 n.chains <- 2 n.cores <- 2
Now we are ready! Let's sample:
stanBF_obj_same <- rstanBF::compute_BF_Stan(list_data_same, model, list_hyperpriors, data_other = NULL, n.iter, n.burnin, n.chains, n.cores, silent = TRUE) stanBF_obj_diff <- rstanBF::compute_BF_Stan(list_data_diff, model, list_hyperpriors, data_other = NULL, n.iter, n.burnin, n.chains, n.cores, silent = TRUE)
stanBF_obj_same stanBF_obj_diff
The objects contain the Bayes Factors:
stanBF_obj_same$BF stanBF_obj_diff$BF
rstanBF objects contain stanfit
objects (one under $H_1$, one under $H_2$), which are actually the outputs of Stan sampling procedure.
They are available under the stanfit
property:
stanBF_obj_same$stanfit$H1 stanBF_obj_same$stanfit$H2
One can either further process them using standard tools (rstan
post-processing functions, coda
, etc.), or use the supplied functions in the package.
These functions usually do different things depending on the model (e.g. often only the likelihood).
What follows must be implemented for each model.
For now, only stanBF_turn
models are completely specified.
Returned objects often implement a samples
method to extract samples from the interesting posteriors:
head(rstanBF::samples(stanBF_obj_same)) head(rstanBF::samples(stanBF_obj_diff))
Returned objects often implement a plot_posteriors
method:
rstanBF::plot_posteriors(stanBF_obj_same, 'theta') rstanBF::plot_posteriors(stanBF_obj_diff, 'theta')
Models often implement draws from the prior and posterior predictive distributions.
Prior predictive: the names of the variables in the Stan source start with sim_
.
Posterior predictive: the names of the variables in the Stan source start with pred_
.
One can use the prior_pred
and posterior_pred
methods to extract and format them:
df_prior_pred_same <- prior_pred(stanBF_obj_same) head(df_prior_pred_same, 20)
This is interpreted as an example of a dataset, generated according to the model, before having observed any data.
They can be used to assess if the specified hyperparameters are too restrictive or too wide.
df_posterior_pred_same <- posterior_pred(stanBF_obj_same) head(df_posterior_pred_same, 20)
stanfit
objects implement several plotting functions (see here for a reference).
For example, traceplots:
# A stanfit object # stanBF_obj_same$stanfit$H1 # Plot using all params stanBF_obj_diff$stanfit$H1 %>% rstan::stan_trace() stanBF_obj_diff$stanfit$H2 %>% rstan::stan_trace()
The popular bayesplot package can be used to produce plots to assess convergence of the MCMC chains. It includes an interface to stanfit
objects, returned by rstan.
It is easy to use bayesplot
with this package.
One must first extract all MCMC samples using the rstan
tools:
library(bayesplot) # A stanfit object # stanBF_obj_same$stanfit$H1 # All variables except the logPosterior all_vars_no_lp <- setdiff(names(stanBF_obj_same$stanfit$H2), 'lp__') all_vars_no_lp outputs_H1 <- stanBF_obj_same$stanfit$H1 %>% rstan::As.mcmc.list() outputs_H2 <- stanBF_obj_same$stanfit$H2 %>% rstan::As.mcmc.list() # Equivalent: # outputs_H1 <- stanBF_obj_same$stanfit$H1 %>% rstan::extract(pars = all_vars_no_lp) # outputs_H2 <- stanBF_obj_same$stanfit$H2 %>% rstan::extract(pars = all_vars_no_lp)
Then, one can use bayesplot
methods. (see the documentation)
As an example, we can plot the posterior densities across MCMC chains, selecting only the theta_ref
parameters:
# Density across chains # use a regular expression to subset the interesting variables outputs_H1 %>% bayesplot::mcmc_dens_overlay(regex_pars = '^theta_ref')
bayesplot
contains methods to overlay generating values to posterior distributions.
Let's try for $\vec{\theta_i}$ (see the Dirichlet-Dirichlet model vignette).
Using data generated under $H_1$:
# Collect the true values df_source_same <- list_pop$df_sources %>% filter(source == source_same) %>% select(-source) df_source_same true_vals <- as.numeric(df_source_same) draws_same <- stanBF_obj_same$stanfit$H1 %>% as.matrix(pars = 'theta_ref') bayesplot::mcmc_recover_intervals(draws_same, true_vals)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.