\renewcommand{\vec}[1]{\boldsymbol{#1}}

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  cache = TRUE,
  fig.width = 7
)

paste0_vec <- function(...){ paste(..., collapse = ',') }

This vignette introduces the package.

Data generation

Let's first create some data.
We will use the package rsamplestudy to generate a population and to extract the reference/questioned samples.

Here, we will be generating data from a "Dirichlet-Dirichlet" model.
Note that this has nothing to do with the models implemented in this package.

library(rstanBF)
library(dplyr)
library(rsamplestudy)


# set.seed(123)
p <- 5
n <- 100
m <- 10
alpha <- rep(p, p)

list_pop <- rsamplestudy::fun_rdirichlet_population(n, m, p, alpha = alpha)
list_pop

Sampling generation

The package compares two sets of items (reference and questioned) and evaluates their support to a pair of hypotheses.

We simulate a situation using the function make_dataset_splits from the package rsamplestudy:

k_ref <- 10
k_quest <- 10

# We sample from all sources
sources <- unique(list_pop$df_pop$source)
list_samples_diff <- list_pop$df_pop %>% rsamplestudy::make_dataset_splits(k_ref, k_quest)

source_same <- unique(list_samples_diff$df_reference$source)
list_samples_same <- list_pop$df_pop %>% rsamplestudy::make_dataset_splits(k_ref, k_quest, source_ref = source_same, source_quest = source_same)

list_samples_H <- list(H1 = list_samples_same, H2 = list_samples_diff)

These need to be converted to a friendly format (matrices and lists) from rsamplestudy output:

list_data_diff <- stanBF_prepare_rsamplestudy_data(list_pop, list_samples_diff)
list_data_same <- stanBF_prepare_rsamplestudy_data(list_pop, list_samples_same)

We see that each list contains the full set of observations, the set of indexes to reference items, and the set of indexes to the questioned items:

str(list_data_diff)

From now on, we are ready to use the rstanBF package.

Data model

The package supports several models, identified by their short names:

available_models()

More details are also available in data frame form:

df_models <- available_models(verbose = TRUE)
df_models

and also in text form:

available_models(verbose = TRUE, do_print = TRUE)

Hyperparameter elicitation

We choose the Dirichlet-Dirichlet model.

model <- 'DirDir'

The only hyperparameter which must be provided is $\alpha$, the Dirichlet parameter at the hyperprior level.

It is known since the data has been generated: let's use it, for now.

list_hyperpriors <- list(alpha = as.numeric(list_pop$alpha))

Using functions

If one wants to elicit a value for the hyperparameter(s) required by the model, the package supplies the function stanBF_elicit_hyperpriors.
See the documentation for details:

stanBF_elicit_hyperpriors(list_samples_same$df_background, model, mode_hyperparameter = 'ML', mode_ML = 'naive')
stanBF_elicit_hyperpriors(list_samples_same$df_background, model, mode_hyperparameter = 'ML', mode_ML = 'ML')

Compare the estimate with the generating value:

$\alpha = \left( r paste0_vec(list_pop$alpha) \right)$

Sampling

We need to set up the HMC parameters, first:

n.iter <- 2000
n.burnin <- 200
# n.iter <- 10000
# n.burnin <- 1000

n.chains <- 2
n.cores <- 2

Now we are ready! Let's sample:

stanBF_obj_same <- rstanBF::compute_BF_Stan(list_data_same, model, list_hyperpriors, data_other = NULL, n.iter, n.burnin, n.chains, n.cores, silent = TRUE)
stanBF_obj_diff <- rstanBF::compute_BF_Stan(list_data_diff, model, list_hyperpriors, data_other = NULL, n.iter, n.burnin, n.chains, n.cores, silent = TRUE)
stanBF_obj_same
stanBF_obj_diff

Bayes Factor

The objects contain the Bayes Factors:

stanBF_obj_same$BF
stanBF_obj_diff$BF

Stan objects

rstanBF objects contain stanfit objects (one under $H_1$, one under $H_2$), which are actually the outputs of Stan sampling procedure.

They are available under the stanfit property:

stanBF_obj_same$stanfit$H1
stanBF_obj_same$stanfit$H2

One can either further process them using standard tools (rstan post-processing functions, coda, etc.), or use the supplied functions in the package.

These functions usually do different things depending on the model (e.g. often only the likelihood).

Post-processing

What follows must be implemented for each model.
For now, only stanBF_turn models are completely specified.

Posteriors

Returned objects often implement a samples method to extract samples from the interesting posteriors:

head(rstanBF::samples(stanBF_obj_same))
head(rstanBF::samples(stanBF_obj_diff))

Plots

Returned objects often implement a plot_posteriors method:

rstanBF::plot_posteriors(stanBF_obj_same, 'theta')
rstanBF::plot_posteriors(stanBF_obj_diff, 'theta')

Prior and posterior predictive distributions

Models often implement draws from the prior and posterior predictive distributions.

Prior predictive: the names of the variables in the Stan source start with sim_.
Posterior predictive: the names of the variables in the Stan source start with pred_.

One can use the prior_pred and posterior_pred methods to extract and format them:

df_prior_pred_same <- prior_pred(stanBF_obj_same)
head(df_prior_pred_same, 20)

This is interpreted as an example of a dataset, generated according to the model, before having observed any data.

They can be used to assess if the specified hyperparameters are too restrictive or too wide.

df_posterior_pred_same <- posterior_pred(stanBF_obj_same)
head(df_posterior_pred_same, 20)

rstan functions

stanfit objects implement several plotting functions (see here for a reference).
For example, traceplots:

# A stanfit object
# stanBF_obj_same$stanfit$H1

# Plot using all params
stanBF_obj_diff$stanfit$H1 %>% rstan::stan_trace()
stanBF_obj_diff$stanfit$H2 %>% rstan::stan_trace()

bayesplot

The popular bayesplot package can be used to produce plots to assess convergence of the MCMC chains. It includes an interface to stanfit objects, returned by rstan.

It is easy to use bayesplot with this package. One must first extract all MCMC samples using the rstan tools:

library(bayesplot)

# A stanfit object
# stanBF_obj_same$stanfit$H1

# All variables except the logPosterior
all_vars_no_lp <- setdiff(names(stanBF_obj_same$stanfit$H2), 'lp__')
all_vars_no_lp

outputs_H1 <- stanBF_obj_same$stanfit$H1 %>% rstan::As.mcmc.list()
outputs_H2 <- stanBF_obj_same$stanfit$H2 %>% rstan::As.mcmc.list()

# Equivalent:
# outputs_H1 <- stanBF_obj_same$stanfit$H1 %>% rstan::extract(pars = all_vars_no_lp)
# outputs_H2 <- stanBF_obj_same$stanfit$H2 %>% rstan::extract(pars = all_vars_no_lp)

Then, one can use bayesplot methods. (see the documentation)
As an example, we can plot the posterior densities across MCMC chains, selecting only the theta_ref parameters:

# Density across chains
# use a regular expression to subset the interesting variables
outputs_H1 %>% bayesplot::mcmc_dens_overlay(regex_pars = '^theta_ref')

Parameter value checks

bayesplot contains methods to overlay generating values to posterior distributions.

Let's try for $\vec{\theta_i}$ (see the Dirichlet-Dirichlet model vignette).

Using data generated under $H_1$:

# Collect the true values
df_source_same <- list_pop$df_sources %>% filter(source == source_same) %>% select(-source)
df_source_same

true_vals <- as.numeric(df_source_same)
draws_same <- stanBF_obj_same$stanfit$H1 %>% 
  as.matrix(pars = 'theta_ref')

bayesplot::mcmc_recover_intervals(draws_same, true_vals)


lgaborini/rstanBF documentation built on March 10, 2021, 1:12 p.m.