\newcommand{\hr}[1]{{#1}{\text{ref}}} \newcommand{\hq}[1]{{#1}{\text{quest}}} \newcommand{\hrq}[1]{{#1}_{\text{ref,quest}}} \renewcommand{\vec}[1]{\boldsymbol{#1}} \newcommand{\tot}[1]{\overline{#1}} \newcommand{\simiid}{\overset{iid}{~\sim~}} \newcommand{\BF}{\mathrm{BF}} \newcommand{\indep}{~\perp!!!\perp~} \newcommand{\EE}{\mathop{\mathbb{E}}} \newcommand{\Var}{\mathop{\mathrm{Var}}} \newcommand{\dirichlet}[1]{\text{Dirichlet}{(#1)}}
set.seed(123) knitr::opts_chunk$set( collapse = TRUE, cache = TRUE, comment = "#>", fig.width = 7, cache.extra = knitr::rand_seed, autodep=TRUE )
library(rstanBF) library(rsamplestudy) library(dplyr)
This package implements the Dirichlet-Dirichlet model.
In this package, the model is referred by its short name: 'DirDir'
.
Consider Dirichlet samples $X_i$ from $m$ different sources. Each source is sampled $n$ times.
Dirichlet parameter are also assumed to be sampled from Dirichlet distributions.
We assume that $\vec{\alpha}$ is known.
Let's generate some data using rsamplestudy
:
n <- 50 # Number of items per source m <- 5 # Number of sources p <- 4 # Number of variables per item # The generating hyperparameter: components not too small alpha <- c(1, 0.7, 1.3, 1) list_pop <- fun_rdirichlet_population(n, m, p, alpha = alpha)
Data:
r p
$ r n
$ samples from $m = r m
$ sources: in total, $r n*m
$ samples.r alpha
\right)$list_pop$df_pop %>% head(10)
Sources:
list_pop$df_sources %>% head(10)
We then need to simulate a situation where two sets of samples (reference and questioned data) are recovered.
These sets are compared, and the results evaluated using the Bayesian approach.
We want to evaluate two hypotheses:
One can generate these sets from the full data using rsamplestudy
.
Let's fix the reference source:
k_ref <- 10 k_quest <- 10 source_ref <- sample(list_pop$df_pop$source, 1) # The reference source source_ref
list_samples <- make_dataset_splits(list_pop$df_pop, k_ref, k_quest, source_ref = source_ref) # The true (unseen) questioned sources list_samples$df_questioned$source
We want to use the background data to evaluate the hyperpriors, i.e., the between-source Dirichlet parameter.
Notice that we have r nrow(list_samples$df_background)
samples, with r nrow(list_pop$df_pop)
as the population size.
This should ensure that our hyperparameter estimates are good enough.
One can start to estimate the parameters for each source, although this is not required by the rstanBF
package.
According to the model, this should be an estimate of the Dirichlet parameters $\vec{\theta}_i$.
The package contains functions to compute ML estimators of Dirichlet parameters.
For now, they are hidden from the general namespace, but can still be called if needed.
Naive estimator:
df_sources_est <- rstanBF:::fun_estimate_Dirichlet_from_samples(list_samples$df_background, use = 'naive') df_sources_est
ML estimator:
df_sources_est_ML <- rstanBF:::fun_estimate_Dirichlet_from_samples(list_samples$df_background, use = 'ML') df_sources_est_ML
Compare with the real sources $\vec{\theta}_i$:
list_pop$df_sources
The ML estimator is much more accurate.
We suppose that the within-source ML estimates are samples from the same hyper-source, so the procedure can be repeated.
We use again the ML estimator:
df_alpha_ML_ML <- df_sources_est_ML %>% dplyr::select(-source) %>% rstanBF:::fun_estimate_Dirichlet_from_single_source(use = 'ML') df_alpha_ML_ML
Compare with the real Dirichlet hyperparameter:
$\vec{\alpha} = \left( r list_pop$alpha
\right)$
rstanBF provides stanBF_elicit_hyperpriors
, a function to estimate the hyperparameters according to a specified model:
list_hyper <- stanBF_elicit_hyperpriors(list_samples$df_background, 'DirDir', mode_hyperparameter = 'ML') list_hyper
Once all is set, one can run the HMC sampler:
list_data <- stanBF_prepare_rsamplestudy_data(list_pop, list_samples) obj_StanBF <- compute_BF_Stan(data = list_data, model = 'DirDir', hyperpriors = list_hyper)
The results:
obj_StanBF
Posteriors:
plot_posteriors(obj_StanBF)
Get the posterior predictive samples under all hypotheses:
df_posterior_pred <- posterior_pred(obj_StanBF) head(df_posterior_pred, 10)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.