\providecommand{\simiid}{~\overset{iid}{\sim}~} \providecommand{\dd}{\text{ d}}
This vignette explains briefly how to use this package.
library(bayessource) knitr::opts_chunk$set(fig.datapi = 96)
The package has been conceived to evaluate the hypothesis whether two sets of items (a reference set and a questioned set) belong to the same population or not.
Each item is described with a vector of $p$ measurements.
The evaluation is performed using Bayesian statistics, particularly Gibbs sampling.
The model we applied is that these populations are represented as samples from Multivariate Gaussian distributions, with unknown means and covariance matrices.
Particular care has been given in order to obtain strongly performing functions. The main core is written using Rcpp
.
For more theoretical details, see [@Bozza2008Probabilistic].
The package supplies several functions to compute common densities (e.g. Wishart, Inverted Wishart), and two functions to compute the marginal likelihood of observations, as well as the full Likelihood Ratio (marginalLikelihood()
, samesource_C()
).
make_priors_and_init()
: obtain hyperpriors and initialization from a background datasetmarginalLikelihood()
: fast computation of the marginal likelihoodsamesource_C()
: fast computation of the Bayes Factor (same source vs. different sources)mcmc_postproc()
: collect and tidy posterior samples from this packageThis section describes the usage on some made-up data.
We create some dummy data, generated by two bivariate Gaussian distributions with known means and covariances.
Covariance matrices are generated using the bundled rwish()
function, to obtain invertible matrices with ease.
set.seed(123) p <- 2 mean.quest <- c(0, 0) mean.ref <- c(2, 1) cov.quest <- rwish(3, diag(2)) cov.ref <- rwish(5, diag(2)) n.quest <- 20 n.ref <- n.quest df.quest <- data.frame(rmvnorm(n.quest, mean.quest, cov.quest)) df.ref <- data.frame(rmvnorm(n.ref, mean.ref, cov.ref))
Here are the datasets:
library(ggplot2) ggplot() + geom_point(aes(x = X1, y = X2), col = 'red', data = df.quest) + geom_point(aes(x = X1, y = X2), col = 'blue', data = df.ref)
It is clear that the two samples come from different populations, hence we expect a low likelihood-ratio value.
The package implements a two-sample Bayesian Hierarchical model with Gaussian multivariate likelihoods, and Inverse-Wishart prior on the covariance matrices.
The theoretical details are specified in [@Bozza2008Probabilistic].
Let us recall the model definition.
We suppose to have background observations from a set of $m$ sources.
Each observation lies in a $p$-dimensional space.
We note with $X_{ij}$ the $j$-th sample from the $i$-th source, $i = 1, \ldots, m$. The $i$-th source is assumed to generate data from a Multivariate Normal, with mean vector $\theta_i$, and covariance matrix $W_i$.
\begin{align} X_{ij} \; | \; \theta_i, \; W_i &\sim N_p(\theta_i, W_i) \quad \forall j = 1, \ldots, n \ \theta_i \; | \; \mu, B &\sim N_p(\mu, B) \ W_i \; | \; U, n_w &\sim IW(U, n_w) \end{align}
where $n_w > 2\,p$, and $U$ is set s.t. $$E[W_i] = \frac{U}{n_w - 2(p + 1)}$$ (parametrization according to [@Press2012Applied]).
Since the full conditionals are conjugated, a Gibbs sampler can be implemented for this model. See the details in [@Bozza2008Probabilistic].
As the model is Bayesian, we are required to specify the hyperparameters $\mu, B, U, n_w$, as well as the Gibbs chain initialization $W_i$.
Notice that inference is propagated by supplying the inverses of covariance matrices, i.e. $B^{-1}$ and $W_i^{-1}$.
Example (hyper)priors can be set like this:
eps <- 0.1 B.inv <- eps*diag(p) W.inv.1 <- eps*diag(p) W.inv.2 <- eps*diag(p) U <- eps*diag(p) nw <- 2*(p + 1) + 1 mu <- (mean.quest + mean.ref)/2
The package proves the function make_priors_and_init()
to supply these parameters based on a background dataset.
It returns a list containing estimates for $\mu$, $B^{-1}$, $U$, the initialization for $W_i^{-1}$, and the smallest possible $n_w$ such that the matrices are invertible.
The prior elicitation is described in the dedicated vignette: vignette("Prior elicitation")
.
The package has been written to evaluate whether two sets of observations come from the same source ($H_p$) or not ($H_d$). Background information (the hyperparameters) is noted with letter $I = \left{\mu, B, U, n_w \right}$.
We note with $Y_{ij}$ the $j$-th sample from the $i$-th considered set, where $i \in [\text{reference}, \text{questioned}]$.
Collectively, we shorten $Y_i = \left{ Y_{ij} \right}_j$.
The Bayes Factor for this problem can be written as:
$$\text{BF} = \frac{ p(Y_{\text{reference}}, Y_{\text{questioned}} \mid I, H_p) }{p(Y_{\text{reference}}, Y_{\text{questioned}} \mid I, H_d)}$$
Notice that the numerator is a marginal likelihood:
$$p(Y_{\text{reference}}, Y_{\text{questioned}} \mid I, H_p) = \int p( Y_{\text{reference}}, Y_{\text{questioned}} \mid \theta, W ) p(\theta, W \mid \mu, B, U, n_w) \dd \theta \dd W $$
and the denominator is a product of marginal likelihoods (assuming independence between sources under $H_d$):
\begin{align} p(Y_{\text{reference}}, Y_{\text{questioned}} \mid I, H_d) &= p(Y_{\text{reference}} \mid I, H_d) p(Y_{\text{questioned}} \mid I, H_d) = \ &= \left( \int p( Y_{\text{reference}} \mid \theta, W ) p(\theta, W \mid \mu, B, U, n_w) \dd \theta \dd W \right) \left( \int p( Y_{\text{questioned}} \mid \theta, W ) p(\theta, W \mid \mu, B, U, n_w) \dd \theta \dd W \right) \end{align}
The marginal likelihood is computed with the function marginalLikelihood()
from the Gibbs sampler output using [@Chib1995Marginal].
E.g. here we compute $p(Y_{\text{questioned}} \mid I, H_d)$:
burn.in = 1000 n.iter = 10000 marginalLikelihood(as.matrix(df.quest), n.iter, B.inv, W.inv.1, U, nw, mu, burn.in, verbose = FALSE)
the LR value can be computed as well, now considering two samples.
The function is samesource_C()
: (W.inv.2
is used only for chain initalisation).
samesource_C(as.matrix(df.quest), as.matrix(df.ref), n.iter, B.inv, W.inv.1, W.inv.2, U, nw, mu, burn.in, verbose = FALSE)
Notice how low it is compared to using a subset of the reference data as the questioned items (same source, supporting $H_p$):
samesource_C(as.matrix(df.ref)[1:20,], as.matrix(df.ref), n.iter, B.inv, W.inv.1, W.inv.2, U, nw, mu, burn.in, verbose = FALSE)
All marginal likelihoods in the BF formula can also be obtained by specifying marginals = TRUE
:
samesource_C( as.matrix(df.quest), as.matrix(df.ref), n.iter, B.inv, W.inv.1, W.inv.2, U, nw, mu, burn.in, verbose = FALSE, marginals = TRUE )
The package supports the output of the entire chain for $\theta_i$ and $W^{-1}_i$ (i.e., the inverse of $W_i$).
At the time, this is possible only during the computation of a single marginal likelihood, in this case the one related to the sample from the questioned population, namely:
$\left( \int p( Y_{\text{questioned}} \mid \theta, W ) p(\theta, W \mid \mu, B, U, n_w) \dd \theta \dd W \right)$
results <- marginalLikelihood( as.matrix(df.quest), n.iter, B.inv, W.inv.1, U, nw, mu, burn.in, output.mcmc = TRUE )
Notice that results
now is a list
, where results$value
holds the marginal likelihood value, and results$mcmc
is the coda
object which holds the chain output.
head(results$mcmc, 4)
Remember that R is column-major: W.inv.1
is $W^{-1}_1(1,1)$, W.inv.2
is $W^{-1}_1(2,1)$ and so on.
Using standard coda
tools, we can perform diagnostics, such as summaries:
library(coda) summary(results$mcmc)
and traceplots:
traceplot(results$mcmc)
We can recover the original matrices by hand, reshaping the desired columns (e.g. for W.inv
) into a matrix/3D array:
n.samples <- nrow(results$mcmc) W.inv.samples <- results$mcmc[, paste0('W.inv.', seq(1:(p^2)))] head(W.inv.samples, 5) W.inv.samples.cube <- array(W.inv.samples, dim = c(n.samples, p, p)) dim(W.inv.samples.cube)
or using the supplied post-processing function mcmc_postproc()
:
list.postproc <- mcmc_postproc(results$mcmc, compute.ML = TRUE, cumulative = TRUE) str(list.postproc$theta.samples) str(list.postproc$W.samples)
It also allows for easy computation of posterior point estimators:
list.postproc$theta.samples.ML list.postproc$W.samples.ML
More advanced diagnostics are available with the recent {bayesplot} package:
suppressPackageStartupMessages(library(bayesplot, quietly = TRUE)) bayesplot::mcmc_areas(x = results$mcmc)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.