In lgaborini/bayessource: Bayesian approach to evaluate whether two datasets come from the same population

\providecommand{\simiid}{~\overset{iid}{\sim}~} \providecommand{\dd}{\text{ d}}

This vignette explains briefly how to use this package.

library(bayessource)
knitr::opts_chunk$set(fig.datapi = 96)

Package goal

The package has been conceived to evaluate the hypothesis whether two sets of items (a reference set and a questioned set) belong to the same population or not.
Each item is described with a vector of $p$ measurements.

The evaluation is performed using Bayesian statistics, particularly Gibbs sampling.

The model we applied is that these populations are represented as samples from Multivariate Gaussian distributions, with unknown means and covariance matrices.

Particular care has been given in order to obtain strongly performing functions. The main core is written using Rcpp.

For more theoretical details, see [@Bozza2008Probabilistic].

Package contents

The package supplies several functions to compute common densities (e.g. Wishart, Inverted Wishart), and two functions to compute the marginal likelihood of observations, as well as the full Likelihood Ratio (marginalLikelihood(), samesource_C()).

Main functions

make_priors_and_init(): obtain hyperpriors and initialization from a background dataset
marginalLikelihood(): fast computation of the marginal likelihood
samesource_C(): fast computation of the Bayes Factor (same source vs. different sources)
mcmc_postproc(): collect and tidy posterior samples from this package

Usage

This section describes the usage on some made-up data.

Sample data

We create some dummy data, generated by two bivariate Gaussian distributions with known means and covariances. Covariance matrices are generated using the bundled rwish() function, to obtain invertible matrices with ease.

set.seed(123)

p <- 2
mean.quest <- c(0, 0)
mean.ref <- c(2, 1)
cov.quest <- rwish(3, diag(2))
cov.ref <- rwish(5, diag(2))
n.quest <- 20
n.ref <- n.quest

df.quest <- data.frame(rmvnorm(n.quest, mean.quest, cov.quest))
df.ref <- data.frame(rmvnorm(n.ref, mean.ref, cov.ref))

Here are the datasets:

library(ggplot2)
ggplot() + 
   geom_point(aes(x = X1, y = X2), col = 'red', data = df.quest) +
   geom_point(aes(x = X1, y = X2), col = 'blue', data = df.ref)

It is clear that the two samples come from different populations, hence we expect a low likelihood-ratio value.

Model and prior specification

The package implements a two-sample Bayesian Hierarchical model with Gaussian multivariate likelihoods, and Inverse-Wishart prior on the covariance matrices.
The theoretical details are specified in [@Bozza2008Probabilistic].

Background model

Let us recall the model definition.

We suppose to have background observations from a set of $m$ sources.
Each observation lies in a $p$-dimensional space.

We note with $X_{ij}$ the $j$-th sample from the $i$-th source, $i = 1, \ldots, m$. The $i$-th source is assumed to generate data from a Multivariate Normal, with mean vector $\theta_i$, and covariance matrix $W_i$.

\begin{align} X_{ij} \; | \; \theta_i, \; W_i &\sim N_p(\theta_i, W_i) \quad \forall j = 1, \ldots, n \ \theta_i \; | \; \mu, B &\sim N_p(\mu, B) \ W_i \; | \; U, n_w &\sim IW(U, n_w) \end{align}

where $n_w > 2\,p$, and $U$ is set s.t. $$E[W_i] = \frac{U}{n_w - 2(p + 1)}$$ (parametrization according to [@Press2012Applied]).

Computation

Since the full conditionals are conjugated, a Gibbs sampler can be implemented for this model. See the details in [@Bozza2008Probabilistic].

Prior elicitation

As the model is Bayesian, we are required to specify the hyperparameters $\mu, B, U, n_w$, as well as the Gibbs chain initialization $W_i$.
Notice that inference is propagated by supplying the inverses of covariance matrices, i.e. $B^{-1}$ and $W_i^{-1}$.

Example (hyper)priors can be set like this:

eps <- 0.1
B.inv <- eps*diag(p)
W.inv.1 <- eps*diag(p)
W.inv.2 <- eps*diag(p)
U <- eps*diag(p)
nw <- 2*(p + 1) + 1
mu <- (mean.quest + mean.ref)/2

The package proves the function make_priors_and_init() to supply these parameters based on a background dataset.

It returns a list containing estimates for $\mu$, $B^{-1}$, $U$, the initialization for $W_i^{-1}$, and the smallest possible $n_w$ such that the matrices are invertible.

The prior elicitation is described in the dedicated vignette: vignette("Prior elicitation").

Observation model

The package has been written to evaluate whether two sets of observations come from the same source ($H_p$) or not ($H_d$). Background information (the hyperparameters) is noted with letter $I = \left{\mu, B, U, n_w \right}$.

We note with $Y_{ij}$ the $j$-th sample from the $i$-th considered set, where $i \in [\text{reference}, \text{questioned}]$.
Collectively, we shorten $Y_i = \left{ Y_{ij} \right}_j$.

Bayes Factor

The Bayes Factor for this problem can be written as:

$$\text{BF} = \frac{ p(Y_{\text{reference}}, Y_{\text{questioned}} \mid I, H_p) }{p(Y_{\text{reference}}, Y_{\text{questioned}} \mid I, H_d)}$$

Notice that the numerator is a marginal likelihood:

$$p(Y_{\text{reference}}, Y_{\text{questioned}} \mid I, H_p) = \int p( Y_{\text{reference}}, Y_{\text{questioned}} \mid \theta, W ) p(\theta, W \mid \mu, B, U, n_w) \dd \theta \dd W $$

and the denominator is a product of marginal likelihoods (assuming independence between sources under $H_d$):

\begin{align} p(Y_{\text{reference}}, Y_{\text{questioned}} \mid I, H_d) &= p(Y_{\text{reference}} \mid I, H_d) p(Y_{\text{questioned}} \mid I, H_d) = \ &= \left( \int p( Y_{\text{reference}} \mid \theta, W ) p(\theta, W \mid \mu, B, U, n_w) \dd \theta \dd W \right) \left( \int p( Y_{\text{questioned}} \mid \theta, W ) p(\theta, W \mid \mu, B, U, n_w) \dd \theta \dd W \right) \end{align}

Computation

The marginal likelihood is computed with the function marginalLikelihood() from the Gibbs sampler output using [@Chib1995Marginal].
E.g. here we compute $p(Y_{\text{questioned}} \mid I, H_d)$:

burn.in = 1000
n.iter = 10000

marginalLikelihood(as.matrix(df.quest), n.iter, B.inv, W.inv.1, U, nw, mu, burn.in, verbose = FALSE)

the LR value can be computed as well, now considering two samples. The function is samesource_C(): (W.inv.2 is used only for chain initalisation).

samesource_C(as.matrix(df.quest), as.matrix(df.ref), n.iter, B.inv, W.inv.1, W.inv.2, U, nw, mu, burn.in, verbose = FALSE)

Notice how low it is compared to using a subset of the reference data as the questioned items (same source, supporting $H_p$):

samesource_C(as.matrix(df.ref)[1:20,], as.matrix(df.ref), n.iter, B.inv, W.inv.1, W.inv.2, U, nw, mu, burn.in, verbose = FALSE)

All marginal likelihoods in the BF formula can also be obtained by specifying marginals = TRUE:

samesource_C(
   as.matrix(df.quest),
   as.matrix(df.ref), 
   n.iter, B.inv, W.inv.1, W.inv.2, U, nw, mu, burn.in, verbose = FALSE, marginals = TRUE
)

Diagnostics

The package supports the output of the entire chain for $\theta_i$ and $W^{-1}_i$ (i.e., the inverse of $W_i$).

At the time, this is possible only during the computation of a single marginal likelihood, in this case the one related to the sample from the questioned population, namely:

$\left( \int p( Y_{\text{questioned}} \mid \theta, W ) p(\theta, W \mid \mu, B, U, n_w) \dd \theta \dd W \right)$

results <- marginalLikelihood(
   as.matrix(df.quest), 
   n.iter, B.inv, W.inv.1, U, nw, mu, burn.in, 
   output.mcmc = TRUE
   )

Notice that results now is a list, where results$value holds the marginal likelihood value, and results$mcmc is the coda object which holds the chain output.

head(results$mcmc, 4)

Remember that R is column-major: W.inv.1 is $W^{-1}_1(1,1)$, W.inv.2 is $W^{-1}_1(2,1)$ and so on.

Using standard coda tools, we can perform diagnostics, such as summaries:

library(coda)
summary(results$mcmc)

and traceplots:

traceplot(results$mcmc)

We can recover the original matrices by hand, reshaping the desired columns (e.g. for W.inv) into a matrix/3D array:

n.samples <- nrow(results$mcmc)
W.inv.samples <- results$mcmc[, paste0('W.inv.', seq(1:(p^2)))]
head(W.inv.samples, 5)
W.inv.samples.cube <- array(W.inv.samples, dim = c(n.samples, p, p))
dim(W.inv.samples.cube)

or using the supplied post-processing function mcmc_postproc():

list.postproc <- mcmc_postproc(results$mcmc, compute.ML = TRUE, cumulative = TRUE)
str(list.postproc$theta.samples)
str(list.postproc$W.samples)

It also allows for easy computation of posterior point estimators:

list.postproc$theta.samples.ML
list.postproc$W.samples.ML

More advanced diagnostics are available with the recent {bayesplot} package:

suppressPackageStartupMessages(library(bayesplot, quietly = TRUE))

bayesplot::mcmc_areas(x = results$mcmc)

References

lgaborini/bayessource documentation built on Nov. 9, 2021, 2:10 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

lgaborini/bayessource
Bayesian approach to evaluate whether two datasets come from the same population

In lgaborini/bayessource: Bayesian approach to evaluate whether two datasets come from the same population

Package goal

Package contents

Main functions

Usage

Sample data

Model and prior specification

Background model

Computation

Prior elicitation

Observation model

Bayes Factor

Computation

Diagnostics

References

R Package Documentation

Browse R Packages

We want your feedback!

lgaborini/bayessource Bayesian approach to evaluate whether two datasets come from the same population

In lgaborini/bayessource: Bayesian approach to evaluate whether two datasets come from the same population

Package goal

Package contents

Main functions

Usage

Sample data

Model and prior specification

Background model

Computation

Prior elicitation

Observation model

Bayes Factor

Computation

Diagnostics

References

R Package Documentation

Browse R Packages

We want your feedback!

lgaborini/bayessource
Bayesian approach to evaluate whether two datasets come from the same population