library(lemur.pack)
library(extraDistr)
library(knitr)
library(kableExtra)
library(proftools)
library(dplyr)
library(magrittr)

This vignette explains population synthesis for the discrete cases; in particular, we will talk about the situations where subjects are described by one of the following:

1. A single binary characteristic, so each subject is coded as a $0$ or a $1$.
2. A single categorical characteristic with three or more levels.

In both cases, the only piece of information used to inform what the population looks like is a simple random sample of size $n$, which can be thought of as an $n$-vector $\mathbf{x}$, which is a realization of $\mathbf{X}$. Our goal is to synthesize numerous realizations of the population $\mathbf{Y}$ with size $N$. A realization of the population is denoted $\mathbf{y}$, and we will assume that $N$ is known.

Theory

For a population $\mathbf{Y}$, sample $\mathbf{X}$, and parameter set $\theta$, we can use Bayes' theorem to get the distribution for $\mathbf{Y}$, conditioned on our observed sample, $\mathbf{x}$.

\begin{align}
f(\theta, \mathbf{Y} = \mathbf{y} | \mathbf{X} = \mathbf{x}) &= \frac{ f(\mathbf{X} = \mathbf{x} | \theta, \mathbf{Y} = \mathbf{y}) f(\theta, \mathbf{Y} = \mathbf{y}) }{ f(\mathbf{X} = \mathbf{x}) } \\
&\propto f(\mathbf{X} = \mathbf{x} | \theta) f(\mathbf{Y} = \mathbf{y} | \theta) \pi(\theta)
\end{align}

For the univariate discrete case, this is actually quite simple; however, we do have a couple of decisions to make:

1. Which distribution to use for the likelihood $f(\mathbf{X} = \mathbf{x} | \theta)$.
2. Which prior $\pi(\theta)$ to pair with it.

The first question boils down to whether we believe our observed sample came from sampling with or without replacement. Usually, populations of people are sampled without replacement, in which case the proper likelihood would be the $Hypergeometric$ or its multivariate extension, the $Multivariate\ Hypergeometric$. The word multivariate may be confusing here; oddly, the extension of the $Hypergeometric$ distribution to more than two categories is called the $Multivariate\ Hypergeometric$, even though it is still technically univariate.

If the population is sufficiently large, or we know it was sampled with replacement, we may choose to use a $Binomial$ or $Multinomial$ likelihood, instead. Sampling with replacement results in independent samples; sampling without replacement yields samples that are not independent. This document will not go into detail about the difference between these two, but you should be aware that extreme samples (e.g., a sample very close to being all $1$'s or all $0$'s) are more likely when sampling with replacement. In essence, this results in our distribution over $K$ (the number of successes in the population) having fatter tails when we use the $Binomial$ likelihood versus the $Hypergeometric$ likelihood. The table below summarizes which distribution we will use for our likelihood.

| Case | Independence/Replacement: Yes | No |
|------|-------------------------------|----|
| 1 | Binomial | Hypergeometric |
| 2 | Multinomial | MV Hypergeometric |
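
As a quick illustration of the fatter-tails point (not part of the vignette's analysis), we can compare the probability of an extreme sample under the two likelihoods; here we assume a population of $N = 100$ containing $K = 50$ successes and a sample of size $n = 10$:

# Probability that a sample of size 10 is all 1's when half the population is 1's
dbinom(10, size = 10, prob = 0.5)    # with replacement (Binomial): ~9.8e-04
dhyper(10, m = 50, n = 50, k = 10)   # without replacement (Hypergeometric): ~5.9e-04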

Of course, the choice of prior $\pi(\theta)$ depends on the likelihood we choose. This document will focus on the most likely choices: if we are using a $Binomial$ or $Multinomial$ likelihood, we will probably use some form of the conjugate prior, $Beta(a, b)$ or $Dirichlet(\alpha)$, respectively. In particular, the most common choices for the $Binomial$ case are $Beta(1, 1)$, the continuous uniform, and $Beta(1/2, 1/2)$, the Jeffreys prior. Likewise, these have direct extensions in $Dirichlet(1, \dots, 1)$ and $Dirichlet(1/2, \dots, 1/2)$ for the $Multinomial$ case.

As we will see, the multi-category extensions of these distributions do not change much. Thus, we will start with the binary versions and then discuss what changes when we move to three or more categories.

Binomial

Things are very simple if we choose to use a $Binomial$ likelihood. If we apply the above theory to the binomial, we get

\begin{align}
f(\mathbf{Y} = \mathbf{y}, p | \mathbf{X} = \mathbf{x}) &= \frac{ f(\mathbf{X} = \mathbf{x} | \mathbf{Y} = \mathbf{y}, p) f(\mathbf{Y} = \mathbf{y}, p) }{ f(\mathbf{X} = \mathbf{x}) } \\
&\propto f(\mathbf{X} = \mathbf{x} | p ) f(\mathbf{Y} = \mathbf{y} | p) \pi(p) \\
&\propto \pi(p | \mathbf{X} = \mathbf{x}) f(\mathbf{Y} = \mathbf{y} | p)
\end{align}

Using the conjugate $Beta(a, b)$ prior gives us the canonical $Beta(J + a, n - J + b)$ posterior for $p$, where $J$ is the number of successes in the sample and $n$ is the sample size. As we will see, it is fairly straightforward to create an MCMC algorithm to jitter $\mathbf{Y}$ and accept or reject using this posterior.
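
As a minimal sketch (assuming a hypothetical sample with $J = 4$ successes out of $n = 10$, and the Jeffreys prior $a = b = 1/2$), the posterior for $p$ can be evaluated directly:

J <- 4; n <- 10        # hypothetical sample: 4 successes out of 10
a <- 0.5; b <- 0.5     # Jeffreys prior
curve(dbeta(x, J + a, n - J + b), from = 0, to = 1,
      xlab = "p", ylab = "posterior density")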

Multinomial

If we are instead using the $Multinomial$ likelihood, not much changes. Notably, $p$ and $K$ are now vectors of length $c$, where $c$ is the number of categories, and our prior will likely be $Dirichlet(\alpha)$, the conjugate prior for the $Multinomial$ distribution. The new posterior with the conjugate prior is $Dirichlet(J_1 + \alpha_1, \dots, J_c + \alpha_c)$, where $J_i$ is the number of sample observations in category $i$.
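
A small sketch of this posterior, assuming a hypothetical sample with category counts $J = (5, 3, 2)$ and the prior $Dirichlet(1/2, 1/2, 1/2)$; rdirichlet() is from extraDistr, loaded above:

J <- c(5, 3, 2)             # hypothetical sample counts for categories 1, 2, 3
alpha <- rep(0.5, 3)        # Dirichlet(1/2, 1/2, 1/2) prior
p_draws <- rdirichlet(1000, J + alpha)
colMeans(p_draws)           # posterior means of the category probabilities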

Hypergeometric

The common parameterization of the $Hypergeometric$ distribution can make this rather confusing. Instead of working with that, let's define our own.

Let $f(\mathbf{Y}, K, \mathbf{X}, J)$ denote the $Hypergeometric$ probability mass function, where:

- $\mathbf{Y}$ is the population of size $N$, and $K$ is the number of successes ($1$'s) it contains;
- $\mathbf{X}$ is the sample of size $n$, and $J$ is the number of successes it contains.

Now, we can apply the above theory to our specific distribution, and we get

\begin{align}
f(\mathbf{Y} = \mathbf{y}, K | \mathbf{X} = \mathbf{x}, J) &= \frac{ f(\mathbf{X} = \mathbf{x}, J | \mathbf{Y} = \mathbf{y}, K) f(\mathbf{Y} = \mathbf{y}, K) }{ f(\mathbf{X} = \mathbf{x}, J) } \\
&\propto f(\mathbf{X} = \mathbf{x}, J | K ) f(\mathbf{Y} = \mathbf{y} | K) \pi(K) \\
&\propto \pi(K | \mathbf{X} = \mathbf{x}, J) f(\mathbf{Y} = \mathbf{y} | K)
\end{align}

We also need to choose $\pi(K)$, the prior on the number of successes in the population. If we were working with $p$ from the $Binomial$ distribution, as above, we would likely want some version of the conjugate prior, $Beta(a, b)$; here we need an analogous prior placed directly on $K$.

From Jeffreys (1946, 1961), the induced prior on $K$ can be found by constructing a hierarchical prior of the form

$$ Binom(K | p) Beta(p | a, b) $$

and then marginalizing to find $\pi(K)$. Thus,

\begin{align*}
\pi(K) &= \int_{0}^{1} {N \choose K} p^K (1-p)^{N-K} \frac{1}{\beta(a, b)} p^{a-1} (1-p)^{b-1} \, dp \\
&= {N \choose K} \frac{\beta(K+a, N-K+b)}{\beta(a, b)} \\
&= {N \choose K} \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} \frac{\Gamma(K+a) \Gamma(N-K+b)}{\Gamma(N+a+b)}
\end{align*}

For certain values of $a$ and $b$, this will simplify much further, but not in general. Let's visualize this prior for $N = 100$.

[Figure: induced prior on $K$ for $N = 100$ (chunk discrete-Jeffreys)]
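
A sketch of how a plot like this could be produced, using the closed form above with $a = b = 1/2$:

N <- 100
a <- 0.5; b <- 0.5
K <- 0:N
log_piK <- lchoose(N, K) + lbeta(K + a, N - K + b) - lbeta(a, b)
plot(K, exp(log_piK), type = "h", xlab = "K", ylab = "prior probability")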

Whatever choice of $a$ and $b$ we make, creating the posterior from here is fairly straightforward; of course, this does not simplify in any nice ways like the $Binomial$ would.

Multivariate Hypergeometric

We will define the $Multivariate\ Hypergeometric$ distribution in the same fashion as before, except this time $\mathbf{K}$ and $\mathbf{J}$ are vectors.

Additionally, we need to find a prior $\pi(\mathbf{K})$. Suppose there are $c$ categories; following the same steps as the simple 1-dimensional example, we have:

\begin{align*}
\pi(\mathbf{K}) &= \int {N \choose {K_1, \dots, K_c}} \prod_{i = 1}^{c} p_i^{K_i} \, \frac{1}{\beta(\boldsymbol{\alpha})} \prod_{i = 1}^{c} p_i^{\alpha_i - 1} \, d\mathbf{p} \\
&= {N \choose {K_1, \dots, K_c}} \frac{\beta(\mathbf{K} + \boldsymbol{\alpha})}{\beta(\boldsymbol{\alpha})}
\end{align*}

where the integral is taken over the probability simplex and $\beta(\cdot)$ denotes the multivariate beta function.

From here, everything is fairly straightforward.
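
As a minimal base-R sketch (the lmnchoose() and lmvbeta() helpers used in the samplers below presumably compute the same pieces), we can evaluate this prior on the log scale for a hypothetical count vector with $N = 100$ and $\boldsymbol{\alpha} = (1/2, 1/2, 1/2)$:

# log multinomial coefficient N! / (K_1! ... K_c!) and log multivariate beta function
log_mn_choose <- function(N, K) lgamma(N + 1) - sum(lgamma(K + 1))
log_mv_beta   <- function(a) sum(lgamma(a)) - lgamma(sum(a))

log_prior_K <- function(K, alpha) {
  N <- sum(K)
  log_mn_choose(N, K) + log_mv_beta(K + alpha) - log_mv_beta(alpha)
}

log_prior_K(c(50, 30, 20), alpha = rep(0.5, 3))   # hypothetical K with N = 100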

Correction Factor

Returning to the distribution we are trying to sample from, we had

$$ f(\theta, \mathbf{Y} = \mathbf{y}| \mathbf{X} = \mathbf{x}) \propto \pi(\theta | \mathbf{X} = \mathbf{x}) f(\mathbf{Y} = \mathbf{y} | \theta). $$

Whether we use the $Binomial$ or the $Hypergeometric$ likelihood, the $f(\mathbf{Y} = \mathbf{y} | \theta)$ term is essential and cannot be forgotten; without it, the observed sample ends up essentially not mattering.

To illustrate, let $N = 10$. With such a small population, we can easily list every possible population. There are of course $2^{10}$ of them, since every subject in the population has two possible options. If we assume all possible populations are equally likely, we can visualize the probability distribution of $K$, the number of $1$'s in the population, with a simple bar plot.

[Figure: distribution of $K$ when all $2^{10}$ populations are equally likely (chunk correction-barplot)]

This means that without any prior knowledge, or even an observed sample, we would expect $K = 5$ to be $252$ times more likely than $K = 0$ or $K = 10$. This is clearly incorrect; if we know nothing about $K$, then the distribution over $K$ should be flat. The $f(\mathbf{Y} = \mathbf{y} | \theta)$ term fixes that problem; in the binary case it is simply the reciprocal of the binomial coefficient ${N \choose K}$.
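
A sketch reproducing the bar plot above: with $N = 10$ and all $2^{10}$ populations equally likely, the implied probability of each value of $K$ is ${10 \choose K} / 2^{10}$.

N <- 10
K <- 0:N
barplot(choose(N, K) / 2^N, names.arg = K,
        xlab = "K", ylab = "probability")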

This correction factor carries over to the multiclass method, but it now uses the multinomial coefficient.

Samplers

Here we will outline straightforward approaches to synthesis for all four possible distributions. There are a few things worth discussing further.

Should we make all populations include our sample? For example, if we observe a vector $x = (1, 0, 0, 0, 1, 0)$ and there are $c = 2$ classes, should all synthetic populations contain at least four $0$'s and two $1$'s? If we are using the $Hypergeometric$ distribution, this actually happens automatically, but if we use the $Binomial$ distribution, we have the choice of enforcing this criterion or not.

For the purpose of a more fair comparison, in this document we will not enforce that all populations contain the sample.

Binomial

Here is a very basic function that will jitter $\mathbf{Y}$, and use the posterior in terms of $p$ to determine whether to accept or reject synthesized populations.

popsim_binomial <- function(obs, N, samples = 1000, a = 0.5, b = 0.5) {
  # Correction factor 1 / choose(N, K): removes the multiplicity of population
  # vectors sharing the same K (see the Correction Factor section); K = N * p.
  logcorrection <- function(N, p) {
    -lchoose(N, N * p)
  }

  # Beta(a, b) prior on p.
  logprior <- function(p, a, b) {
    dbeta(p, a, b, log = TRUE)
  }

  # Log posterior (up to a constant): Binomial likelihood of the observed
  # sample, prior on p, and the correction factor.
  logpost <- function(p, J, n, a, b) {
    dbinom(J, n, p, log = TRUE) + logprior(p, a, b) + logcorrection(N, p)
  }

  J <- sum(obs)     # number of successes in the sample
  n <- length(obs)  # sample size

  # Initialize the chain by resampling 0/1 with the observed proportion of 1's.
  out <- matrix(NA, nrow = samples + 1, ncol = N)
  out[1, ] <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - mean(obs), mean(obs)))

  current_lp <- logpost(mean(out[1, ]), J, n, a, b)

  # Metropolis sampler: propose flipping each element of the population in turn,
  # accepting with probability min(1, exp(proposal_lp - current_lp)).
  for(i in 2:(samples + 1)) {
    current <- out[i - 1, ]
    for(k in 1:N) {
      proposal <- current
      proposal[k] <- as.integer(!current[k])
      proposal_lp <- logpost(mean(proposal), J, n, a, b)

      u <- runif(1)
      if(!is.na(proposal_lp) & log(u) <= proposal_lp - current_lp) {
        current <- proposal
        current_lp <- proposal_lp
      }
    }
    out[i, ] <- current
  }
  return(out[-1, ])
}
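
A quick usage sketch, with a hypothetical binary sample of size 10 and deliberately few iterations; the row sums of the returned matrix are the synthesized values of $K$:

set.seed(1)
obs  <- c(1, 0, 0, 0, 0, 0, 1, 1, 0, 1)   # hypothetical sample
pops <- popsim_binomial(obs, N = 100, samples = 500)
hist(rowSums(pops), breaks = 20, xlab = "K", main = "")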

Multinomial

Not much has to change here. Depending on how you code multiple categories, working with them can be quite difficult. The easiest set of categories to work with is $\{1, 2, \dots, c\}$. This simplifies things a lot, mainly because it allows us to use tabulate, which speeds up the process considerably. The code below will fail if you pass it a sample with either of the following properties:

- the categories are not coded as the consecutive positive integers $1, 2, \dots, c$;
- one or more of the $c$ categories does not appear in the sample at least once.

This may be confusing because in the binary cases we need our samples to be $\{0, 1\}$, but when working with more than two levels things can actually get quite complicated, and enforcing this simple criterion saves us a lot of headache. It is relatively simple to sandwich this method between two transformations to go from the original levels to positive integers, and then back to the original levels at the end, as sketched below.
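
For example, a sketch of the two transformations, using hypothetical character labels:

raw <- c("cat", "dog", "dog", "bird", "cat")   # hypothetical sample with character labels
lev <- sort(unique(raw))
obs <- match(raw, lev)    # original levels -> 1, 2, ..., c
obs                       # 2 3 3 1 2
# After synthesis, a matrix of integer codes can be mapped back to the labels:
# pop_labels <- matrix(lev[pops], nrow = nrow(pops))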

popsim_multinomial <- function(obs, N, samples = 1000, alpha = rep(0.5, length(unique(obs)))) {
  logcorrection <- function(N, K) {
    -lmnchoose(N, K)
  }

  logprior <- function(p, alpha) {
    ddirichlet(p, alpha, log = TRUE)
  }

  logpost <- function(K, N, J, n, alpha) {
    p <- K / N
    dmultinom(J, prob = p, log = TRUE) + logprior(p, alpha) + logcorrection(N, K)
  }

  J <- tabulate(obs)
  uniq <- sort(unique(obs))
  classes <- max(uniq)
  n <- length(obs)

  out <- matrix(NA, nrow = samples + 1, ncol = N)
  out[1, ] <- rep(obs, length.out = N)

  current_lp <- logpost(tabulate(out[1, ], nbins = classes), N, J, n, alpha)

  for(i in 2:(samples + 1)) {
    current <- out[i - 1, ]
    for(k in 1:N) {
      proposal <- current
      proposal[k] <- sample(uniq[-proposal[k]], 1)
      proposal_lp <- logpost(tabulate(proposal, nbins = classes), N, J, n, alpha)

      u <- runif(1)
      if(!is.na(proposal_lp) & log(u) <= proposal_lp - current_lp) {
        current <- proposal
        current_lp <- proposal_lp
      }
    }
    out[i, ] <- current
  }
  return(out[-1, ])
}

Hypergeometric

Here, we do the same thing we did for the $Binomial$ case, but instead use the posterior in terms of $K$ to determine whether to accept or reject synthesized populations.

popsim_hypergeometric <- function(obs, N, samples = 1000, a = 0.5, b = 0.5) {
  logcorrection <- function(N, K) {
    -lchoose(N, K)
  }

  logprior <- function(K, N, a, b) {
    lchoose(N, K) + lbeta(K + a, N - K + b) - lbeta(a, b)
  }

  logpost <- function(K, N, J, n, a, b) {
    dhyper(J, K, N - K, n, log = TRUE) + logprior(K, N, a, b) + logcorrection(N, K)
  }

  J <- sum(obs)
  n <- length(obs)

  out <- matrix(NA, nrow = samples + 1, ncol = N)
  out[1, ] <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - mean(obs), mean(obs)))

  current_lp <- logpost(sum(out[1, ]), N, J, n, a, b)

  for(i in 2:(samples + 1)) {
    current <- out[i - 1, ]
    for(k in 1:N) {
      proposal <- current
      proposal[k] <- as.integer(!current[k])
      proposal_lp <- logpost(sum(proposal), N, J, n, a, b)

      u <- runif(1)
      if(!is.na(proposal_lp) & log(u) <= proposal_lp - current_lp) {
        current <- proposal
        current_lp <- proposal_lp
      }
    }
    out[i, ] <- current
  }
  return(out[-1, ])
}

Multivariate Hypergeometric

This is also pretty straightforward; however, we do have to follow the rules we outlined for the $Multinomial$ method. The only difference between this and the $Multinomial$ method is that we use the posterior with respect to $K$ that accounts for sampling without replacement.

popsim_mvhypergeometric <- function(obs, N, samples = 1000, alpha = rep(0.5, length(unique(obs)))) {
  logcorrection <- function(N, K) {
    -lmnchoose(N, K)
  }

  logprior <- function(K, N, alpha) {
    lmnchoose(N, K) + lmvbeta(K + alpha) - lmvbeta(alpha)
  }

  logpost <- function(K, N, J, n, alpha) {
    dmvhyper(J, K, n, log = TRUE) + logprior(K, N, alpha) + logcorrection(N, K)
  }

  J <- tabulate(obs)
  uniq <- sort(unique(obs))
  classes <- max(uniq)
  n <- length(obs)

  out <- matrix(NA, nrow = samples + 1, ncol = N)
  out[1, ] <- rep(obs, length.out = N)


  current_lp <- logpost(tabulate(out[1, ], nbins = classes), N, J, n, alpha)

  for(i in 2:(samples + 1)) {
    current <- out[i - 1, ]
    for(k in 1:N) {
      proposal <- current
      proposal[k] <- sample(uniq[-proposal[k]], 1)
      proposal_lp <- logpost(tabulate(proposal, nbins = classes), N, J, n, alpha)

      u <- runif(1)
      if(!is.na(proposal_lp) & log(u) <= proposal_lp - current_lp) {
        current <- proposal
        current_lp <- proposal_lp
      }
    }
    out[i, ] <- current
  }
  return(out[-1, ])
}

Comparisons

Setup

The first thing we need is our samples. We are going to use the following:

obs_binary <- c(1, 0, 0, 0, 0, 0, 1, 1, 0, 1)   # 60% 0's / 40% 1's
obs_mc <- c(1, 2, 3, 1, 1, 2, 1, 2, 3, 1)       # 50% / 30% / 20% for categories 1 / 2 / 3

Location

For the binary methods, we can compare the posterior of how many $1$'s are in the populations to see if the two methods give similar results. Here are the densities for $N = 100$ (left) and $N = 1000$ (right), both using $3 \times 10^4$ synthetic populations. The $Binomial$ method is represented in red, and the $Hypergeometric$ in blue.

[Figure: posterior densities of $K$, Binomial (red) vs. Hypergeometric (blue), for $N = 100$ and $N = 1000$ (chunk denscomp_binary)]
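
A sketch of how such a comparison could be produced for $N = 100$ (using far fewer iterations than the figure, to keep the run time short):

pops_bin <- popsim_binomial(obs_binary, N = 100, samples = 2000)
pops_hyp <- popsim_hypergeometric(obs_binary, N = 100, samples = 2000)
plot(density(rowSums(pops_bin)), col = "red", xlab = "K", main = "")
lines(density(rowSums(pops_hyp)), col = "blue")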

For the multiclass methods, we can instead look at a specific margin, e.g. the margin for category $1$. Here are the densities for $N = 100$ (left) and $N = 1000$ (right), both using 5000 synthetic populations. The $Multinomial$ method is represented in red, and the $Multivariate\ Hypergeometric$ in blue.

[Figure: posterior densities for the category-1 margin, Multinomial (red) vs. MV Hypergeometric (blue), for $N = 100$ and $N = 1000$ (chunk denscomp_mc)]

Time

Likewise, we can look at how long each method takes. None of these functions is optimized in any way, though they are written to be as similar as possible so that the comparisons are meaningful. The total time reflects creating $3 \times 10^4$ synthetic populations for the binary methods, and 5000 synthetic populations for the multiclass methods.
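
The tables below were presumably produced with proftools (loaded at the top). A minimal sketch of the same idea using base R's sampling profiler:

tmp <- tempfile()
Rprof(tmp)
invisible(popsim_binomial(obs_binary, N = 100, samples = 2000))
Rprof(NULL)
head(summaryRprof(tmp)$by.total)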

Profile for Binomial, N = 100 and N = 1000

N = 100:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_binomial | 80.24 | 86.48 | 26.76 | 28.84 |
| logpost | 53.48 | 57.64 | 22.08 | 23.80 |
| dbinom | 22.43 | 24.18 | 13.55 | 14.60 |
| <Other> | 19.76 | 21.30 | 19.76 | 21.30 |
| logprior | 8.96 | 9.66 | 8.96 | 9.66 |
| mean | 8.89 | 9.58 | 7.24 | 7.80 |
| mean.default | 1.65 | 1.78 | 1.65 | 1.78 |

N = 1000:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_binomial | 79.57 | 791.66 | 23.24 | 231.18 |
| logpost | 56.33 | 560.48 | 20.13 | 200.24 |
| dbinom | 27.58 | 274.38 | 12.70 | 126.34 |
| <Other> | 20.43 | 203.28 | 20.43 | 203.28 |
| mean | 14.88 | 148.04 | 13.23 | 131.62 |
| logprior | 8.63 | 85.86 | 8.63 | 85.86 |
| mean.default | 1.65 | 16.42 | 1.65 | 16.42 |
Profile for Hypergeometric, N = 100 and N = 1000

N = 100:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_hypergeometric | 75.72 | 54.44 | 34.52 | 24.82 |
| logpost | 41.20 | 29.62 | 37.41 | 26.90 |
| <Other> | 24.28 | 17.46 | 24.28 | 17.46 |
| dhyper | 3.78 | 2.72 | 3.78 | 2.72 |

N = 1000:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_hypergeometric | 72.44 | 531.20 | 30.27 | 221.96 |
| logpost | 42.17 | 309.24 | 34.17 | 250.54 |
| <Other> | 27.56 | 202.10 | 27.56 | 202.10 |
| dhyper | 8.00 | 58.70 | 8.00 | 58.70 |
Profile for Multinomial, N = 100 and N = 1000

N = 100:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_multinomial | 99.95 | 10024.52 | 99.23 | 9952.64 |
| logpost | 0.66 | 66.38 | 0.12 | 11.66 |
| logcorrection | 0.38 | 37.84 | 0.05 | 5.32 |
| lmnchoose | 0.32 | 32.52 | 0.04 | 4.02 |
| sapply | 0.28 | 28.50 | 0.10 | 10.02 |
| logprior | 0.13 | 13.12 | 0.04 | 4.08 |
| simplify2array | 0.10 | 9.76 | 0.06 | 5.68 |
| ddirichlet | 0.09 | 9.04 | 0.09 | 8.92 |
| lapply | 0.08 | 8.30 | 0.07 | 6.96 |
| sample | 0.05 | 5.50 | 0.05 | 4.76 |

N = 1000:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_multinomial | 95.16 | 1081.04 | 8.25 | 93.70 |
| logpost | 83.24 | 945.64 | 9.06 | 102.90 |
| logcorrection | 57.59 | 654.22 | 19.04 | 216.32 |
| lmnchoose | 38.54 | 437.90 | 3.57 | 40.52 |
| sapply | 34.98 | 397.38 | 6.47 | 73.52 |
| lapply | 21.62 | 245.58 | 20.10 | 228.36 |
| logprior | 10.45 | 118.68 | 3.09 | 35.08 |
| ddirichlet | 7.36 | 83.60 | 7.21 | 81.94 |
| simplify2array | 6.50 | 73.84 | 3.82 | 43.44 |
| <Other> | 4.84 | 55.04 | 4.84 | 55.04 |
Profile for Multivariate Hypergeometric, N = 100 and N = 1000

N = 100:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_mvhypergeometric | 94.58 | 104.62 | 8.35 | 9.24 |
| logpost | 81.96 | 90.66 | 6.64 | 7.34 |
| lmnchoose | 49.39 | 54.64 | 6.89 | 7.62 |
| sapply | 42.51 | 47.02 | 14.97 | 16.56 |
| logprior | 33.70 | 37.28 | 7.23 | 8.00 |
| logcorrection | 28.02 | 31.00 | 3.98 | 4.40 |
| simplify2array | 14.23 | 15.74 | 7.97 | 8.82 |
| dmvhyper | 13.60 | 15.04 | 12.33 | 13.64 |
| lapply | 12.51 | 13.84 | 10.34 | 11.44 |
| unique | 6.26 | 6.92 | 4.99 | 5.52 |

N = 1000:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_mvhypergeometric | 96.90 | 1665.80 | 4.48 | 77.02 |
| logpost | 89.95 | 1546.32 | 3.60 | 61.90 |
| lmnchoose | 49.52 | 851.36 | 4.53 | 77.90 |
| sapply | 44.99 | 773.46 | 8.21 | 141.20 |
| logprior | 40.34 | 693.40 | 14.33 | 246.40 |
| logcorrection | 36.64 | 629.94 | 12.43 | 213.64 |
| lapply | 27.95 | 480.46 | 25.91 | 445.34 |
| dmvhyper | 9.37 | 161.08 | 6.91 | 118.72 |
| simplify2array | 8.32 | 143.10 | 4.72 | 81.08 |
| unique | 3.61 | 62.02 | 2.77 | 47.56 |

Coverage

We can also consider the empirical coverage rates of credible intervals. We will only include the binary methods here, because the multiclass methods take far too long using the simple formulations above. They can of course be optimized further; for large $N$ it can be beneficial to update the count table during the iterations instead of recalculating it each time. We haven't done so here because we wanted the simplest formulations. For these coverage rates, we use only $N = 100$ and $n = 10$, and we create 100 synthetic populations for each of the 1000 samples per value of $K$.
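
A sketch of the coverage calculation for a single true value of $K$, with far fewer replications than the full experiment; the nominal interval level is assumed here to be 90%, since it is not stated above:

set.seed(1)
N <- 100; n <- 10; K_true <- 30
pop <- c(rep(1, K_true), rep(0, N - K_true))
covered <- replicate(50, {
  obs  <- sample(pop, n)                               # simple random sample, no replacement
  pops <- popsim_hypergeometric(obs, N, samples = 100)
  ci   <- quantile(rowSums(pops), c(0.05, 0.95))       # equal-tailed 90% credible interval
  ci[1] <= K_true && K_true <= ci[2]
})
mean(covered)   # empirical coverage rate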

Below we see the empirical coverage rates for equal-tailed credible intervals capturing the true $K$, the number of $1$'s in the population. Both methods do quite well with $K$ close to $N/2$; however, the $Binomial$ approximation does quite poorly compared to the $Hypergeometric$ method as we approach either tail.

Empirical coverage rates (%) by true value of $K$:

| Method | 3 | 10 | 17 | 23 | 30 | 37 | 43 | 50 | 57 | 63 | 70 | 77 | 83 | 90 | 97 |
|--------|---|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| Binomial | 21.2 | 61.0 | 80.6 | 88.0 | 88.0 | 90.6 | 89.2 | 89.2 | 88.1 | 88.1 | 86.1 | 84.8 | 79.7 | 60.2 | 23.2 |
| Hypergeometric | 95.9 | 88.9 | 83.9 | 87.1 | 87.5 | 88.6 | 88.5 | 87.0 | 86.3 | 87.0 | 83.4 | 83.4 | 82.6 | 86.8 | 95.1 |

