library(lemur.pack)
library(extraDistr)
library(knitr)
library(kableExtra)
library(proftools)
library(dplyr)
library(magrittr)

This vignette explains population synthesis for the discrete cases; in particular, we will talk about the situations where subjects are described by one of the following:

1. A single binary characteristic, so each subject is coded as a $0$ or a $1$.
2. A single categorical characteristic with three or more levels.

In both cases, the only piece of information used to inform what the population looks like is a simple random sample of size $n$, which can be thought of as an $n$-vector $\mathbf{x}$, which is a realization of $\mathbf{X}$. Our goal is to synthesize numerous realizations of the population $\mathbf{Y}$ with size $N$. A realization of the population is denoted $\mathbf{y}$, and we will assume that $N$ is known.

Theory

For a population $\mathbf{Y}$, sample $\mathbf{X}$, and parameter set $\theta$, we can use Bayes' theorem to get the distribution for $\mathbf{Y}$, conditioned on our observed sample, $\mathbf{x}$.

\begin{align}
f(\theta, \mathbf{Y} = \mathbf{y} | \mathbf{X} = \mathbf{x}) &= \frac{ f(\mathbf{X} = \mathbf{x} | \theta, \mathbf{Y} = \mathbf{y}) f(\theta, \mathbf{Y} = \mathbf{y}) }{ f(\mathbf{X} = \mathbf{x}) } \\
&\propto f(\mathbf{X} = \mathbf{x} | \theta) f(\mathbf{Y} = \mathbf{y} | \theta) \pi(\theta)
\end{align}

For the univariate discrete case, this is actually quite simple; however, we do have a couple of decisions to make:

1. Which distribution to use for the likelihood $f(\mathbf{X} = \mathbf{x} | \theta)$.
2. Which prior $\pi(\theta)$ to pair with it.

The first question boils down to whether we believe our observed sample came from sampling with or without replacement. Usually, populations of people are sampled without replacement, in which case the proper likelihood would be the $Hypergeometric$ or its multivariate extension, the $Multivariate\ Hypergeometric$. The word multivariate may be confusing here; oddly, the extension of the $Hypergeometric$ distribution to more than two categories is called the $Multivariate\ Hypergeometric$, even though it is still technically univariate.

If the population is sufficiently large, or we know it was sampled with replacement, we may choose to use a $Binomial$ or $Multinomial$ likelihood, instead. Sampling with replacement results in independent samples; sampling without replacement yields samples that are not independent. This document will not go into detail about the difference between these two, but you should be aware that extreme samples (e.g., a sample very close to being all $1$'s or all $0$'s) are more likely when sampling with replacement. In essence, this results in our distribution over $K$ (the number of successes in the population) having fatter tails when we use the $Binomial$ likelihood versus the $Hypergeometric$ likelihood. The table below summarizes which distribution we will use for our likelihood.

| Case | Independence/Replacement: Yes | No |
|------|-------------------------------|----|
| 1 | Binomial | Hypergeometric |
| 2 | Multinomial | MV Hypergeometric |
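
As a quick illustration of the fatter-tails point (not part of the vignette's analysis), we can compare the probability of an extreme sample under the two likelihoods; here we assume a population of $N = 100$ containing $K = 50$ successes and a sample of size $n = 10$:

# Probability that a sample of size 10 is all 1's when half the population is 1's
dbinom(10, size = 10, prob = 0.5)    # with replacement (Binomial): ~9.8e-04
dhyper(10, m = 50, n = 50, k = 10)   # without replacement (Hypergeometric): ~5.9e-04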

Of course, the choice of prior $\pi(\theta)$ depends on the likelihood we choose. This document will focus on the most likely choices: if we are using a $Binomial$ or $Multinomial$ likelihood, we will probably use some form of the conjugate prior, $Beta(a, b)$ or $Dirichlet(\alpha)$, respectively. In particular, the most common choices for the $Binomial$ case are $Beta(1, 1)$, the continuous uniform, and $Beta(1/2, 1/2)$, the Jeffreys prior. Likewise, these have direct extensions in $Dirichlet(1, \dots, 1)$ and $Dirichlet(1/2, \dots, 1/2)$ for the $Multinomial$ case.

As we will see, the multi-category extensions of these distributions do not change much. Thus, we will start with the binary versions and then discuss what changes when we move to three or more categories.

Binomial

Things are very simple if we choose to use a $Binomial$ likelihood. If we apply the above theory to the binomial, we get

\begin{align}
f(\mathbf{Y} = \mathbf{y}, p | \mathbf{X} = \mathbf{x}) &= \frac{ f(\mathbf{X} = \mathbf{x} | \mathbf{Y} = \mathbf{y}, p) f(\mathbf{Y} = \mathbf{y}, p) }{ f(\mathbf{X} = \mathbf{x}) } \\
&\propto f(\mathbf{X} = \mathbf{x} | p ) f(\mathbf{Y} = \mathbf{y} | p) \pi(p) \\
&\propto \pi(p | \mathbf{X} = \mathbf{x}) f(\mathbf{Y} = \mathbf{y} | p)
\end{align}

Using the conjugate $Beta(a, b)$ prior gives us the canonical $Beta(J + a, n - J + b)$ posterior for $p$, where $J$ is the number of successes in the sample and $n$ is the sample size. As we will see, it is fairly straightforward to create an MCMC algorithm to jitter $\mathbf{Y}$ and accept or reject using this posterior.
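
As a minimal sketch (assuming a hypothetical sample with $J = 4$ successes out of $n = 10$, and the Jeffreys prior $a = b = 1/2$), the posterior for $p$ can be evaluated directly:

J <- 4; n <- 10        # hypothetical sample: 4 successes out of 10
a <- 0.5; b <- 0.5     # Jeffreys prior
curve(dbeta(x, J + a, n - J + b), from = 0, to = 1,
      xlab = "p", ylab = "posterior density")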

Multinomial

If we are instead using the $Multinomial$ likelihood, not much changes. Notably, $p$ and $K$ are now vectors of length $c$, where $c$ is the number of categories, and our prior will likely be $Dirichlet(\alpha)$, the conjugate prior for the $Multinomial$ distribution. The new posterior with the conjugate prior is $Dirichlet(J_1 + \alpha_1, \dots, J_c + \alpha_c)$, where $J_i$ is the number of sample observations in category $i$.
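
A small sketch of this posterior, assuming a hypothetical sample with category counts $J = (5, 3, 2)$ and the prior $Dirichlet(1/2, 1/2, 1/2)$; rdirichlet() is from extraDistr, loaded above:

J <- c(5, 3, 2)             # hypothetical sample counts for categories 1, 2, 3
alpha <- rep(0.5, 3)        # Dirichlet(1/2, 1/2, 1/2) prior
p_draws <- rdirichlet(1000, J + alpha)
colMeans(p_draws)           # posterior means of the category probabilities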

Hypergeometric

The common parameterization of the $Hypergeometric$ distribution can make this rather confusing. Instead of working with that, let's define our own.

Let $f(\mathbf{Y}, K, \mathbf{X}, J)$ denote the $Hypergeometric$ probability mass function, where:

- $\mathbf{Y}$ is the population of size $N$, and $K$ is the number of successes ($1$'s) it contains;
- $\mathbf{X}$ is the sample of size $n$, and $J$ is the number of successes it contains.

Now, we can apply the above theory to our specific distribution, and we get

\begin{align}
f(\mathbf{Y} = \mathbf{y}, K | \mathbf{X} = \mathbf{x}, J) &= \frac{ f(\mathbf{X} = \mathbf{x}, J | \mathbf{Y} = \mathbf{y}, K) f(\mathbf{Y} = \mathbf{y}, K) }{ f(\mathbf{X} = \mathbf{x}, J) } \\
&\propto f(\mathbf{X} = \mathbf{x}, J | K ) f(\mathbf{Y} = \mathbf{y} | K) \pi(K) \\
&\propto \pi(K | \mathbf{X} = \mathbf{x}, J) f(\mathbf{Y} = \mathbf{y} | K)
\end{align}

We also need to choose $\pi(K)$, the prior on the number of successes in the population. If we were working with $p$ from the $Binomial$ distribution, as above, we would likely want some version of the conjugate prior, $Beta(a, b)$; here we need an analogous prior placed directly on $K$.

From Jeffreys (1946, 1961), the induced prior on $K$ can be found by constructing a hierarchical prior of the form

$$ Binom(K | p) Beta(p | a, b) $$

and then marginalizing to find $\pi(K)$. Thus,

\begin{align*}
\pi(K) &= \int_{0}^{1} {N \choose K} p^K (1-p)^{N-K} \frac{1}{\beta(a, b)} p^{a-1} (1-p)^{b-1} \, dp \\
&= {N \choose K} \frac{\beta(K+a, N-K+b)}{\beta(a, b)} \\
&= {N \choose K} \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} \frac{\Gamma(K+a) \Gamma(N-K+b)}{\Gamma(N+a+b)}
\end{align*}

For certain values of $a$ and $b$, this will simplify much further, but not in general. Let's visualize this prior for $N = 100$.

[Figure: induced prior on $K$ for $N = 100$ (chunk discrete-Jeffreys)]
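
A sketch of how a plot like this could be produced, using the closed form above with $a = b = 1/2$:

N <- 100
a <- 0.5; b <- 0.5
K <- 0:N
log_piK <- lchoose(N, K) + lbeta(K + a, N - K + b) - lbeta(a, b)
plot(K, exp(log_piK), type = "h", xlab = "K", ylab = "prior probability")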

Whatever choice of $a$ and $b$ we make, creating the posterior from here is fairly straightforward; of course, this does not simplify in any nice ways like the $Binomial$ would.

Multivariate Hypergeometric

We will define the $Multivariate\ Hypergeometric$ distribution in the same fashion as before, except this time $\mathbf{K}$ and $\mathbf{J}$ are vectors.

Additionally, we need to find a prior $\pi(\mathbf{K})$. Suppose there are $c$ categories; following the same steps as the simple 1-dimensional example, we have:

\begin{align*}
\pi(\mathbf{K}) &= \int {N \choose {K_1, \dots, K_c}} \prod_{i = 1}^{c} p_i^{K_i} \, \frac{1}{\beta(\boldsymbol{\alpha})} \prod_{i = 1}^{c} p_i^{\alpha_i - 1} \, d\mathbf{p} \\
&= {N \choose {K_1, \dots, K_c}} \frac{\beta(\mathbf{K} + \boldsymbol{\alpha})}{\beta(\boldsymbol{\alpha})}
\end{align*}

where the integral is taken over the probability simplex and $\beta(\cdot)$ denotes the multivariate beta function.

From here, everything is fairly straightforward.
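
As a minimal base-R sketch (the lmnchoose() and lmvbeta() helpers used in the samplers below presumably compute the same pieces), we can evaluate this prior on the log scale for a hypothetical count vector with $N = 100$ and $\boldsymbol{\alpha} = (1/2, 1/2, 1/2)$:

# log multinomial coefficient N! / (K_1! ... K_c!) and log multivariate beta function
log_mn_choose <- function(N, K) lgamma(N + 1) - sum(lgamma(K + 1))
log_mv_beta   <- function(a) sum(lgamma(a)) - lgamma(sum(a))

log_prior_K <- function(K, alpha) {
  N <- sum(K)
  log_mn_choose(N, K) + log_mv_beta(K + alpha) - log_mv_beta(alpha)
}

log_prior_K(c(50, 30, 20), alpha = rep(0.5, 3))   # hypothetical K with N = 100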

Correction Factor

Returning to the distribution we are trying to sample from, we had

$$ f(\theta, \mathbf{Y} = \mathbf{y}| \mathbf{X} = \mathbf{x}) \propto \pi(\theta | \mathbf{X} = \mathbf{x}) f(\mathbf{Y} = \mathbf{y} | \theta). $$

Whether we use the $Binomial$ or the $Hypergeometric$ likelihood, the $f(\mathbf{Y} = \mathbf{y} | \theta)$ term is essential and cannot be forgotten; without it, the observed sample ends up essentially not mattering.

To illustrate, let $N = 10$. With such a small population, we can easily list every possible population. There are of course $2^{10}$ of them, since every subject in the population has two possible options. If we assume all possible populations are equally likely, we can visualize the probability distribution of $K$, the number of $1$'s in the population, with a simple bar plot.

[Figure: distribution of $K$ when all $2^{10}$ populations are equally likely (chunk correction-barplot)]

This means that without any prior knowledge, or even an observed sample, we would expect $K = 5$ to be $252$ times more likely than $K = 0$ or $K = 10$. This is clearly incorrect; if we know nothing about $K$, then the distribution over $K$ should be flat. The $f(\mathbf{Y} = \mathbf{y} | \theta)$ term fixes that problem; in the binary case it is simply the reciprocal of the binomial coefficient ${N \choose K}$.
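
A sketch reproducing the bar plot above: with $N = 10$ and all $2^{10}$ populations equally likely, the implied probability of each value of $K$ is ${10 \choose K} / 2^{10}$.

N <- 10
K <- 0:N
barplot(choose(N, K) / 2^N, names.arg = K,
        xlab = "K", ylab = "probability")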

This correction factor carries over to the multiclass method, but it now uses the multinomial coefficient.

Samplers

Here we will outline straightforward approaches to synthesis for all four possible distributions. There are a few things worth discussing further.

Should we make all populations include our sample? For example, if we observe a vector $x = (1, 0, 0, 0, 1, 0)$ and there are $c = 2$ classes, should all synthetic populations contain at least four $0$'s and two $1$'s? If we are using the $Hypergeometric$ distribution, this actually happens automatically, but if we use the $Binomial$ distribution, we have the choice of enforcing this criterion or not.

For the purpose of a more fair comparison, in this document we will not enforce that all populations contain the sample.

Binomial

Here is a very basic function that will jitter $\mathbf{Y}$, and use the posterior in terms of $p$ to determine whether to accept or reject synthesized populations.

popsim_binomial <- function(obs, N, samples = 1000, a = 0.5, b = 0.5) {
  # Correction factor 1 / choose(N, K): removes the multiplicity of population
  # vectors sharing the same K (see the Correction Factor section); K = N * p.
  logcorrection <- function(N, p) {
    -lchoose(N, N * p)
  }

  # Beta(a, b) prior on p.
  logprior <- function(p, a, b) {
    dbeta(p, a, b, log = TRUE)
  }

  # Log posterior (up to a constant): Binomial likelihood of the observed
  # sample, prior on p, and the correction factor.
  logpost <- function(p, J, n, a, b) {
    dbinom(J, n, p, log = TRUE) + logprior(p, a, b) + logcorrection(N, p)
  }

  J <- sum(obs)     # number of successes in the sample
  n <- length(obs)  # sample size

  # Initialize the chain by resampling 0/1 with the observed proportion of 1's.
  out <- matrix(NA, nrow = samples + 1, ncol = N)
  out[1, ] <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - mean(obs), mean(obs)))

  current_lp <- logpost(mean(out[1, ]), J, n, a, b)

  # Metropolis sampler: propose flipping each element of the population in turn,
  # accepting with probability min(1, exp(proposal_lp - current_lp)).
  for(i in 2:(samples + 1)) {
    current <- out[i - 1, ]
    for(k in 1:N) {
      proposal <- current
      proposal[k] <- as.integer(!current[k])
      proposal_lp <- logpost(mean(proposal), J, n, a, b)

      u <- runif(1)
      if(!is.na(proposal_lp) & log(u) <= proposal_lp - current_lp) {
        current <- proposal
        current_lp <- proposal_lp
      }
    }
    out[i, ] <- current
  }
  return(out[-1, ])
}
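
A quick usage sketch, with a hypothetical binary sample of size 10 and deliberately few iterations; the row sums of the returned matrix are the synthesized values of $K$:

set.seed(1)
obs  <- c(1, 0, 0, 0, 0, 0, 1, 1, 0, 1)   # hypothetical sample
pops <- popsim_binomial(obs, N = 100, samples = 500)
hist(rowSums(pops), breaks = 20, xlab = "K", main = "")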

Multinomial

Not much has to change here. Depending on how you code multiple categories, working with them can be quite difficult. The easiest set of categories to work with is $\{1, 2, \dots, c\}$. This simplifies things a lot, mainly because it allows us to use tabulate, which speeds up the process considerably. The code below will fail if you pass it a sample with either of the following properties:

- the categories are not coded as the consecutive positive integers $1, 2, \dots, c$;
- one or more of the $c$ categories does not appear in the sample at least once.

This may be confusing because in the binary cases we need our samples to be $\{0, 1\}$, but when working with more than two levels things can actually get quite complicated, and enforcing this simple criterion saves us a lot of headache. It is relatively simple to sandwich this method between two transformations to go from the original levels to positive integers, and then back to the original levels at the end, as sketched below.
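
For example, a sketch of the two transformations, using hypothetical character labels:

raw <- c("cat", "dog", "dog", "bird", "cat")   # hypothetical sample with character labels
lev <- sort(unique(raw))
obs <- match(raw, lev)    # original levels -> 1, 2, ..., c
obs                       # 2 3 3 1 2
# After synthesis, a matrix of integer codes can be mapped back to the labels:
# pop_labels <- matrix(lev[pops], nrow = nrow(pops))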

popsim_multinomial <- function(obs, N, samples = 1000, alpha = rep(0.5, length(unique(obs)))) {
  logcorrection <- function(N, K) {
    -lmnchoose(N, K)
  }

  logprior <- function(p, alpha) {
    ddirichlet(p, alpha, log = TRUE)
  }

  logpost <- function(K, N, J, n, alpha) {
    p <- K / N
    dmultinom(J, prob = p, log = TRUE) + logprior(p, alpha) + logcorrection(N, K)
  }

  J <- tabulate(obs)
  uniq <- sort(unique(obs))
  classes <- max(uniq)
  n <- length(obs)

  out <- matrix(NA, nrow = samples + 1, ncol = N)
  out[1, ] <- rep(obs, length.out = N)

  current_lp <- logpost(tabulate(out[1, ], nbins = classes), N, J, n, alpha)

  for(i in 2:(samples + 1)) {
    current <- out[i - 1, ]
    for(k in 1:N) {
      proposal <- current
      proposal[k] <- sample(uniq[-proposal[k]], 1)
      proposal_lp <- logpost(tabulate(proposal, nbins = classes), N, J, n, alpha)

      u <- runif(1)
      if(!is.na(proposal_lp) & log(u) <= proposal_lp - current_lp) {
        current <- proposal
        current_lp <- proposal_lp
      }
    }
    out[i, ] <- current
  }
  return(out[-1, ])
}

Hypergeometric

Here, we do the same thing we did for the $Binomial$ case, but instead use the posterior in terms of $K$ to determine whether to accept or reject synthesized populations.

popsim_hypergeometric <- function(obs, N, samples = 1000, a = 0.5, b = 0.5) {
  logcorrection <- function(N, K) {
    -lchoose(N, K)
  }

  logprior <- function(K, N, a, b) {
    lchoose(N, K) + lbeta(K + a, N - K + b) - lbeta(a, b)
  }

  logpost <- function(K, N, J, n, a, b) {
    dhyper(J, K, N - K, n, log = TRUE) + logprior(K, N, a, b) + logcorrection(N, K)
  }

  J <- sum(obs)
  n <- length(obs)

  out <- matrix(NA, nrow = samples + 1, ncol = N)
  out[1, ] <- sample(c(0, 1), size = N, replace = TRUE, prob = c(1 - mean(obs), mean(obs)))

  current_lp <- logpost(sum(out[1, ]), N, J, n, a, b)

  for(i in 2:(samples + 1)) {
    current <- out[i - 1, ]
    for(k in 1:N) {
      proposal <- current
      proposal[k] <- as.integer(!current[k])
      proposal_lp <- logpost(sum(proposal), N, J, n, a, b)

      u <- runif(1)
      if(!is.na(proposal_lp) & log(u) <= proposal_lp - current_lp) {
        current <- proposal
        current_lp <- proposal_lp
      }
    }
    out[i, ] <- current
  }
  return(out[-1, ])
}

Multivariate Hypergeometric

This is also pretty straightforward; however, we do have to follow the rules we outlined for the $Multinomial$ method. The only difference between this and the $Multinomial$ method is that we use the posterior with respect to $K$ that accounts for sampling without replacement.

popsim_mvhypergeometric <- function(obs, N, samples = 1000, alpha = rep(0.5, length(unique(obs)))) {
  logcorrection <- function(N, K) {
    -lmnchoose(N, K)
  }

  logprior <- function(K, N, alpha) {
    lmnchoose(N, K) + lmvbeta(K + alpha) - lmvbeta(alpha)
  }

  logpost <- function(K, N, J, n, alpha) {
    dmvhyper(J, K, n, log = TRUE) + logprior(K, N, alpha) + logcorrection(N, K)
  }

  J <- tabulate(obs)
  uniq <- sort(unique(obs))
  classes <- max(uniq)
  n <- length(obs)

  out <- matrix(NA, nrow = samples + 1, ncol = N)
  out[1, ] <- rep(obs, length.out = N)


  current_lp <- logpost(tabulate(out[1, ], nbins = classes), N, J, n, alpha)

  for(i in 2:(samples + 1)) {
    current <- out[i - 1, ]
    for(k in 1:N) {
      proposal <- current
      proposal[k] <- sample(uniq[-proposal[k]], 1)
      proposal_lp <- logpost(tabulate(proposal, nbins = classes), N, J, n, alpha)

      u <- runif(1)
      if(!is.na(proposal_lp) & log(u) <= proposal_lp - current_lp) {
        current <- proposal
        current_lp <- proposal_lp
      }
    }
    out[i, ] <- current
  }
  return(out[-1, ])
}

Comparisons

Setup

The first thing we need is our samples. We are going to use the following:

obs_binary <- c(1, 0, 0, 0, 0, 0, 1, 1, 0, 1)   # 60% 0's / 40% 1's
obs_mc <- c(1, 2, 3, 1, 1, 2, 1, 2, 3, 1)       # 50% / 30% / 20% for categories 1 / 2 / 3

Location

For the binary methods, we can compare the posterior of how many $1$'s are in the populations to see if the two methods give similar results. Here are the densities for $N = 100$ (left) and $N = 1000$ (right), both using $3 \times 10^4$ synthetic populations. The $Binomial$ method is represented in red, and the $Hypergeometric$ in blue.

[Figure: posterior densities of $K$, Binomial (red) vs. Hypergeometric (blue), for $N = 100$ and $N = 1000$ (chunk denscomp_binary)]
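
A sketch of how such a comparison could be produced for $N = 100$ (using far fewer iterations than the figure, to keep the run time short):

pops_bin <- popsim_binomial(obs_binary, N = 100, samples = 2000)
pops_hyp <- popsim_hypergeometric(obs_binary, N = 100, samples = 2000)
plot(density(rowSums(pops_bin)), col = "red", xlab = "K", main = "")
lines(density(rowSums(pops_hyp)), col = "blue")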

For the multiclass methods, we can instead look at a specific margin, e.g. the margin for category $1$. Here are the densities for $N = 100$ (left) and $N = 1000$ (right), both using 5000 synthetic populations. The $Multinomial$ method is represented in red, and the $Multivariate\ Hypergeometric$ in blue.

[Figure: posterior densities for the category-1 margin, Multinomial (red) vs. MV Hypergeometric (blue), for $N = 100$ and $N = 1000$ (chunk denscomp_mc)]

Time

Likewise, we can look at how long each method takes. None of these functions is optimized in any way, though they are written to be as similar as possible so that the comparisons are meaningful. The total time reflects creating $3 \times 10^4$ synthetic populations for the binary methods, and 5000 synthetic populations for the multiclass methods.
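
The tables below were presumably produced with proftools (loaded at the top). A minimal sketch of the same idea using base R's sampling profiler:

tmp <- tempfile()
Rprof(tmp)
invisible(popsim_binomial(obs_binary, N = 100, samples = 2000))
Rprof(NULL)
head(summaryRprof(tmp)$by.total)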

Profile for Binomial, N = 100 and N = 1000

N = 100:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_binomial | 80.24 | 86.48 | 26.76 | 28.84 |
| logpost | 53.48 | 57.64 | 22.08 | 23.80 |
| dbinom | 22.43 | 24.18 | 13.55 | 14.60 |
| <Other> | 19.76 | 21.30 | 19.76 | 21.30 |
| logprior | 8.96 | 9.66 | 8.96 | 9.66 |
| mean | 8.89 | 9.58 | 7.24 | 7.80 |
| mean.default | 1.65 | 1.78 | 1.65 | 1.78 |

N = 1000:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_binomial | 79.57 | 791.66 | 23.24 | 231.18 |
| logpost | 56.33 | 560.48 | 20.13 | 200.24 |
| dbinom | 27.58 | 274.38 | 12.70 | 126.34 |
| <Other> | 20.43 | 203.28 | 20.43 | 203.28 |
| mean | 14.88 | 148.04 | 13.23 | 131.62 |
| logprior | 8.63 | 85.86 | 8.63 | 85.86 |
| mean.default | 1.65 | 16.42 | 1.65 | 16.42 |
Profile for Hypergeometric, N = 100 and N = 1000

N = 100:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_hypergeometric | 75.72 | 54.44 | 34.52 | 24.82 |
| logpost | 41.20 | 29.62 | 37.41 | 26.90 |
| <Other> | 24.28 | 17.46 | 24.28 | 17.46 |
| dhyper | 3.78 | 2.72 | 3.78 | 2.72 |

N = 1000:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_hypergeometric | 72.44 | 531.20 | 30.27 | 221.96 |
| logpost | 42.17 | 309.24 | 34.17 | 250.54 |
| <Other> | 27.56 | 202.10 | 27.56 | 202.10 |
| dhyper | 8.00 | 58.70 | 8.00 | 58.70 |
Profile for Multinomial, N = 100 and N = 1000

N = 100:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_multinomial | 99.95 | 10024.52 | 99.23 | 9952.64 |
| logpost | 0.66 | 66.38 | 0.12 | 11.66 |
| logcorrection | 0.38 | 37.84 | 0.05 | 5.32 |
| lmnchoose | 0.32 | 32.52 | 0.04 | 4.02 |
| sapply | 0.28 | 28.50 | 0.10 | 10.02 |
| logprior | 0.13 | 13.12 | 0.04 | 4.08 |
| simplify2array | 0.10 | 9.76 | 0.06 | 5.68 |
| ddirichlet | 0.09 | 9.04 | 0.09 | 8.92 |
| lapply | 0.08 | 8.30 | 0.07 | 6.96 |
| sample | 0.05 | 5.50 | 0.05 | 4.76 |

N = 1000:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_multinomial | 95.16 | 1081.04 | 8.25 | 93.70 |
| logpost | 83.24 | 945.64 | 9.06 | 102.90 |
| logcorrection | 57.59 | 654.22 | 19.04 | 216.32 |
| lmnchoose | 38.54 | 437.90 | 3.57 | 40.52 |
| sapply | 34.98 | 397.38 | 6.47 | 73.52 |
| lapply | 21.62 | 245.58 | 20.10 | 228.36 |
| logprior | 10.45 | 118.68 | 3.09 | 35.08 |
| ddirichlet | 7.36 | 83.60 | 7.21 | 81.94 |
| simplify2array | 6.50 | 73.84 | 3.82 | 43.44 |
| <Other> | 4.84 | 55.04 | 4.84 | 55.04 |
Profile for Multivariate Hypergeometric, N = 100 and N = 1000

N = 100:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_mvhypergeometric | 94.58 | 104.62 | 8.35 | 9.24 |
| logpost | 81.96 | 90.66 | 6.64 | 7.34 |
| lmnchoose | 49.39 | 54.64 | 6.89 | 7.62 |
| sapply | 42.51 | 47.02 | 14.97 | 16.56 |
| logprior | 33.70 | 37.28 | 7.23 | 8.00 |
| logcorrection | 28.02 | 31.00 | 3.98 | 4.40 |
| simplify2array | 14.23 | 15.74 | 7.97 | 8.82 |
| dmvhyper | 13.60 | 15.04 | 12.33 | 13.64 |
| lapply | 12.51 | 13.84 | 10.34 | 11.44 |
| unique | 6.26 | 6.92 | 4.99 | 5.52 |

N = 1000:

| Function | Total % | Total Time | Self % | Self Time |
|----------|---------|------------|--------|-----------|
| popsim_mvhypergeometric | 96.90 | 1665.80 | 4.48 | 77.02 |
| logpost | 89.95 | 1546.32 | 3.60 | 61.90 |
| lmnchoose | 49.52 | 851.36 | 4.53 | 77.90 |
| sapply | 44.99 | 773.46 | 8.21 | 141.20 |
| logprior | 40.34 | 693.40 | 14.33 | 246.40 |
| logcorrection | 36.64 | 629.94 | 12.43 | 213.64 |
| lapply | 27.95 | 480.46 | 25.91 | 445.34 |
| dmvhyper | 9.37 | 161.08 | 6.91 | 118.72 |
| simplify2array | 8.32 | 143.10 | 4.72 | 81.08 |
| unique | 3.61 | 62.02 | 2.77 | 47.56 |

Coverage

We can also consider the empirical coverage rates of credible intervals. We will only include the binary methods here, because the multiclass methods take far too long using the simple formulations above. They can of course be optimized further; for large $N$ it can be beneficial to update the count table during the iterations instead of recalculating it each time. We haven't done so here because we wanted the simplest formulations. For these coverage rates, we use only $N = 100$ and $n = 10$, and we create 100 synthetic populations for each of the 1000 samples per value of $K$.
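
A sketch of the coverage calculation for a single true value of $K$, with far fewer replications than the full experiment; the nominal interval level is assumed here to be 90%, since it is not stated above:

set.seed(1)
N <- 100; n <- 10; K_true <- 30
pop <- c(rep(1, K_true), rep(0, N - K_true))
covered <- replicate(50, {
  obs  <- sample(pop, n)                               # simple random sample, no replacement
  pops <- popsim_hypergeometric(obs, N, samples = 100)
  ci   <- quantile(rowSums(pops), c(0.05, 0.95))       # equal-tailed 90% credible interval
  ci[1] <= K_true && K_true <= ci[2]
})
mean(covered)   # empirical coverage rate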

Below we see the empirical coverage rates for equal-tailed credible intervals capturing the true $K$, the number of $1$'s in the population. Both methods do quite well with $K$ close to $N/2$; however, the $Binomial$ approximation does quite poorly compared to the $Hypergeometric$ method as we approach either tail.

Empirical coverage rates (%) by true value of $K$:

| Method | 3 | 10 | 17 | 23 | 30 | 37 | 43 | 50 | 57 | 63 | 70 | 77 | 83 | 90 | 97 |
|--------|---|----|----|----|----|----|----|----|----|----|----|----|----|----|----|
| Binomial | 21.2 | 61.0 | 80.6 | 88.0 | 88.0 | 90.6 | 89.2 | 89.2 | 88.1 | 88.1 | 86.1 | 84.8 | 79.7 | 60.2 | 23.2 |
| Hypergeometric | 95.9 | 88.9 | 83.9 | 87.1 | 87.5 | 88.6 | 88.5 | 87.0 | 86.3 | 87.0 | 83.4 | 83.4 | 82.6 | 86.8 | 95.1 |

