```r
library(lemur.pack)
library(extraDistr)
library(knitr)
library(kableExtra)
library(proftools)
library(dplyr)
library(magrittr)
```
This vignette explains population synthesis for the discrete cases; in particular, we will talk about the situations where subjects are described by the following:

1. A single binary variable
2. A single categorical variable with three or more categories
In both cases, the only piece of information used to inform what the population looks like is a simple random sample of size $n$, which can be thought of as an $n$-vector $\mathbf{x}$, which is a realization of $\mathbf{X}$. Our goal is to synthesize numerous realizations of the population $\mathbf{Y}$ with size $N$. A realization of the population is denoted $\mathbf{y}$, and we will assume that $N$ is known.
For a population $\mathbf{Y}$, sample $\mathbf{X}$, and parameter set $\theta$, we can use Bayes' theorem to get the distribution for $\mathbf{Y}$, conditioned on our observed sample, $\mathbf{x}$.
\begin{align}
f(\theta, \mathbf{Y} = \mathbf{y} | \mathbf{X} = \mathbf{x}) &= \frac{ f(\mathbf{X} = \mathbf{x} | \theta, \mathbf{Y} = \mathbf{y}) f(\theta, \mathbf{Y} = \mathbf{y}) }{ f(\mathbf{X} = \mathbf{x}) } \\
&\propto f(\mathbf{X} = \mathbf{x} | \theta) f(\mathbf{Y} = \mathbf{y} | \theta) \pi(\theta)
\end{align}
For the univariate discrete case this is actually quite simple, but we do have a couple of decisions to make:

1. Which likelihood to use, i.e., do we treat the sample as drawn with or without replacement?
2. Which prior $\pi(\theta)$ to use?
The first question boils down to whether we believe our observed sample came from sampling with or without replacement. Usually, populations of people are sampled without replacement, in which case the proper likelihood would be the $Hypergeometric$ or its multivariate extension, the $Multivariate\ Hypergeometric$. You may be confused by the word multivariate here; oddly, the extension of the $Hypergeometric$ distribution to more than two categories is named the $Multivariate\ Hypergeometric$, even though it is still technically univariate.
If the population is sufficiently large, or we know it was sampled with replacement, we may choose to use a $Binomial$ or $Multinomial$ likelihood, instead. Sampling with replacement results in independent samples; sampling without replacement yields samples that are not independent. This document will not go into detail about the difference between these two, but you should be aware that extreme samples (e.g., a sample very close to being all $1$'s or all $0$'s) are more likely when sampling with replacement. In essence, this results in our distribution over $K$ (the number of successes in the population) having fatter tails when we use the $Binomial$ likelihood versus the $Hypergeometric$ likelihood. The table below summarizes which distribution we will use for our likelihood.
Case | Sampled with replacement | Sampled without replacement |
---|---|---|
1 (binary) | Binomial | Hypergeometric |
2 (multiclass) | Multinomial | MV Hypergeometric |
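As a quick numeric illustration of the tail behavior described above, consider a hypothetical population of size $N = 20$ containing $K = 10$ ones, and a sample of size $10$; the values here are assumptions chosen only for this example. The probability of drawing a sample of ten straight $1$'s is far higher with replacement than without:

```r
# Hypothetical example: N = 20, K = 10, sample size 10
dbinom(10, size = 10, prob = 0.5)    # with replacement (Binomial): ~9.8e-4
dhyper(10, m = 10, n = 10, k = 10)   # without replacement (Hypergeometric): ~5.4e-6
```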
Of course, the choice of prior $\pi(\theta)$ depends on the likelihood we choose. This document will focus on the likely choice; if we are using a $Binomial$ or $Multinomial$ likelihood, we probably will use some form of the conjugate prior, $Beta(a, b)$ or $Dirichlet(\alpha)$, respectively. In particular, the most common choices are $Beta(1, 1)$, the continuous uniform, and $Beta(1/2, 1/2)$, the Jeffreys prior, for the $Binomial$ case. Likewise, these have direct extensions in $Dirichlet(1, \dots, 1)$, and $Dirichlet(1/2, \dots, 1/2)$ for the $Multinomial$ case.
As we will see, the multi-category extensions of these distributions do not change much. Thus, we will start with the binary versions and then discuss what changes when we switch to having three or more categories.
Things are very simple if we choose to use a $Binomial$ likelihood. If we apply the above theory to the binomial, we get
\begin{align}
f(\mathbf{Y} = \mathbf{y}, p | \mathbf{X} = \mathbf{x}) &= \frac{ f(\mathbf{X} = \mathbf{x} | \mathbf{Y} = \mathbf{y}, p) f(\mathbf{Y} = \mathbf{y}, p) }{ f(\mathbf{X} = \mathbf{x}) } \\
&\propto f(\mathbf{X} = \mathbf{x} | p ) f(\mathbf{Y} = \mathbf{y} | p) \pi(p) \\
&\propto \pi(p | \mathbf{X} = \mathbf{x}) f(\mathbf{Y} = \mathbf{y} | p)
\end{align}
Using the conjugate $Beta(a, b)$ prior gives us the canonical posterior $Beta(J + a,\ n - J + b)$, where $J$ is the number of successes observed in the sample of size $n$. As we will see, it is fairly straightforward to create an MCMC algorithm to jitter $\mathbf{Y}$ and accept or reject using this posterior.
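As a minimal sketch of this conjugate update (the sample below is assumed, and the Jeffreys prior $a = b = 1/2$ is just one choice), the posterior for $p$ can be evaluated and plotted directly:

```r
# Sketch: posterior for p under a Beta(1/2, 1/2) prior, given an assumed sample
obs <- c(1, 0, 0, 0, 0, 0, 1, 1, 0, 1)
a <- b <- 0.5
J <- sum(obs)      # successes in the sample
n <- length(obs)   # sample size
curve(dbeta(x, J + a, n - J + b), from = 0, to = 1,
      xlab = "p", ylab = "density")
```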
If we are instead using the $Multinomial$ likelihood, not much changes. Notably, $p$ and $K$ are now vectors of length $c$, where $c$ is the number of categories, and our prior will likely be $Dirichlet(\alpha)$, the conjugate prior for the $Multinomial$ distribution. With the conjugate prior, the posterior for $p$ given the sample counts $J_1, \dots, J_c$ is $Dirichlet(J_1 + \alpha_1, \dots, J_c + \alpha_c)$.
The common parameterization of the $Hypergeometric$ distribution can make this rather confusing. Instead of working with that, let's define our own.
Let $f(\mathbf{Y}, K, \mathbf{X}, J)$ denote the $Hypergeometric$ probability mass function, where:

- $\mathbf{Y}$ is the population of size $N$, and $K$ is the number of successes ($1$'s) in the population
- $\mathbf{X}$ is the sample of size $n$, and $J$ is the number of successes ($1$'s) in the sample
Now, we can apply the above theory to our specific distribution, and we get
\begin{align}
f(\mathbf{Y} = \mathbf{y}, K | \mathbf{X} = \mathbf{x}, J) &= \frac{ f(\mathbf{X} = \mathbf{x}, J | \mathbf{Y} = \mathbf{y}, K) f(\mathbf{Y} = \mathbf{y}, K) }{ f(\mathbf{X} = \mathbf{x}, J) } \\
&\propto f(\mathbf{X} = \mathbf{x}, J | K ) f(\mathbf{Y} = \mathbf{y} | K) \pi(K) \\
&\propto \pi(K | \mathbf{X} = \mathbf{x}, J) f(\mathbf{Y} = \mathbf{y} | K)
\end{align}
We also need to choose $\pi(K)$, the prior on the number of successes in the population. Above, if we were working with $p$ from the $Binomial$ distribution, we would likely want to use some version of the conjugate prior, $Beta(a, b)$.
From Jeffreys (1946, 1961), the induced prior on $K$ can be found by constructing a hierarchical prior of the form
$$ Binom(K | N, p)\, Beta(p | a, b) $$
and then marginalizing to find $\pi(K)$. Thus,
\begin{align*}
\pi(K) &= \int_{0}^{1} {N \choose K} p^K (1-p)^{N-K} \frac{1}{\beta(a, b)} p^{a-1} (1-p)^{b-1} \, dp \\
&= {N \choose K} \frac{\beta(K+a, N-K+b)}{\beta(a, b)} \\
&= {N \choose K} \frac{\Gamma(a+b)}{\Gamma(a) \Gamma(b)} \frac{\Gamma(K+a) \Gamma(N-K+b)}{\Gamma(N+a+b)}
\end{align*}
For certain values of $a$ and $b$, this will simplify much further, but not in general. Let's visualize this prior for $N = 100$.
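A minimal sketch of that visualization, assuming the Jeffreys choice $a = b = 1/2$, evaluates the expression above on the log scale and plots it:

```r
# Sketch: induced prior on K for N = 100 under a Beta(1/2, 1/2) hierarchical prior
N <- 100
a <- b <- 0.5
K <- 0:N
log_prior <- lchoose(N, K) + lbeta(K + a, N - K + b) - lbeta(a, b)
plot(K, exp(log_prior), type = "h", xlab = "K", ylab = expression(pi(K)))
```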
Whatever choice of $a$ and $b$ we make, creating the posterior from here is fairly straightforward; of course, this does not simplify in any nice ways like the $Binomial$ would.
We will define the $Multivariate\ Hypergeometric$ distribution in the same fashion as before, except this time $\mathbf{K}$ and $\mathbf{J}$ are vectors.
Additionally, we need to find a prior $\pi(\mathbf{K})$. Suppose there are $c$ categories; following the same steps as the simple 1-dimensional example, we have:
\begin{align*}
\pi(\mathbf{K}) &= \int {N \choose {K_1, \dots, K_c}} \prod_{i = 1}^{c} p_i^{K_i} \frac{1}{\beta(\boldsymbol{\alpha})} \prod_{i = 1}^{c} p_i^{\alpha_i - 1} \, d\mathbf{p} \\
&= {N \choose {K_1, \dots, K_c}} \frac{\beta(\mathbf{K} + \boldsymbol{\alpha})}{\beta(\boldsymbol{\alpha})}
\end{align*}

where the integral is over the probability simplex and $\beta(\boldsymbol{\alpha})$ denotes the multivariate beta function.
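As a sketch of evaluating this prior at one candidate $\mathbf{K}$, the pieces can be written out explicitly with base R; the `lmnchoose()` and `lmvbeta()` helpers used in the code later appear to compute these same quantities, and the values of $N$, $\mathbf{K}$, and $\boldsymbol{\alpha}$ below are assumptions chosen only for illustration:

```r
# Sketch: log pi(K) for one candidate K, written out with base R
log_mnchoose <- function(N, K) lgamma(N + 1) - sum(lgamma(K + 1))  # log multinomial coefficient
log_mvbeta   <- function(a) sum(lgamma(a)) - lgamma(sum(a))        # log multivariate beta
N <- 100
K <- c(50, 30, 20)
alpha <- rep(0.5, 3)
log_mnchoose(N, K) + log_mvbeta(K + alpha) - log_mvbeta(alpha)
```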
From here, everything is fairly straightforward.
Returning to the distribution we are trying to sample from, we had
$$ f(\theta, \mathbf{Y} = \mathbf{y}| \mathbf{X} = \mathbf{x}) \propto \pi(\theta | \mathbf{X} = \mathbf{x}) f(\mathbf{Y} = \mathbf{y} | \theta). $$ Whether we use the $Binomial$ or the $Hypergeometric$ distribution, the $f(\mathbf{Y} = \mathbf{y} | \theta)$ term is essential and cannot be forgotten; without it, the observed sample ends up having essentially no influence on the synthesized populations.
To illustrate, let $N = 10$. With such a small population, we can easily list every possible population. There are of course $2^{10}$ of them, since every subject in the population has two possible options. If we assume all possible populations are equally likely, we can visualize the probability distribution of $K$, the number of $1$'s in the population, with a simple bar plot.
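A minimal sketch of that bar plot follows; nothing here depends on the methods below, it simply counts populations:

```r
# Sketch: distribution of K when all 2^10 populations of size N = 10 are equally likely
N <- 10
K <- 0:N
barplot(choose(N, K) / 2^N, names.arg = K, xlab = "K", ylab = "probability")
choose(N, 5) / choose(N, 0)   # K = 5 corresponds to 252 times as many populations as K = 0
```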
This means that without any prior knowledge, or even an observed sample, we would expect $K = 5$ to be $252$ times more likely than $K = 0$ or $K = 10$. This is clearly incorrect; if we know nothing about $K$, then the distribution over $K$ should be flat. The $f(\mathbf{Y} = \mathbf{y} | \theta)$ term fixes that problem; in the binary case it is simply the reciprocal of the binomial coefficient ${N \choose K}$.
This correction factor carries over to the multiclass method, but it now uses the multinomial coefficient.
Here we will outline straightforward approaches to synthesis for all four possible distributions. There are a few things worth discussing further.
Should we make all populations include our sample? For example, if we observe a vector $x = (1, 0, 0, 0, 1, 0)$ and there are $c = 2$ classes, should all synthetic populations contain at least four $0$'s and two $1$'s? If we are using the $Hypergeometric$ distribution, this actually happens automatically, but if we use the $Binomial$ distribution, we have the choice of enforcing this criterion or not.
For the purpose of a more fair comparison, in this document we will not enforce that all populations contain the sample.
Here is a very basic function that will jitter $\mathbf{Y}$, and use the posterior in terms of $p$ to determine whether to accept or reject synthesized populations.
```r
popsim_binomial <- function(obs, N, samples = 1000, a = 0.5, b = 0.5) {
  # Correction term: reciprocal of the binomial coefficient, so the induced
  # distribution over K is not dominated by middling values of K
  logcorrection <- function(N, p) {
    -lchoose(N, N * p)
  }
  # Beta(a, b) prior on p
  logprior <- function(p, a, b) {
    dbeta(p, a, b, log = TRUE)
  }
  # Log posterior (up to a constant), evaluated at p = K / N
  logpost <- function(p, J, n, a, b) {
    dbinom(J, n, p, log = TRUE) + logprior(p, a, b) + logcorrection(N, p)
  }

  J <- sum(obs)     # number of successes in the sample
  n <- length(obs)  # sample size

  # Initialize the chain with a population matching the sample proportion
  out <- matrix(NA, nrow = samples + 1, ncol = N)
  out[1, ] <- sample(c(0, 1), size = N, replace = TRUE,
                     prob = c(1 - mean(obs), mean(obs)))
  current_lp <- logpost(mean(out[1, ]), J, n, a, b)

  for(i in 2:(samples + 1)) {
    current <- out[i - 1, ]
    # Jitter Y: propose flipping each subject in turn, then accept or reject
    for(k in 1:N) {
      proposal <- current
      proposal[k] <- as.integer(!current[k])
      proposal_lp <- logpost(mean(proposal), J, n, a, b)
      u <- runif(1)
      if(!is.na(proposal_lp) & log(u) <= proposal_lp - current_lp) {
        current <- proposal
        current_lp <- proposal_lp
      }
    }
    out[i, ] <- current
  }
  return(out[-1, ])
}
```
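A brief usage sketch; the sample, population size, and number of draws below are arbitrary choices for illustration:

```r
# Sketch: synthesize 100 populations of size N = 50 from a small binary sample
set.seed(1)
obs <- c(1, 0, 0, 1, 0, 0)
pops <- popsim_binomial(obs, N = 50, samples = 100)
dim(pops)               # 100 x 50: each row is one synthetic population
summary(rowSums(pops))  # distribution of K across the synthetic populations
```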
Not much has to change here. Depending on how you code multiple categories, working with them can be quite difficult. The easiest set of categories to work with is $\{1, 2, \dots, c\}$. This simplifies things a lot, mainly because it allows us to use `tabulate()`, which speeds up the process considerably. The code below will fail, or behave unexpectedly, if you pass it a sample with either of the following properties:

- It contains a category coded as $0$ (e.g., `obs = c(1, 0, 1, 0, 1, 2, 1, 2)`)
- It skips a category (e.g., `obs = c(1, 3, 1, 1, 3, 1, 3, 3)`) -- it will assume that there is a category $2$ that simply was not observed in the sample, which likely isn't what you want

This may be confusing because in the binary case we need our samples to be coded as $\{0, 1\}$, but when working with more than two levels things can actually get quite complicated, and enforcing this simple criterion saves us a lot of headache. It is relatively simple to sandwich this method between two transformations to go from the original levels to positive integers, and then back to the original levels at the end (see the sketch after the function below).
```r
popsim_multinomial <- function(obs, N, samples = 1000,
                               alpha = rep(0.5, length(unique(obs)))) {
  logcorrection <- function(N, K) {
    -lmnchoose(N, K)
  }
  logprior <- function(p, alpha) {
    ddirichlet(p, alpha, log = TRUE)
  }
  logpost <- function(K, N, J, n, alpha) {
    p <- K / N
    dmultinom(J, prob = p, log = TRUE) + logprior(p, alpha) + logcorrection(N, K)
  }

  J <- tabulate(obs)
  uniq <- sort(unique(obs))
  classes <- max(uniq)
  n <- length(obs)

  out <- matrix(NA, nrow = samples + 1, ncol = N)
  out[1, ] <- rep(obs, length.out = N)
  current_lp <- logpost(tabulate(out[1, ], nbins = classes), N, J, n, alpha)

  for(i in 2:(samples + 1)) {
    current <- out[i - 1, ]
    for(k in 1:N) {
      proposal <- current
      proposal[k] <- sample(uniq[-proposal[k]], 1)
      proposal_lp <- logpost(tabulate(proposal, nbins = classes), N, J, n, alpha)
      u <- runif(1)
      if(!is.na(proposal_lp) & log(u) <= proposal_lp - current_lp) {
        current <- proposal
        current_lp <- proposal_lp
      }
    }
    out[i, ] <- current
  }
  return(out[-1, ])
}
```
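As a hedged sketch of the level remapping mentioned above (the raw labels here are made up, and `factor()` is just one convenient way to handle the bookkeeping between the original levels and $\{1, \dots, c\}$):

```r
# Sketch: map arbitrary labels to 1..c, synthesize, then map back
obs_raw <- c("A", "C", "A", "B", "C", "A")
f       <- factor(obs_raw)        # levels "A", "B", "C"
obs_int <- as.integer(f)          # recoded as 1, 2, 3 with no gaps
pop_int <- popsim_multinomial(obs_int, N = 50, samples = 100)
pop_raw <- matrix(levels(f)[pop_int], nrow = nrow(pop_int))  # back to the original labels
```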
Here, we do the same thing we did for the $Binomial$ case, but instead use the posterior in terms of $K$ to determine whether to accept or reject synthesized populations.
```r
popsim_hypergeometric <- function(obs, N, samples = 1000, a = 0.5, b = 0.5) {
  logcorrection <- function(N, K) {
    -lchoose(N, K)
  }
  logprior <- function(K, N, a, b) {
    lchoose(N, K) + lbeta(K + a, N - K + b) - lbeta(a, b)
  }
  logpost <- function(K, N, J, n, a, b) {
    dhyper(J, K, N - K, n, log = TRUE) + logprior(K, N, a, b) + logcorrection(N, K)
  }

  J <- sum(obs)
  n <- length(obs)

  out <- matrix(NA, nrow = samples + 1, ncol = N)
  out[1, ] <- sample(c(0, 1), size = N, replace = TRUE,
                     prob = c(1 - mean(obs), mean(obs)))
  current_lp <- logpost(sum(out[1, ]), N, J, n, a, b)

  for(i in 2:(samples + 1)) {
    current <- out[i - 1, ]
    for(k in 1:N) {
      proposal <- current
      proposal[k] <- as.integer(!current[k])
      proposal_lp <- logpost(sum(proposal), N, J, n, a, b)
      u <- runif(1)
      if(!is.na(proposal_lp) & log(u) <= proposal_lp - current_lp) {
        current <- proposal
        current_lp <- proposal_lp
      }
    }
    out[i, ] <- current
  }
  return(out[-1, ])
}
```
This is also pretty straightforward; however, we do have to follow the rules we outlined for the $Multinomial$ method. The only difference between this and the $Multinomial$ method is that we use the posterior with respect to $K$ that accounts for sampling without replacement.
```r
popsim_mvhypergeometric <- function(obs, N, samples = 1000,
                                    alpha = rep(0.5, length(unique(obs)))) {
  logcorrection <- function(N, K) {
    -lmnchoose(N, K)
  }
  logprior <- function(K, N, alpha) {
    lmnchoose(N, K) + lmvbeta(K + alpha) - lmvbeta(alpha)
  }
  logpost <- function(K, N, J, n, alpha) {
    dmvhyper(J, K, n, log = TRUE) + logprior(K, N, alpha) + logcorrection(N, K)
  }

  J <- tabulate(obs)
  uniq <- sort(unique(obs))
  classes <- max(uniq)
  n <- length(obs)

  out <- matrix(NA, nrow = samples + 1, ncol = N)
  out[1, ] <- rep(obs, length.out = N)
  current_lp <- logpost(tabulate(out[1, ], nbins = classes), N, J, n, alpha)

  for(i in 2:(samples + 1)) {
    current <- out[i - 1, ]
    for(k in 1:N) {
      proposal <- current
      proposal[k] <- sample(uniq[-proposal[k]], 1)
      proposal_lp <- logpost(tabulate(proposal, nbins = classes), N, J, n, alpha)
      u <- runif(1)
      if(!is.na(proposal_lp) & log(u) <= proposal_lp - current_lp) {
        current <- proposal
        current_lp <- proposal_lp
      }
    }
    out[i, ] <- current
  }
  return(out[-1, ])
}
```
The first thing we need is our samples. We are going to use the following:
```r
obs_binary <- c(1, 0, 0, 0, 0, 0, 1, 1, 0, 1)  # 60% / 40%
obs_mc     <- c(1, 2, 3, 1, 1, 2, 1, 2, 3, 1)  # 50% / 30% / 20%
```
For the binary methods, we can compare the posterior of how many $1$'s are in the populations to see if the two methods give similar results. Here are the densities for N = 100 (left) and N = 1000 (right), both using $3 \times 10^4$ synthetic populations. The $Binomial$ method is represented in red, and the $Hypergeometric$ in blue.
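A sketch of how such a comparison could be produced, using far fewer synthetic populations than quoted above and base graphics:

```r
# Sketch: posterior of K from the two binary methods for N = 100
set.seed(2)
pop_bin <- popsim_binomial(obs_binary, N = 100, samples = 1000)
pop_hyp <- popsim_hypergeometric(obs_binary, N = 100, samples = 1000)
plot(density(rowSums(pop_bin)), col = "red", main = "Posterior of K, N = 100")
lines(density(rowSums(pop_hyp)), col = "blue")
```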
For the multiclass methods, we can instead look at a specific margin, e.g. the margin for category $1$. Here are the densities for N = 100 (left) and N = 1000 (right), both using 5000 synthetic populations. The $Multinomial$ method is represented in red, and the $Multivariate\ Hypergeometric$ in blue.
Likewise, we can look at how long each method takes. None of these functions is optimized in any way, though they are written to be as similar as possible so that the comparisons are fair. The total time is representative of creating $3 \times 10^4$ synthetic populations for the binary methods, and 5000 synthetic populations for the multiclass methods.
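A rough sketch of such a timing comparison, using far fewer synthetic populations than the figures quoted above:

```r
# Sketch: rough timings for the two binary methods
system.time(popsim_binomial(obs_binary, N = 100, samples = 1000))
system.time(popsim_hypergeometric(obs_binary, N = 100, samples = 1000))
```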
We can also consider the empirical coverage rates of credible intervals. We will only include the binary methods here, because the multiclass methods take far too long using the simple formulations above. They can be optimized further, of course; for large $N$ it can be beneficial to update the table of counts during the iterations instead of recalculating it, but we have not done so here because we wanted the simplest formulations. For these coverage rates, we use only $N = 100$ and $n = 10$, and we create 100 synthetic populations for each of the 1000 samples per value of $K$.

Below we see the empirical coverage rates for equal-tailed credible intervals capturing the true $K$, the number of $1$'s in the population. Both methods do quite well with $K$ close to $N/2$; however, the $Binomial$ approximation does quite poorly compared to the $Hypergeometric$ method as we approach either tail.
True $K$ | 3 | 10 | 17 | 23 | 30 | 37 | 43 | 50 | 57 | 63 | 70 | 77 | 83 | 90 | 97 |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Binomial (coverage %) | 21.2 | 61.0 | 80.6 | 88.0 | 88.0 | 90.6 | 89.2 | 89.2 | 88.1 | 88.1 | 86.1 | 84.8 | 79.7 | 60.2 | 23.2 |
Hypergeometric (coverage %) | 95.9 | 88.9 | 83.9 | 87.1 | 87.5 | 88.6 | 88.5 | 87.0 | 86.3 | 87.0 | 83.4 | 83.4 | 82.6 | 86.8 | 95.1 |
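As a sketch of how one entry of this table could be reproduced (with far fewer replications than the 1000 samples per value of $K$ used above, and assuming a 90% interval level, which the text does not state):

```r
# Sketch: empirical coverage of a 90% equal-tailed interval for one true K
set.seed(3)
N <- 100; n <- 10; K_true <- 30
covered <- replicate(200, {
  pop <- c(rep(1, K_true), rep(0, N - K_true))
  x   <- sample(pop, n)                          # simple random sample without replacement
  syn <- popsim_hypergeometric(x, N, samples = 100)
  ci  <- quantile(rowSums(syn), c(0.05, 0.95))   # equal-tailed credible interval
  ci[1] <= K_true & K_true <= ci[2]
})
mean(covered)
```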