# Bayesian Testing of Equal Genotype Proportions between In MADPop: MHC Allele-Based Differencing Between Populations


require(knitr)
require(rmarkdown)
knitr::opts_chunk$set(cache = FALSE, autodep = FALSE) if(FALSE) { # compile #rmarkdown::render(file.path("vignettes", "MADPop-quicktut.Rmd")) rmarkdown::render("MADPop-quicktut.Rmd") } # view it by opening MADPop/vignettes/MADPop-tutorial.html  This tutorial shows how to use the MADPop package to test for genetic differences between two populations, of which the individuals of a same species contain a variable number of alleles. require(MADPop)  ## Pre-Processing nObs <- nrow(fish215) nPop <- nlevels(fish215$Lake)
nAlleles <- length(table(c(as.matrix(fish215[,-1])))) - 1


Our data consists of $N = r nObs$ recordings of Major Histocompatibility Complex (MHC) genotypes of lake trout from $K = r nPop$ lakes in Ontario, Canada. For each of the fish, between 1-4 alleles in the MHC genotype are recorded. This is partially because duplicate genes are undetectable by current instrumentation, and possibly because the fish possess a variable number of alleles of a given MHC gene.

Our dataset fish215 is included with MADPop. A random sample from it looks like this:

head(fish215[sample(nObs),])


The first column is the lake name (or population ID) for each sample, the remaining four columns are for potentially recorded allele codes (A1-A4). Here the code to identify a unique allele is a small letter followed by a number, but it could have been the sequence of integers $1, 2,\ldots, A$, which for the fish215 data is $A = r nAlleles$ unique alleles.

It is relatively straightforward to import a CSV file into the format above. An example of this is given along with our raw data in the extdata directory of the local copy of the MADPop package.

## Two-Population Comparisons




Suppose that we wish to compare two lakes, say r popId[1] and r popId[2]. The allele counts in these lakes are in the table below. It is a subset of the full contingency table on all $K = r nPop$ lakes, which is produced by the MADPop function UM.suff:

popId <- c("Dickey", "Simcoe")          # lakes to compare
Xsuff <- UM.suff(fish215)             # summary statistics for dataset
ctab <- Xsuff$tab[popId,] # contingency table ctab <- ctab[,colSums(ctab) > 0] # remove alleles with no counts #ctab rbind(ctab, Total = colSums(ctab))  The unique allele identifiers are encoded as integers between$1$and$A$and separated by dots. The original allele names are stored in Xsuff$A, such that the genotype of the first column r colnames(ctab)[1] is

gtype <- colnames(ctab)[1]
gtype <- as.numeric(strsplit(gtype, "[.]")[[1]])
gtype
names(gtype) <- paste0("A", gtype)
sapply(gtype, function(ii) Xsuff$A[ii])  There are$C = r ncol(ctab)$genotype combinations observed in these two lakes, corresponding to each column of the table. ### Multinomial Model In the two-population problem we have$K = 2$lakes with$N_1$and$N_2$fish sampled from each. Let$\Y_k = (Y_{k1}, \ldots Y_{kC})$denote the counts for each genotype observed in lake$k$, such that$\sum_{i=1}^C Y_{ki} = N_k$. The sampling model for these data is $$\Y_k \ind \textrm{Multinomial}(N_k, \rr_k), \quad k = 1,2,$$ where$\rr_k = (\rho_{k1},...,\rho_{kC})$are the population proportions of each genotype, and$\sum_{i=1}^C \rho_{ki} = 1$. ### Hypothesis Testing Our objective is to test $$\begin{split} H_0 &: \textrm{The two populations have the same genotype proportions} \ & \phantom{: } \iff \rr_1 = \rr_2. \end{split}$$ The classical test statistics for assessing$H_0$are Pearson's Chi-Square statistic$\mathcal X$and the Likelihood Ratio statistic$\Lambda$, $$\mathcal X = \sum_{k=1}^2 \sum_{i=1}^C \frac{(N_k\hat\rho_i - Y_{ki})^2}{N_k\hat\rho_i}, \qquad \Lambda = 2 \sum_{k=1}^2 \sum_{i=1}^C Y_{ki} \log\left(\frac{Y_{ki}}{N_k\hat\rho_i}\right), \qquad \hat \rho_i = \frac{Y_{1i} + Y_{2i}}{N_1 + N_2}.$$ Under$H_0$, the asymptotic distribution of either of these test statistics$T = \mathcal X$or$\Lambda$is$\chi^2_{(C-1)}$, such that the$p$-value $$\pv = \textrm{Pr}(T > T_\textrm{obs} \mid H_0)$$ for an observed value of$T_\textrm{obs}$can be estimated as follows: # observed values of the test statistics chi2.obs <- chi2.stat(ctab) # Pearson's chi^2 LRT.obs <- LRT.stat(ctab) # LR test statistic T.obs <- c(chi2 = chi2.obs, LRT = LRT.obs) # p-value with asymptotic calculation C <- ncol(ctab) pv.asy <- pchisq(q = T.obs, df = C-1, lower.tail = FALSE) signif(pv.asy, 2)  The Chi-Square and LR tests are asymptotically equivalent and so should give roughly the same$p$-values. The huge discrepancy observed here indicates that the sample sizes are too small for asymptotics to kick in. A more reliable$p$-value estimate can be obtained by the Bootstrap method, which in this case consists of simulating$M$contigency tables with$\Y_k^{(m)} \ind \textrm{Multinomial}(N_k, \hat{\rr})$,$m = 1, \ldots, M$, where$\hat{\rr}$is the estimate of the common probability vector$\rr_1 = \rr_2$under$H_0$. For each contingency table$(\Y_1^{(m)}, \Y_2^{(m)})$, we calculate the test statistic$T^{(m)}$, and the bootstrapped p-value is defined as $$\pvb = \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_\textrm{obs}).$$$\pvb$is calculated with MADPop as follows: N1 <- sum(ctab[1,]) # size of first sample N2 <- sum(ctab[2,]) # size of second sample rho.hat <- colSums(ctab)/(N1+N2) # common probability vector # bootstrap distribution of the test statistics # set verbose = TRUE for progress output system.time({ T.boot <- UM.eqtest(N1 = N1, N2 = N2, p0 = rho.hat, nreps =1e4, verbose = FALSE) }) # bootstrap p-value pv.boot <- rowMeans(t(T.boot) >= T.obs) signif(pv.boot, 2)  Note that the bootstrap$p$-values for both tests are roughly the same and decisively reject$H_0$, whereas the less reliable asymptotic$p$-values both failed to reject (at quite different significance levels). ## Pairwise Comparisons between Multiple Populations Bootstrapping overcomes many deficiencies of the asymptotic$p$-value calculation. However, bootstrapping has a tendency to reject$H_0$when sample sizes are small. To see why this is, consider all columns of ctab which have only one genotype count between the two lakes: itab1 <- colSums(ctab) == 1 # single count genotypes cbind(ctab[,itab1], Other = rowSums(ctab[,!itab1]), Total = rowSums(ctab))  c1 <- sum(itab1) # number of single-count columns n1 <- sum(ctab[,itab1]) # number of single counts  There are$c_1 = r c1$such columns, accounting for$\hat p_1 = r n1/sum(ctab)$of the common genotype distribution under$H_0$, as estimated from the two-lake sample. For each of these columns, observing counts in one lake but not the other provides evidence against$H_0$. Moreover, under the estimated common distribution$\hat {\rr}$, it is very unlikely to have counts in only one of the lakes for each of these$c_1 = r c1$genotypes. Therefore, the data appear to provide very strong evidence against$H_0$. However, it is not so unlikely to have$c_1 = r c1$one-count genotypes if the true number of unique genotypes in these two lakes is much larger than the observed value of$C = r C$. With$C = r C$unique genotypes in only$N = N_1 + N_2 = r N1 + N2$fish samples, it is quite plausible that a new sample of fish would yield several genotypes which are not present in the original two-lake sample ctab. One way to obtain information about the unobserved genotypes in lakes r popId[1] and r popId[2] is to consider the genotypes in all$K = r nPop$Ontario lakes for which we have collected data. A natural way to do this is through a hierarchical model: $$\begin{split} \Y_k \mid \rr_k & \ind \textrm{Multinomial}(N_k, \rr_k), \quad k = 1,\ldots,K \ \rr_k & \iid \textrm{Dirichlet}(\aa), \qquad \aa = (\alpha_1, \ldots, \alpha_C). \end{split}$$ That is, each population is allowed to have its own probability vector$\rr_k$. However, these probability vectors are drawn from a common Dirichlet distribution. The common distribution specifies that before any data is drawn, we have $$E[\rr] = \bar{\aa}, \qquad \var(\rho_i) = \frac{(\bar \alpha_i)(1-\bar \alpha_i)}{\alpha_0 + 1},$$ where$\alpha_0 = \sum_{i=1}^C \alpha_i$and$\bar{\aa} = \aa / \alpha_0$. Moreover, the posterior distribution of$\rr_k$given$\aa$and the data is also Dirichlet: $$\rr_k \mid \Y \ind \textrm{Dirichlet}(\aa + \Y_k).$$ In this sense, the posterior estimate of the proportion of genotype$i$in population$k$,$\rho_{ki}$, can be non-zero even if$Y{ki} = 0$. In practice,$\aa$is estimated from$\Y$, the data from all$K$populations (details in the following section). The extent to which the genotypes observed in one lake affect inference in the other lakes is determined by$\alpha0$(the larger this value, the larger the effect). ### Parameter Estimation The likelihood function for this model corresponds to a Dirichlet-Multinomial distribution, $$\begin{split} \mathcal L(\aa \mid \Y) = \prod_{k=1}^K p(\Y_k \mid \aa) & = \prod_{k=1}^K \int p(\Y_k \mid \rr_k) p(\rr_k \mid \aa) \,\mathrm{d}\rr_k \ & = \prod_{k=1}^K \left[\frac{N_k!\cdot\Gamma(\alpha_0)}{\Gamma(N_k+\alpha_0)} \prod_{i=1}^C \frac{\Gamma(Y_{ki} + \alpha_i)}{Y_{ki}!\cdot\Gamma(\alpha_i)}\right], \end{split}$$ from which$\aa$can be estimated by maximum likelihood [@minka00]. Alternatively we estimate$\aa$using a Bayesian approach, which is more readily extensible to more complex genotype models, for which$\mathcal L(\aa \mid \Y)$has no closed form (see Future Work). First, we specify a prior distribution $$\pi(\alpha_0, \bar{\aa}) = \frac{1}{(1+ \alpha_0)^2},$$ which is a uniform distribution on the prior probability vector$\bar{\aa}$, and a uniform distribution on the variance-like parameter$A = (1+\alpha_0)^{-1}$, in the sense that the prior mean and variance of a given genotype population probability are $$E[\rho_{kj}] = \tilde \alpha_j, \qquad \var(\rho_{kj}) = A \cdot \tilde \alpha_j(1-\tilde \alpha_j).$$ Then, we sample from the posterior distribution$p(\aa \mid \Y) \propto \LL(\aa \mid \Y) \pi(\aa)$using a Markov chain Monte Carlo (MCMC) algorithm provided by the stan programming language [@rstan], as detailed below. ## Bayesian Hypothesis Testing Under the hierarchical model, the null hypothesis of equal genetic proportions between two populations, say$k = 1,2$, is$H_0: \rr_1 = \rr_2 = \rr_{12}$. Our Bayesian hypothesis-testing strategy is implemented in two steps: 1. Sample from the posterior distribution$p(\aa \mid \Y, H_0)$. 2. Sample from the posterior distribution of either classical test statistic,$T = \mathcal X$or$\Lambda$, thus obtaining a posterior p-value $$\pvp = \textrm{Pr}(T > T_\textrm{obs} \mid \Y, H_0).$$ ### MCMC Sampling from the Posterior Distribution In order to estimate$\aa$under$H_0$, we apply the Dirichlet-Multinomial distribution to$K-1$lakes, with the first two being collapsed into a common count vector$\Y_{12} = \Y_1 + \Y_2$with sample size$N_{12} = N_1 + N_2$. MCMC sampling is implemented via a Hybrid Monte Carlo (HMC) variant provided by the R package rstan. In MADPop, sampling from$p(\aa \mid \Y, H_0)$is accomplished with the function hUM.post, where "hUM" stands for hierarchical Unconstrained Multinomial model. We start by merging the two samples from equal distributions under$H_0$: popId # equal lakes under H0 eqId0 <- paste0(popId, collapse = ".") # merged lake name popId0 <- as.character(fish215$Lake)    # all lake names
popId0[popId0 %in% popId] <- eqId0
table(popId0, dnn = NULL)               # merged lake counts


Next, we sample from $p(\aa \mid \Y, H_0)$ using rstan:

X0 <- cbind(Id = popId0, fish215[-1])  # allele data with merged lakes

nsamples <- 1e4
fit0 <- hUM.post(nsamples = nsamples, X = X0,
rhoId = eqId0,     # output rho only for merged lakes
chains = 1,  # next two arguments are passed to rstan
warmup = min(1e4, floor(nsamples/10)),
full.stan.out = FALSE)


rstan typically throws a number of exceptions which are usually harmless. The MADPop package lists rstan as a dependency, such that its sophisticated tuning mechanism is exposed. Optionally, hUM.post returns the entire rstan output (full.stan.output = TRUE), which can be run through rstan's MCMC convergence diagnostics.




By setting the rhoId argument of hUM.post, the fit0 object contains samples from $p(\aa, \rr_{12} \mid \Y, H_0)$. Boxplots of the r nbxp highest posterior probability genotypes in the common distribution $\rr_{12}$ are displayed below:

rho.post <- fit0$rho[,1,] # p(rho12 | Y) # sort genotype counts by decreasing order rho.count <- colSums(Xsuff$tab[popId,])
rho.ord <- order(colMeans(rho.post), decreasing = TRUE)

# plot
nbxp <- 50            # number of genotypes for which to make boxplots
clrs <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2",
"#D55E00", "#CC79A7")
rho.ct <- rho.count[rho.ord[1:nbxp]]
rho.lgd <- unique(sort(rho.ct))
rho.box <- rho.post[,rho.ord[1:nbxp]]
par(mar = c(5,4.5,.5,.5)+.1, cex.axis = .7)
boxplot(x = rho.box,
las = 2, col = clrs[rho.ct+1], pch = 16, cex = .2)
title(xlab = "Genotype", line = 4)
title(ylab = expression(p(bold(rho)[(12)]*" | "*bold(Y))))
legend("topright", legend = rev(rho.lgd), fill = rev(clrs[rho.lgd+1]),
title = "Counts", ncol = 2, cex = .8)


### Calculating the Posterior P-Value

This is calculated analogously to the bootstrapped p-value, by generating $M$ contingency tables with $\Y_k^{(m)} \ind \textrm{Multinomial}(N_k, \rr_{12}^{(m)}$, $m = 1, \ldots, M$. However, for each contingency table, the common probability vector $\rr_{12}^{(m)}$ is different, corresponding to a random draw from $p(\rr_{12} \mid \Y, H_0)$. The test statistic $T^{(m)}$ is then calculated and we have $$\pvp = \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_{\textrm{obs}}).$$ $\pvp$ is calculated with MADPop as follows:

system.time({
T.post <- UM.eqtest(N1 = N1, N2 = N2, p0 = rho.post, nreps = 1e4,
verbose = FALSE)
})
# posterior p-value
pv.post <- rowMeans(t(T.post) >= T.obs)


We can now compare the three types of p-values (asymptotic, bootstrapped, posterior) for each test statistic ($\mathcal X$ and $\Lambda$):

pv.tab <- cbind(asy = pv.asy, boot = pv.boot, post = pv.post)
pv.tab <- signif(pv.tab*100, 2)
colnames(pv.tab) <- c("Asymptotic", "Bootstrap", "Posterior")
rownames(pv.tab) <- c("$\\mathcal X$", "$\\Lambda$")
kable(pv.tab, digits = 2, caption = "p-value ($\\times100\\%$)")


We see that $\pvp$ is much more conservative than $\pvb$.

## Future Work

Here we used a completely unconstrained Multinomial model for the genotype counts, in the sense that the only restriction on $\rr_k$ is that $\rho_{ki} \ge 0$ and $\sum_{i=1}^C \rho_{ki} = 1$. However, it is possible to impose genetic constraints such as Hardy-Weinberg equilibrium, preferential mating, etc., which effectively reduce the degrees of freedom of $\rr_k$ below $C-1$. In this case, the closed-form likelihood $\LL(\aa \mid \Y)$ is typically unavailable, but the stan code can easily be modified to sample from $p(\aa, \RR \mid \Y)$, where $\RR = (\rr_1, \ldots, \rr_K)$. We hope to feature some of these extensions to the basic Dirichlet-Multinomial model in the next version of MADPop.