\newcommand{\Y}{\boldsymbol Y}
\newcommand{\rr}{\boldsymbol \rho}
\newcommand{\RR}{\boldsymbol{\mathcal R}}
\renewcommand{\aa}{\boldsymbol \alpha}
\newcommand{\LL}{\mathcal L}
\newcommand{\ind}{\stackrel{\textrm{ind}}{\sim}}
\newcommand{\iid}{\stackrel{\textrm{iid}}{\sim}}
\newcommand{\var}{\textrm{var}}
\newcommand{\pv}{\textrm{p}*\textrm{v}}
\newcommand{\pvb}{\textrm{p}*\textrm{v}^{\textrm{boot}}}
\newcommand{\pvp}{\textrm{p}_\textrm{v}^{\textrm{post}}}

require(knitr) require(rmarkdown) knitr::opts_chunk$set(cache = FALSE, autodep = FALSE) if(FALSE) { # compile #rmarkdown::render(file.path("vignettes", "MADPop-quicktut.Rmd")) rmarkdown::render("MADPop-quicktut.Rmd") } # view it by opening MADPop/vignettes/MADPop-tutorial.html

This tutorial shows how to use the **MADPop** package to test for
genetic differences between two populations, of which the individuals of a same species contain a variable number of alleles.

```
require(MADPop)
```

nObs <- nrow(fish215) nPop <- nlevels(fish215$Lake) nAlleles <- length(table(c(as.matrix(fish215[,-1])))) - 1

Our data consists of $N = `r nObs`

$ recordings of Major Histocompatibility Complex (MHC) genotypes of lake trout from $K = `r nPop`

$ lakes in Ontario, Canada. For each of the fish, between 1-4 alleles in the MHC genotype are recorded. This is partially because duplicate genes are undetectable by current instrumentation, and possibly because the fish possess a variable number of alleles of a given MHC gene.

Our dataset `fish215`

is included with **MADPop**. A random sample from it looks like this:

head(fish215[sample(nObs),])

The first column is the lake name (or population ID) for each sample, the remaining four columns are for potentially recorded allele codes (`A1`

-`A4`

). Here the code to identify a unique allele is a small letter followed by a number, but it could have been the sequence of integers $1, 2,\ldots, A$, which for the `fish215`

data is $A = `r nAlleles`

$ unique alleles.

It is relatively straightforward to import a CSV file into the format above. An example of this is given along with our raw data in the **extdata** directory of the local copy of the **MADPop** package.

Suppose that we wish to compare two lakes, say `r popId[1]`

and `r popId[2]`

. The allele counts in these lakes are in the table below. It is a subset of the full contingency table on all $K = `r nPop`

$ lakes, which is produced by the **MADPop** function `UM.suff`

:

popId <- c("Dickey", "Simcoe") # lakes to compare Xsuff <- UM.suff(fish215) # summary statistics for dataset ctab <- Xsuff$tab[popId,] # contingency table ctab <- ctab[,colSums(ctab) > 0] # remove alleles with no counts #ctab rbind(ctab, Total = colSums(ctab))

The unique allele identifiers are encoded as integers between $1$ and $A$ and separated by dots. The original allele names are stored in `Xsuff$A`

, such that the genotype of the first column `r colnames(ctab)[1]`

is

gtype <- colnames(ctab)[1] gtype <- as.numeric(strsplit(gtype, "[.]")[[1]]) gtype names(gtype) <- paste0("A", gtype) sapply(gtype, function(ii) Xsuff$A[ii])

There are $C = `r ncol(ctab)`

$ genotype combinations observed in these two lakes, corresponding to each column of the table.

In the two-population problem we have $K = 2$ lakes with $N_1$ and $N_2$ fish sampled from each. Let $\Y_k = (Y_{k1}, \ldots Y_{kC})$ denote the counts for each genotype observed in lake $k$, such that $\sum_{i=1}^C Y_{ki} = N_k$. The sampling model for these data is $$ \Y_k \ind \textrm{Multinomial}(N_k, \rr_k), \quad k = 1,2, $$ where $\rr_k = (\rho_{k1},...,\rho_{kC})$ are the population proportions of each genotype, and $\sum_{i=1}^C \rho_{ki} = 1$.

Our objective is to test $$ \begin{split} H_0 &: \textrm{The two populations have the same genotype proportions} \ & \phantom{: } \iff \rr_1 = \rr_2. \end{split} $$ The classical test statistics for assessing $H_0$ are Pearson's Chi-Square statistic $\mathcal X$ and the Likelihood Ratio statistic $\Lambda$, $$ \mathcal X = \sum_{k=1}^2 \sum_{i=1}^C \frac{(N_k\hat\rho_i - Y_{ki})^2}{N_k\hat\rho_i}, \qquad \Lambda = 2 \sum_{k=1}^2 \sum_{i=1}^C Y_{ki} \log\left(\frac{Y_{ki}}{N_k\hat\rho_i}\right), \qquad \hat \rho_i = \frac{Y_{1i} + Y_{2i}}{N_1 + N_2}. $$ Under $H_0$, the asymptotic distribution of either of these test statistics $T = \mathcal X$ or $\Lambda$ is $\chi^2_{(C-1)}$, such that the $p$-value $$ \pv = \textrm{Pr}(T > T_\textrm{obs} \mid H_0) $$ for an observed value of $T_\textrm{obs}$ can be estimated as follows:

# observed values of the test statistics chi2.obs <- chi2.stat(ctab) # Pearson's chi^2 LRT.obs <- LRT.stat(ctab) # LR test statistic T.obs <- c(chi2 = chi2.obs, LRT = LRT.obs) # p-value with asymptotic calculation C <- ncol(ctab) pv.asy <- pchisq(q = T.obs, df = C-1, lower.tail = FALSE) signif(pv.asy, 2)

The Chi-Square and LR tests are asymptotically equivalent and so should give roughly the same $p$-values. The huge discrepancy observed here indicates that the sample sizes are too small for asymptotics to kick in. A more reliable $p$-value estimate can be obtained by the Bootstrap method, which in this case consists of simulating $M$ contigency tables with $\Y_k^{(m)} \ind \textrm{Multinomial}(N_k, \hat{\rr})$, $m = 1, \ldots, M$, where $\hat{\rr}$ is the estimate of the common probability vector $\rr_1 = \rr_2$ under $H_0$. For each contingency table $(\Y_1^{(m)}, \Y_2^{(m)})$, we calculate the test statistic $T^{(m)}$, and the bootstrapped p-value is defined as
$$
\pvb = \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_\textrm{obs}).
$$
$\pvb$ is calculated with **MADPop** as follows:

N1 <- sum(ctab[1,]) # size of first sample N2 <- sum(ctab[2,]) # size of second sample rho.hat <- colSums(ctab)/(N1+N2) # common probability vector # bootstrap distribution of the test statistics # set verbose = TRUE for progress output system.time({ T.boot <- UM.eqtest(N1 = N1, N2 = N2, p0 = rho.hat, nreps =1e4, verbose = FALSE) }) # bootstrap p-value pv.boot <- rowMeans(t(T.boot) >= T.obs) signif(pv.boot, 2)

Note that the bootstrap $p$-values for both tests are roughly the same and decisively reject $H_0$, whereas the less reliable asymptotic $p$-values both failed to reject (at quite different significance levels).

Bootstrapping overcomes many deficiencies of the asymptotic $p$-value calculation. However, bootstrapping has a tendency to reject $H_0$ when sample sizes are small. To see why this is, consider all columns of `ctab`

which have only one genotype count between the two lakes:

itab1 <- colSums(ctab) == 1 # single count genotypes cbind(ctab[,itab1], Other = rowSums(ctab[,!itab1]), Total = rowSums(ctab))

c1 <- sum(itab1) # number of single-count columns n1 <- sum(ctab[,itab1]) # number of single counts

There are $c_1 = `r c1`

$ such columns, accounting for $\hat p_1 = `r n1/sum(ctab)`

$ of the common genotype distribution under $H_0$, as estimated from the two-lake sample. For each of these columns, observing counts in one lake but not the other provides evidence against $H_0$. Moreover, under the estimated common distribution $\hat {\rr}$, it is very unlikely to have counts in only one of the lakes for each of these $c_1 = `r c1`

$ genotypes. Therefore, the data appear to provide very strong evidence against $H_0$. However, it is not so unlikely to have $c_1 = `r c1`

$ one-count genotypes if the true number of unique genotypes in these two lakes is much larger than the observed value of $C = `r C`

$. With $C = `r C`

$ unique genotypes in only $N = N_1 + N_2 = `r N1 + N2`

$ fish samples, it is quite plausible that a new sample of fish would yield several genotypes which are not present in the original two-lake sample `ctab`

.

One way to obtain information about the unobserved genotypes in lakes `r popId[1]`

and `r popId[2]`

is to consider the genotypes in *all* $K = `r nPop`

$ Ontario lakes for which we have collected data. A natural way to do this is through a hierarchical model:
$$
\begin{split}
\Y_k \mid \rr_k & \ind \textrm{Multinomial}(N_k, \rr_k), \quad k = 1,\ldots,K \
\rr_k & \iid \textrm{Dirichlet}(\aa), \qquad \aa = (\alpha_1, \ldots, \alpha_C).
\end{split}
$$
That is, each population is allowed to have its own probability vector $\rr_k$. However, these probability vectors are drawn from a common Dirichlet distribution. The common distribution specifies that before any data is drawn, we have
$$
E[\rr] = \bar{\aa}, \qquad \var(\rho_i) = \frac{(\bar \alpha_i)(1-\bar \alpha_i)}{\alpha_0 + 1},
$$
where $\alpha_0 = \sum_{i=1}^C \alpha_i$ and $\bar{\aa} = \aa / \alpha_0$. Moreover, the posterior distribution of $\rr_k$ given $\aa$ and the data is also Dirichlet:
$$
\rr_k \mid \Y \ind \textrm{Dirichlet}(\aa + \Y_k).
$$
In this sense, the posterior estimate of the proportion of genotype $i$ in population $k$, $\rho_{ki}$, can be non-zero even if $Y*{ki} = 0$. In practice, $\aa$ is estimated from $\Y$, the data from all $K$ populations (details in the following section). The extent to which the genotypes observed in one lake affect inference in the other lakes is determined by $\alpha*0$ (the larger this value, the larger the effect).

The likelihood function for this model corresponds to a Dirichlet-Multinomial distribution, $$ \begin{split} \mathcal L(\aa \mid \Y) = \prod_{k=1}^K p(\Y_k \mid \aa) & = \prod_{k=1}^K \int p(\Y_k \mid \rr_k) p(\rr_k \mid \aa) \,\mathrm{d}\rr_k \ & = \prod_{k=1}^K \left[\frac{N_k!\cdot\Gamma(\alpha_0)}{\Gamma(N_k+\alpha_0)} \prod_{i=1}^C \frac{\Gamma(Y_{ki} + \alpha_i)}{Y_{ki}!\cdot\Gamma(\alpha_i)}\right], \end{split} $$ from which $\aa$ can be estimated by maximum likelihood [@minka00]. Alternatively we estimate $\aa$ using a Bayesian approach, which is more readily extensible to more complex genotype models, for which $\mathcal L(\aa \mid \Y)$ has no closed form (see Future Work).

First, we specify a prior distribution
$$
\pi(\alpha_0, \bar{\aa}) = \frac{1}{(1+ \alpha_0)^2},
$$
which is a uniform distribution on the prior probability vector $\bar{\aa}$, and a uniform distribution on the variance-like parameter $A = (1+\alpha_0)^{-1}$, in the sense that the prior mean and variance of a given genotype population probability are
$$
E[\rho_{kj}] = \tilde \alpha_j, \qquad \var(\rho_{kj}) = A \cdot \tilde \alpha_j(1-\tilde \alpha_j).
$$
Then, we sample from the posterior distribution $p(\aa \mid \Y) \propto \LL(\aa \mid \Y) \pi(\aa)$ using a Markov chain Monte Carlo (MCMC) algorithm provided by the **stan** programming language [@rstan], as detailed below.

Under the hierarchical model, the null hypothesis of equal genetic proportions between two populations, say $k = 1,2$, is $H_0: \rr_1 = \rr_2 = \rr_{12}$. Our Bayesian hypothesis-testing strategy is implemented in two steps:

- Sample from the posterior distribution $p(\aa \mid \Y, H_0)$.
- Sample from the posterior distribution of either classical test statistic, $T = \mathcal X$ or $\Lambda$, thus obtaining a
*posterior p-value*$$ \pvp = \textrm{Pr}(T > T_\textrm{obs} \mid \Y, H_0). $$

In order to estimate $\aa$ under $H_0$, we apply the Dirichlet-Multinomial distribution to $K-1$ lakes, with the first two being collapsed into a common count vector $\Y_{12} = \Y_1 + \Y_2$ with sample size $N_{12} = N_1 + N_2$. MCMC sampling is implemented via a Hybrid Monte Carlo (HMC) variant provided by the **R** package **rstan**. In **MADPop**, sampling from $p(\aa \mid \Y, H_0)$ is accomplished with the function `hUM.post`

, where "hUM" stands for *hierarchical Unconstrained Multinomial* model. We start by merging the two samples from equal distributions under $H_0$:

popId # equal lakes under H0 eqId0 <- paste0(popId, collapse = ".") # merged lake name popId0 <- as.character(fish215$Lake) # all lake names popId0[popId0 %in% popId] <- eqId0 table(popId0, dnn = NULL) # merged lake counts

Next, we sample from $p(\aa \mid \Y, H_0)$ using **rstan**:

X0 <- cbind(Id = popId0, fish215[-1]) # allele data with merged lakes nsamples <- 1e4 fit0 <- hUM.post(nsamples = nsamples, X = X0, rhoId = eqId0, # output rho only for merged lakes chains = 1, # next two arguments are passed to rstan warmup = min(1e4, floor(nsamples/10)), full.stan.out = FALSE)

`rstan`

typically throws a number of exceptions which are usually harmless. The `MADPop`

package lists `rstan`

as a dependency, such that its sophisticated tuning mechanism is exposed. Optionally, `hUM.post`

returns the entire `rstan`

output (`full.stan.output = TRUE`

), which can be run through `rstan`

's MCMC convergence diagnostics.

By setting the `rhoId`

argument of `hUM.post`

, the `fit0`

object contains samples from $p(\aa, \rr_{12} \mid \Y, H_0)$. Boxplots of the `r nbxp`

highest posterior probability genotypes in the common distribution $\rr_{12}$ are displayed below:

rho.post <- fit0$rho[,1,] # p(rho12 | Y) # sort genotype counts by decreasing order rho.count <- colSums(Xsuff$tab[popId,]) rho.ord <- order(colMeans(rho.post), decreasing = TRUE) # plot nbxp <- 50 # number of genotypes for which to make boxplots clrs <- c("#999999", "#E69F00", "#56B4E9", "#009E73", "#F0E442", "#0072B2", "#D55E00", "#CC79A7") rho.ct <- rho.count[rho.ord[1:nbxp]] rho.lgd <- unique(sort(rho.ct)) rho.box <- rho.post[,rho.ord[1:nbxp]] par(mar = c(5,4.5,.5,.5)+.1, cex.axis = .7) boxplot(x = rho.box, las = 2, col = clrs[rho.ct+1], pch = 16, cex = .2) title(xlab = "Genotype", line = 4) title(ylab = expression(p(bold(rho)[(12)]*" | "*bold(Y)))) legend("topright", legend = rev(rho.lgd), fill = rev(clrs[rho.lgd+1]), title = "Counts", ncol = 2, cex = .8)

This is calculated analogously to the bootstrapped p-value, by generating $M$ contingency tables with $\Y_k^{(m)} \ind \textrm{Multinomial}(N_k, \rr_{12}^{(m)}$, $m = 1, \ldots, M$. However, for each contingency table, the common probability vector $\rr_{12}^{(m)}$ is different, corresponding to a random draw from $p(\rr_{12} \mid \Y, H_0)$. The test statistic $T^{(m)}$ is then calculated and we have
$$
\pvp = \frac 1 M \sum_{m=1}^M \delta(T^{(m)} \ge T_{\textrm{obs}}).
$$
$\pvp$ is calculated with **MADPop** as follows:

system.time({ T.post <- UM.eqtest(N1 = N1, N2 = N2, p0 = rho.post, nreps = 1e4, verbose = FALSE) }) # posterior p-value pv.post <- rowMeans(t(T.post) >= T.obs)

We can now compare the three types of p-values (asymptotic, bootstrapped, posterior) for each test statistic ($\mathcal X$ and $\Lambda$):

pv.tab <- cbind(asy = pv.asy, boot = pv.boot, post = pv.post) pv.tab <- signif(pv.tab*100, 2) colnames(pv.tab) <- c("Asymptotic", "Bootstrap", "Posterior") rownames(pv.tab) <- c("$\\mathcal X$", "$\\Lambda$") kable(pv.tab, digits = 2, caption = "p-value ($\\times100\\%$)")

We see that $\pvp$ is much more conservative than $\pvb$.

Here we used a completely unconstrained Multinomial model for the genotype counts, in the sense that the only restriction on $\rr_k$ is that $\rho_{ki} \ge 0$ and $\sum_{i=1}^C \rho_{ki} = 1$. However, it is possible to impose genetic constraints such as Hardy-Weinberg equilibrium, preferential mating, etc., which effectively reduce the degrees of freedom of $\rr_k$ below $C-1$. In this case, the closed-form likelihood $\LL(\aa \mid \Y)$ is typically unavailable, but the **stan** code can easily be modified to sample from $p(\aa, \RR \mid \Y)$, where $\RR = (\rr_1, \ldots, \rr_K)$. We hope to feature some of these extensions to the basic Dirichlet-Multinomial model in the next version of **MADPop**.

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.