In HWxtest: Exact Tests for Hardy-Weinberg Proportions

library(knitr)
library(RefManageR)
BibOptions(restore.defaults=TRUE)
BibOptions(check.entries = FALSE, bib.style = "authortitle",cite.style="authoryear", style = "html", max.names=99, hyperlink = "to.bib", no.print.fields = c("ISSN", "note", "url") )
# setting bib.style to 'authoryear' is what caused problems in RefManageR v. 0.10.5
library(HWxtest)
version <- packageDescription("HWxtest", fields = "Version")
doWhales <- TRUE
adegenet.loaded <- library(adegenet, logical.return=TRUE)
figw <- 7
dfigw <- 10
figh <- 5
set.seed(60823316)

bib <- ReadBib("bibHW.txt")
engels2009 <- bib[["engels2009"]]
levene1949 <- bib[["levene1949"]]
haldane1954 <- bib[["haldane1954"]]
louis1987 <- bib[["louis1987"]]
guo1992 <- bib[["guo1992"]]
ward2014 <- bib[["ward2014"]]
rousset1995 <- bib[["rousset1995"]]
robertson1984 <- bib[["robertson1984"]]
genepop007 <- bib[["genepop007"]]
adegenet <- bib[["adegenet"]]
pegas <- bib[["pegas"]]
morin2012 <- bib[["morin2012"]]
gail1977 <- bib[["gail1977"]]
ez1989 <- bib[["ez1989"]]
olsen2014 <- bib[["olsen2014"]]
hart2012 <- bib[["hart2012"]]

options(width = 120)
library(parallel)
coreCount <- detectCores()
if(coreCount > 1) options(mc.cores=2)
#CRAN seems to reject cores more than 2
#options(mc.cores=detectCores())

HWxtest -- Exact Tests for Hardy-Weinberg Proportions With Multiple Alleles

version `r version`

William R. Engels -- wrengels@wisc.edu

University of Wisconsin, Madison -- Genetics Department

Main features Quick example Plot of test statistic Extended example: Bowhead whales Reading GenePop data Inaccuracy in asymptotic tests Choice of test statistic Multithreading * Bibliography

Main features

The HWxtest package tests whether a set of genotype counts fits Hardy-Weinberg proportions using the methods described by r Citet(engels2009). The package's main features are:

Performs fast exact test with any number of alleles
The $P$ value is determined using a choice of test statistics (see below). One option is to test specifically for an excess of homozygotes or heterozygotes.
Can handle data sets with multiple loci or populations. It accepts data from other population genetics software: GenePop r Citep(genepop007), adegenet r Citep(adegenet), pegas r Citep(pegas). Multithreading is an option when analyzing a list.
Can perform either full enumeration test or Monte Carlo, depending on which is optimal for a given data set.
Includes functions for determining the number of tables for a given set of allele counts

Suppose you sample 279 individuals from a population and determine their genotypes at a particular 3-allele locus. If the allele names are $a_1, a_2$ and $a_3$, the genotype counts might be: $$ {\bf{a}} = \left[ { \begin{array} {{20}{c}} {{a_{11}}}&{}&{}\ {{a_{21}}}&{{a_{22}}}&{}\ {{a_{31}}}&{{a_{32}}}&{{a_{33}}} \end{array}} \right] = \left[ {\begin{array}{{20}{c}} {83}&{}&{}\ {49}&{18}&{}\ {74}&{34}&{21} \end{array}} \right]\ $$

obs <- c(83, 49, 18, 74, 34, 21)

That is, we observed r obs[1] homozygotes for allele $a_1$, r obs[2] heterozygotes of genotype $a_1/a_2$, and so on. The genotype counts are in the lower half of a $3 \times 3$ matrix. (These values come from r Citet(morin2012, .opts=list(max.names=2))) We'll enter the numbers by rows in an R vector:

obs <- c(83, 49, 18, 74, 34, 21)

To test whether these numbers fit the Hardy-Weinberg proportions, call the function hwx.test

result <- hwx.test(obs)
result

The output tells us that there are r as.integer(result$tableCount) possible samples of r sum(obs) diploids with the same allele counts as the observed data. The $P$ values were computed by looking at each of those possibilities and adding up the total probability of all tables which deviate from HW by at least as much as the observed.

But why does it report four different $P$ values? The reason is that there are several ways one can measure just how deviant a particular outcome is compared to the HW expectation. These measures are explained in more detail below. Briefly, if you had no prior expectation that there might be too many homozygotes, then the first (LLR) $P$ value is recommended. However, if you had a prior suspicion that homozygotes would be in excess, such as from inbreeding, population admixture, null alleles, etc., then the third choice (U) would be preferable. The other two are included for comparison and for users who prefer them.

Plot a histogram of the test statistic

The hwx.test function can also show a plot of how the test statistic is actually distributed, as opposed to its theoretical distribution for large samples. To do this, specify the number of bins to use in the histogram or just use T for the default of 500 bins:

hwx.test(obs, histobins=T)

The red area represents those outcomes found to be more extreme than the observed data by the LLR criterion. In other words, the red area represents the $P$ value. The blue curve shows the theoretical $\chi^2$ distribution for comparison. In this case the fit is pretty good.

We can do the same thing for the any of the other test statistics by specifying the statName:

hwx.test(obs, histobins=T, detail=0, statName="U")

Note that setting the parameter detail to zero suppresses the re-printing of the numbers.

Example with real data: bowhead whales

A set of data from P. r Citet(morin2012) includes genotypes of 280 whales from one population and 49 from another. Each individual was classified at 51 genetic loci. Some of these loci had more than 20 alleles. We can test for Hardy-Weinberg fit for each locus and each population:

data(whales.df)
wtest <- hwx.test(whales.df)

Note that the format of the data is that of a data frame where each row has the ID of an individual whale followed by its population ID and the the genotype at each of the 51 loci. The result includes r length(listify(wtest)) individual tests, so let's not print them all out. Instead, we'll use the function hwdf to put the results in the form of a data frame, and look at just the first 10 in the list:

dfwhales <- hwdf(wtest)
dfwhales[1:10,]

The first column shows the population, such as "P1," and locus, such as "BmC5R700_Y9101" of each sample. The second column is the $P$ value from the LLR criterion by default. N is the number of diploid individuals in the sample and k is the number of alleles.

Note that there is a standard error associated with the last two tests, but not the others. This is because those two tests were done by the Monte Carlo method rather than full enumeration. The hwx.test function decides which method to use based on how many tables would need to be examined in order to do the full enumeration. In these two samples, the number of tables would be r acount(wtest$P1$Bmys42aK46_R225_K232$genotypes) and r as.integer(xcount(wtest$P1$Bmys43Y237_Y377$genotypes)) respectively, and that would take too long. Instead, hwx.test performed 100,000 random trials to estimate the $P$ value in each case.

If you need a more precise $P$ value than the one provided by 100,000 trials of the Monte Carlo test, there are two ways to do it. One way is to simply increase the Monte Carlo sample size. Suppose we want to do that for the 7-allele case, P1.Bmys42aK46_R225_K232, but not for the whole data set. We can start by pulling out the genotype counts for that specific test. These genotype counts are contained in the wtest object and can be picked out as follows:

counts1 <- wtest$P1$Bmys42aK46_R225_K232$genotypes
counts1

Now we can call hwx.test again, but increase the sample size to one million by changing the parameter B:

hwx.test(counts1, detail=1, B=1000000)

The standard error of the $P$ value is now considerably smaller. For the other sample, P1.Bmys43Y237_Y377, the number of tables is not as huge, and we can obtain an exact $P$ value by increasing the cutoff number. The following tells hwx.test to perform the full enumeration test instead of the Monte Carlo unless the number of tables exceeds 2 billion:

counts2 <- wtest$P1$Bmys43Y237_Y377$genotypes
hwx.test(counts2, detail=1, cutoff=2e8)

This time, the $P$ value is reported without a standard error, indicating that an exact test with full enumeration was performed.

The results contained in wtest includes $P$ values for all four criteria. So, for example, if we were interested in the U test rather than the LLR we could have specified that in the data frame call:

hwdf(wtest, statName="U")[1:10,]

When using the U test, it is important to pay attention to the sign of U. If it is negative, as in the first two samples, the test is specifically for heterozygote excess. If it is positive, then we are testing for homozygote excess. This point will be discussed further below.

Reading GenePop data

Published data with genotype frequencies are often available in the GenePop format r Citep(genepop007). These data can be easily converted to genind objects and then fed to hwx.test. For example, r Citet(hart2012) provide a supplemental file containing genotype counts at two loci in two populations of sea urchins. We can pull those numbers in directly from the online source and analyze them as follows:

urchin.url <- "http://tinyurl.com/ku4fq7m"
urchin.data <- genepop.to.genind(urchin.url)
hwdf(hwx.test(urchin.data))

Standard asymptotic tests are often inaccurate

Most introductory genetics classes teach that the way to test for Hardy-Weinberg proportions is the standard $\chi^2$ test with $k(k-1)/2$ degrees of freedom. (I am guilty of this myself in teaching Genetics 466 at the University of Wisconsin!) The problem is that with real data and multiple alleles, there are often rare alleles which make the asymptotic test unreliable. For example, let's compare the actual $P$ values for the bowhead whale data as computed in the previous example with the corresponding $P$ values from the ordinary $\chi^2$ test. First we make a data frame from wtest and extract the true $P$ values from the first column and the approximations from the $10^{th}$ column:

df <- hwdf(wtest, showAsymptoticX2=T)
pLLR <- df[[1]]
pAsy <- df[[10]]
par(mfcol=c(1,2))
plot(pLLR, pAsy, xlab="True P value", ylab = "Asymototic approximation")
abline(0,1)
lim <- c(0,.2)
plot(pLLR, pAsy, xlim=lim, ylim=lim, xlab="True P value", ylab="")
abline(0,1)

The error for each point is the vertical distance from the diagonal. The right-hand plot is zoomed to show just the smallest $P$ values and see the extent of the errors in the most critical ranges:

It's clear that the points are so widely scattered as to make the asymptotic tests essentially useless, even though the sample sizes are not small (especially for whales!). This analysis might be a bit too pessimistic, though, since it compares the true $P$ value based on the likelihood ratio against the approximate one based on the $X^2$ statistic. Therefore, to make the $P$ values more comparable, we will repeat the comparison, but with the true $X^2$ $P$ value versus the approximate one:

dfx <- hwdf(wtest, statName="Chisq")
pX2 <- dfx[[1]]
par(mfcol=c(1,2))
plot(pX2, pAsy, xlab="True P value", ylab = "Asymototic approximation")
abline(0,1)
plot(pX2, pAsy, xlim=lim, ylim=lim, xlab="True P value", ylab = "")
abline(0,1)

There is still an unacceptable amount of scatter, but not quite as much as previously. To see the problem more clearly, let's focus on one particular sample. Specifically, look at locus $Bmy26$ from population $P2$:

wtest$P2$Bmy26

The genotype counts are contained within each test result of the wtest object, and we can extract them with the $genotypes specification. Then we'll re-analyze the data with hwx.test and plot the distribution of the $X^2$ statistic. While we're at it, let's increase the Monte Carlo sample size just to make a smoother curve:

counts <- wtest$P2$Bmy26$genotypes
hwx.test(counts, detail=0, statName="Chisq", histobins=T, histobounds=c(50, 250), B=1e6)

Now the problem is clear! The shape of the true distribution is very different from the blue curve. The true curve has a shoulder on the right side. This shoulder is smaller than the main part of the curve, but it accounts for most of the $P$ value, as indicated by the red-shaded area. There are r length(wtest$P2$Bmy26$alleles) alleles at the $Bmy26$ locus, and some of them are infrequent. The shoulder of the distribution represents those potential outcomes in which a rare genotype occurred. It is well known that the asymptotic $\chi^2$ distribution is not accurate when there are any small expected numbers.

Choice of test statistics

To compute an exact $P$ value, one must have some way to determine which potential outcomes deviate from the expectations by at least as much as the observed. The catch is that there are different ways to define how much a particular outcome deviates from Hardy-Weinberg proportions. The choice of which criterion to use depends on the situation and on one's personal preference. For that reason, HWxtest provides a choice of four test statistics. You can think of the test statistics as providing a way of ordering the potential outcomes from best to worst in terms of how well they fit Hardy-Weinberg. Whenever you run hwx.test all four $P$ values are calculated, but it is best to decide ahead of time which one you will use. That is, it's not fair to pick and choose after the fact.

Here are the test statistics used by HWxtest to order the potential outcomes and decide which are at least as deviant from HW as the actual observed outcome:

Likelihood Ratio ($LLR$) is simply the probability of a given outcome if HW proportions are true divided by the probability of the same outcome if the population's true genotype frequencies were anything else. For the denominator, assume the true genotype frequencies are exactly the same as in the outcome being considered. That is: $$ LR = \frac{{\Pr (outcome|HW)}}{{\Pr (outcome|{\rm{best fit}})}} $$ The numerator comes from r Citet(levene1949) and the denominator is a multinomial, so the formula for $LR$ works out to be: $$ LR = \frac{{\prod\limits_i {{m_i}^{{m_i}}} }}{{{2^{n + d}}{n^n}\prod\limits_{i \ge j} {{a_{ij}}^{{a_{ij}}}} }} $$ where $a_{ij}$ are the genotype counts, $m_i$ are the allele counts, $n$ is the total sample size, and $d$ is the number of homozygotes. Note that the log of the likelihood ratio ($LLR$) orders the outcomes exactly the same as the $LR$ itself and is used interchangeably.
Probability. One can also use the conditional probability itself as a measure of deviation from HW. Those outcomes with a lower probability (likelihood) than the observed case can be said to deviate from HW more. This probability is conditional on the observed allele counts r Citep(levene1949). Here is the formula: $$ {\mathop{\rm P}\nolimits} \left( {\bf{a}} \right) = \frac{{{2^{n - d}}n!\prod {{m_i}!} }}{{\left( {2n} \right)!\prod\limits_{i \ge j} {{a_{ij}}!} }} $$
U Score. A common situation in testing for HW occurs when there is reason to suspect deviation from HW in a particular direction -- either an excess or a dearth of homozygotes. In particular, an excess of homozygotes occurs when there is inbreeding or population admixture. There is also an apparent excess of homozygotes when the presence of null alleles cause heterozygotes to be misclassified as homozygotes. In cases like these, it is better to use a test statistic that reflects the more specific alternative hypothesis. HWxtest uses the $U$ score proposed by r Citet(rousset1995) which is equivalent in terms of ordering the outcomes to ${\hat f_T}$ of r Citet(haldane1954) and r Citet(robertson1984): $$ U = 2n\sum\limits_{i = 1}^k {\frac{{{a_{ii}}}}{{{m_i}}}} - n $$ The $U$ score is positive when there is an excess of homozygotes and negative when heterozygotes are in excess.

Important Note about $U$ score test: When the $P$ value is reported, it is assumed that you are testing with an alternative hypothesis in which $U$ has the same sign as in the observed sample. For example, if your observed sample has $U = -3.2$, the reported $P$ value is the probability of getting an outcome in which $U$ is $-3.2$ or less. This is just a convenience feature of the software so that you don't have to specify which direction you wish to test. The idea is that if you observed a $U$ score in the opposite direction of your alternative hypothesis, you already know that the test is not significant and don't really care about the exact $P$ value. Of course, you would not want to decide on the alternative hypothesis after looking at the various $P$ values.

Chi squared: The same $X^2$ statistic used for the standard asymptotic test can also be used simply as a measure of how deviant an outcome is from the HW expectations. That is, we look at all possible outcomes (or a Monte Carlo sample of them) and find the total probability of an outcome greater than or equal to: $$ {X^2} = \sum\limits_{i \ge j} {\frac{{{{({a_{ij}} - expecte{d_{ij}})}^2}}}{{expecte{d_{ij}}}}} $$ The asymptotic $\chi^2$ distribution is not used in getting the $P$ value.

There are other test statistics that might be added to HWxtest in future versions. For example r Citet(ward2014) argue that the root-mean-square is a useful measure of departure from Hardy-Weinberg proportions.

So ... which $P$ value should you use? My suggestion is to use the $LLR$ if you have no prior reason for suspecting either an excess or lack of homozygotes. If you do, then use the $U$ score. The other two measures are included in HWxtest mainly for comparison. My reasons for preferring the $LLR$ to either the probability or $X^2$ are discussed elsewhere r Citep(engels2009)

Multithreading

On Unix-based systems, such as MacOS X, you can greatly speed up processing of multiple samples by using several cores. Do do this, simply set options variable mc.cores to something greater than 1. For example, if you have 8 cores you can use: