In sergihervas/iMKT: Integrative McDonald and Kreitman Test

knitr::opts_chunk$set(
    echo = TRUE,
    fig.align = "center",
    fig.height = 7,
    fig.width = 7,
    collapse = TRUE,
    comment = "#>"
)

A major issue in population genetics is the accurate detection of the impact of natural selection along the genome. In this regard, several tests have been developed in the last years, being the McDonald and Kreitman Test (Standard MKT, McDonald & Kreitman 1991) the most used method. Several modifications to the Standard MKT have been applied over the last years, leading to a battery of MKT-derived methods.

The Integrative McDonald and Kreitman Test (iMKT) package allows computing the McDonald and Kreitman test on polymorphism and divergence genomic data provided by the user or automatically downloaded from PopFly (Hervas et al. 2017) or PopHuman (Casillas et al. 2018).

It includes five MKT derived tests: Standard MKT, FWW correction (Fay et al. 2001), DGRP correction (Mackay et al. 2012), asymptotic MK (Messer & Petrov 2013), and the novel iMKT approach; which allow inferring the rate of adaptive evolution ($\alpha$), as well as the fraction of strongly deleterious ($d$), weakly deleterious ($b$), and neutral ($f$) sites, using user custom input data.

In this document, we show how to install the package and how to analyze the sample data using different MKT-derived methods. Finally, we briefly discuss the results obtained with each of the available tests. This vignette contains three different sections: \begin{itemize} \item Loading the package and checking test data \item Performing MKT analyses \item Conclusion: comparison of the different MKT estimates \end{itemize}

Loading the package and checking test data

First of all, install (if this is not done yet) and load the package. Notice that iMKT package includes two sample dataframes named myDafData and myDivergenceData which are the ones used in this tutorial and correspond to the chromosome arm 2R of the North American Raleigh population (RAL, North Carolina, $n=205$), using D. simulans as outgroup species to estimate divergence metrics. Therefore, it is possible to replicate the vignettes in order to better understand all the package functionalities.

## Load package
# install.packages("devtools")
# devtools::install_github("sergihervas/iMKT")
library(iMKT)

## Sample daf data
head(myDafData)

## Sample divergence data
myDivergenceData

The iMKT package includes several functions, classified as follows:

\begin{itemize} \item Calculation of MKT-derived methods \begin{itemize} \item standardMKT(): Standard MKT \item FWW(): FWW correction \item DGRP(): DGRP correction
\item asymptoticMKT(): Asymptotic MKT \item iMKT(): integrative MKT
\item completeMKT(): perform all previous tests \end{itemize}

\item iMKT using PopFly and PopHuman data \begin{itemize} \item loadPopFly(): load PopFlyData \item loadPopHuman(): load PopHumanData \item PopFlyAnalisys(): perform any test using PopFlyData \item PopHumanAnalysis(): perform any test using PopHumanData \end{itemize}

\item Miscelanious \begin{itemize} \item checkInput(): check data before performing analyses \item themePublication(): output plots and tables styling \end{itemize} \end{itemize}

Each function has an associated help page with its description, details about its parameters, usage, examples and so on. Rembember you can access it writting ? and the function name (or ?library::function) in your console (example: ?iMKT::standardMKT).

This vignette focuses on the first category of functions: "Calculation of MKT-derived methods". Specifically, it contains examples of each function using the sample data described above. For details about functions from "iMKT using PopFly and PopHuman data" category check the corresponding vignette. The functions from the third category are used within other functions and do not produce analyses output.

Performing MKT analyses

The diverse functions from this category have two common input parameters which are required to perform the corresponding test:

\begin{itemize} \item daf: data frame containing DAF, Pi and P0 values (myDafData) \item divergence: data frame containing divergent and analyzed sites for selected (i) and neutral (0) classes (myDivergenceData) \end{itemize}

The output of each function always contains the corresponding alpha estimate, together with specific values and details of the selected methodology.

Standard MKT

Brief theoretical description about MKT

The standardMKT() function uses daf and divergence input parameters and returns as output a list containing:

\begin{itemize} \item alpha symbol: estimate of alpha using the standard MKT \item Fisher exact test P-value: p-value obtained using the Fisher exact test on a 2x2 contingency table (MKT table) \item MKT table: table containing the number of polymorphic and divergent sites for neutral and selected classes. \item Divergence metrics: table containing estimates of Ka, Ks, omega, omegaA, omegaD. \end{itemize}

standardMKT(daf=myDafData, divergence=myDivergenceData)

FWW correction

Alpha estimates can be biased by the segregation of slighlty deleterious substitutions. One method to partially controll this effect is to remove low frequency polymorphisms from the analysis, as proposed by Fay et al. (2001). Using this correction, all polymorphic sites from both neutral and selected classes which have a derived allele frequency lower than the pre-defined cutoff are removed for the calculation of alpha.

The FWW() function uses daf and divergence input parameters, along with a default list of cutoffs (0, 0.05, 0.1) and returns as output a list containing:

\begin{itemize} \item Results: alpha estimates (and their associated Fisher exact test P-value) for each cutoff. \item Divergence metrics: global metrics (Ka, Ks, omega) and estimates by cutoff (omegaA, omegaD) \item MKT tables: tables containing the number of polymorphic and divergent sites for neutral and selected classes for each cutoff. \end{itemize}

FWW(daf=myDafData, divergence=myDivergenceData)

By default the argument listCutoffs uses a list of cutoffs with the following values: 0, 0.05, 0.1. Moreover, the function has an optional argument, plot, which is set to FALSE by default. This parameters can be customized, like in the following example, where we use a list of 4 cutoffs (0.05, 0.15, 0.25, 0.35) and set the plot argument to TRUE.

The output in this case contains a Graph which shows the adaptation value (alpha) obtained using each cutoff.

FWW(daf=myDafData, divergence=myDivergenceData, listCutoffs=c(0.05, 0.15,0.25,0.35), plot=TRUE)

DGRP correction

To take adaptive and slightly deleterious mutation mutually into account, Pn , the count of segregating sites in the non-synonymous class, should be separated into the number of neutral variants and the number of weakly deleterious variants, Pn = Pn(neutral) + Pn(weakly del.). If both numbers are estimated, adaptive and weakly deleterious selection can be evaluated independently.

Consider a pair of 2×2 contingency tables. The first one corresponds to the standard MKT table with the theoretical counts of segregating sites and divergent sites for each cell.

The second table contains the count of Pn and Ps for two-frequency categories: below and over a threshold cutoff.

Add brief explanation about 2nd table!

To estimate alpha from the standard MKT table correcting by the segregation of weakly deleterious variants, we have to substitute the Pn by the expected number of neutral segregating sites, Pn(neutral). The correct estimate of alpha is then alpha = 1 - (Pn (neutral)/Ps)(Ds/Dn).

The DGRP() function behaves similar to the FWW() function. It takes the same input argument and returns the same output but containing also estimates on the fractions of negative selection (d: strongly deleterious, f: neutral and b: weakly deleterious).

DGRP(daf=myDafData, divergence=myDivergenceData)

Again, by default the argument listCutoffs uses a list of cutoffs with the following values: 0, 0.05, 0.1, and the argument plot is set to FALSE. This parameters can be customized, like in the following example, where we use a list of 4 cutoffs (0.05, 0.15, 0.25, 0.35) and set the plot argument to TRUE.

The output in this case contains two Graphs which show the adaptation value (alpha) and the negative selection fractions obtained using each cutoff.

DGRP(daf=myDafData, divergence=myDivergenceData, listCutoffs=c(0.05, 0.15,0.25,0.35), plot=TRUE)

Asymptotic MKT

Petrov reference + explanation

This function is adapted from the code developed in "Haller BC, Messer PW. asymptoticMK: A Web-Based Tool for the Asymptotic McDonald-Kreitman Test. G3 (Bethesda). 2017 May 5;7(5):1569-1575", stored in: http://github.com/MesserLab/asymptoticMK. The main adaptation we did is that the function presented here only fits an exponential model, removing the linear fitting performed initially in the cases where it was not possible to fit an asymptotic curve.

The asymptoticMKT() function uses the common daf and divergence parameters along with two arguments which define the lower and higher limit for the asymptotic alpha fit (xlow and xhigh). These two optional parameters are set to 0 and 1 by default, although it is recommended to use a higher limit of 0.9 in order to remove possible biases due to polarization error.

The function's output is a table with: the model type (exponential) along with the fitted function values (a, b, c), the asymptotic alpha estimate with its corresponding lower and higher confidence interval values, and the original alpha estimate (using the standard MKT methodology and the polymorphic sites within the xlow and xhigh cutoffs).

asymptoticMKT(daf=myDafData, divergence=myDivergenceData, xlow=0, xhigh=0.9)

iMKT

The integrative MKT combines the approach developed by Messer & Petrov (2013) to estimate the fraction of adaptive substitutions ($\alpha$) and an adaptation of the theoretical framework established by Mackay et al. (2012) to quantify the fraction of putatively selected sites that are under purifying selection pressures.

The iMKT() function takes the default input parameters (daf and divergence), the xlow and xhigh arguments (presented in the asymptoticMKT() function) and it also has the optional argument plot, set as FALSE by default. However, in this example we use plot=TRUE to display the graphical results.

The output of the function contains: \begin{itemize} \item Asymptotic MK table: table corresponding to the asymptoticMKT() function output. \item Fractions of sites: negative selection fractions (d: strongly deleterious, f: neutral and b: weakly deleterious). \item Graphs: 3 plots showing: (A) the distribution of alleles frequencies for neutral and selected sites, (B) the asymptotic alpha estimate with xlow, xhigh, original alpha and asymptotic alpha marks, and (C) the negative selection fractions. \end{itemize}

iMKT(daf=myDafData, divergence=myDivergenceData, xlow=0, xhigh=0.9, plot=TRUE)

Conclusion: comparison of the different MKT estimates

The following table includes the estimates of the rate of adaptive evolution ($\alpha$) obtained using the same data and each of the MKT methods included in the iMKT package. Results for FWW and DGRP corrections were produced using two different cut-offs (0.05 and 0.1).

results <- data.frame("Standard"=0.2365, "FWW_0.05"=0.5409, "FWW_0.1"=0.5798, "DGRP_0.05"=0.4249, "DGRP_0.1"=0.4126, "asymptotic_iMKT"=0.6259)
knitr::kable(results, align="c")
rm(results)

We observe that the Standard MKT tends to underestimate the true rate of adaptation, probably due to the presence of weakly deleterious mutations which reduce the power of the test to detect adaptive evolution.

Besides, we observe an increase of α when using any of the other methods Comment results and strengths and weakness of each method. Specifically, FWW correction seems to perform better than DGRP. Finally, the asymptotic method allows to estimate the largest rate of adaptive evolution. As we are analyzing a complete chromosome arm with a large number of segregating sites (in all DAF categories), the asymptotic test is able to remove almost all weakly deleterious mutations from the analysis and thus, we obtain the α estimates closest to the true level of adaptation.

sergihervas/iMKT documentation built on May 3, 2019, 1:49 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com