In sergihervas/iMKT: Integrative McDonald and Kreitman Test

knitr::opts_chunk$set(
    echo = TRUE,
    fig.align = "center",
    fig.height = 7,
    fig.width = 7,
    collapse = TRUE,
    comment = "#>"
)

The improvement of NGS technologies that lead to the sequencing of hundreds or even thousands of individuals from the same species provides us with the necessary data on which to empirically test evolutionary theoretical hypotheses. The availability of such datasets increases the request of developing high performance specific tools able to deal and analyze this information.

The PopFly (Hervas et al. 2017) and PopHuman (Casillas et al. 2018) genome browsers are the two central repositories of population genomics estimates for D. melanogaster and Homo sapiens model species, respectively. Thus, we decided to provide functionalities to directly link the iMKT software to both databases, which allows retrieving and and analyzing population genomics information in a single step.

Therefore, iMKT package includes some functions which allow an easy retrieval and analysis of population genetics information stored in these genome browsers. Specifically, the functions permit the download of population genetics parameters computed for every gene annotation in several populations for both model species.

This vignette is divided in two main sections:

\begin{itemize} \item Loading the row data \item Performing iMKT analyses \end{itemize}

Although the examples presented here focus only on PopFly data and analyses, the same process could be performed using human data and functions (PopHumanData, loadPopHuman(), PopHumanAnalysis()), following the same steps described in this document. Recombination rate values included in human data were retrieved from Bhérer et al. 2017 Nature Commun. and correspond to the sex average estimates.

Loading the row data

The first step is to load the information into your working enviroment. This would allow a manual examination of the data before starting any alaysis. However, to speed analyses, this step can be skipped and the main data object is loaded into the workspace the first time that the corresponding analysis function is called. To download PopFly data use the loadPopFly() function, without any argument.

Keep in mind that you are downloading full-genome information of several populations (13,745 gene annotations for 16 populations in D.melanogaster and 18,145 gene annotations for 26 populationsin H. sapiens), and this procces could take a while.

Once the function finishes,the PopFlyData object is loaded into the workspace. This object can be manually examined or used when performing iMKT analyses.

## Load the iMKT library
library(iMKT)

## Load PopFlyData
loadPopFly()

## Check data object
ls()
names(PopFlyData)

Each row of the PopFlyData dataframe contains information regarding one gene annotation in one single population. Metrics for each gene contain information about the number of segrating, divergent and analyzed positions, and the Derived Allele Frequency distribution (DAF) for neutral (4fold) and putatively selected (0fold) sites; together with some neutrality tests statistics (Standard MKT, Direction of Selection, Ka/Ks) and gene-associated recombination rate estimates (retrieved from Comeron et al. 2012 Plos Genetics).

Once the data is loaded into the workspace, diverse iMKT analyses can be performed using the function PopFlyAnalysis().

Performing iMKT Analyses

The PopFlyAnalisys() function allows performing any MK test using a subset of PopFly data defined by custom genes and populations lists. It uses the previously loaded dataframe (PopFlyData). In addition to the genes and populations lists, the function also has the following parameters:

recomb: group genes according to recombination values (TRUE/FALSE)
bins: number of recombination bins to compute (mandatory if recomb=TRUE)
test: which test to perform. Options include: standardMKT (default), DGRP, FWW, asymptoticMKT, iMKT
xlow: lower limit for asymptotic alpha fit (default=0)
xhigh: higher limit for asymptotic alpha fit (default=1)
plot: report plot (optional, default=FALSE)

Custom genes must be listed using FlyBase IDs (FBgn...), and the available populations from PopFly are: AM, AUS, CHB, EA, EF, EG, ENA, EQA, FR, RAL, SA, SD, SP, USI, USW, ZI.

Hence, using the parameters recomb and bins, the function allows deciding whether to analyze genes groupped by recombination bins or not.

Example 1. DGRP analysis without recombination

In this first example, the analysis is focused in two genes and 2 populations, without considering gene's recombination context, and using DGRP methodology.

The function groups polymorphism and divergence values of the custom genes, creating a new "concatenated gene" for each population of interest and performs the test defined. Then, it returns a list of lists with the default test output (DGRP in this case) for each population (RAL and ZI).

Explain output of DGRP function.

PopFlyAnalysis(genes=c("FBgn0000055","FBgn0003016"), pops=c("RAL","ZI"), recomb=F, test="DGRP", plot=TRUE)

Example 2. iMKT analysis using recombination bins

In this second example, genes from RAL population are groupped in 2 recombination bins, using recombination values from (Comeron et al. 2012). The test used is iMKT, with xlow and xhigh values set to 0 and 0.9, respectively. We are analyzing the complete set of genes located in the chromosome 2R ($n=2,822$)

In this case, the function creates two different "concatenated" genes containing the same number of genes each (1,411 in this case), grouping again polymorphism and divergence values. These aggrupations are made according to the gene associated recombination rate estimates. The function returns a list of lists with the default test output (iMKT in this case) for each population (RAL) and recombination bin (1 and 2).

The output of the function is a list that contains:

\begin{itemize} \item Asymptotic MK table: table including information about the model type (exponential) along with the fitted function values (a, b, c), the α asymptotic estimate with its corresponding lower and higher confidence interval values, and the α original estimate (using the standard MKT methodology and the polymorphic sites within the xlow and xhigh cutoffs).

\item Fractions of sites: negative selection fractions ($d$: strongly deleterious, $f$: neutral and $b$: weakly deleterious).

\item Graphs: 3 plots showing: (A) the distribution of alleles frequencies (DAF) for neutral and selected sites, (B) the adaptation values ($\alpha$) for each DAF category along with the function fit, the α asymptotic and α original estimates and the limits used for function fitting and adaptation values calculation, and (C) the negative selection fractions. \end{itemize}

geneList <- as.vector(unique(PopFlyData[PopFlyData$Chr=="2R",]$Name))
PopFlyAnalysis(genes=geneList , pops="RAL", recomb=T, bins=2, test="iMKT", xlow=0, xhigh=0.9, plot=TRUE)
rm(geneList)