knitr::opts_chunk$set(tidy=FALSE, cache=TRUE,
                      dev="png",
                      message=FALSE, error=FALSE, warning=TRUE)

Current version: Development - r packageVersion("QTLseqr")

Introduction

Citations

If you use QTLseqr in published research, please cite:

Mansfeld B.N. and Grumet R, QTLseqr: An R package for bulk segregant analysis with next-generation sequencing The Plant Genome doi:10.3835/plantgenome2018.01.0006

We also recommend citing the paper for the corresponding method you work with.

QTL-seq method:

Takagi, H., Abe, A., Yoshida, K., Kosugi, S., Natsume, S., Mitsuoka, C., Uemura, A., Utsushi, H., Tamiru, M., Takuno, S., Innan, H., Cano, L. M., Kamoun, S. and Terauchi, R. (2013), QTL-seq: rapid mapping of quantitative trait loci in rice by whole genome resequencing of DNA from two bulked populations. Plant J, 74: 174–183. doi:10.1111/tpj.12105

G prime method:

Magwene PM, Willis JH, Kelly JK (2011) The Statistics of Bulk Segregant Analysis Using Next Generation Sequencing. PLOS Computational Biology 7(11): e1002255. doi.org/10.1371/journal.pcbi.1002255

Quick Start

Here are the basic steps required to run and plot QTLseq and $G'$ analysis

# download and install the devtools package
install.packages("devtools")

# download the QTLseqr package from GitHub
devtools::install_github("bmansfeld/QTLseqr")
#load the package
library("QTLseqr")

#Set sample and file names
HighBulk <- "SRR834931"
LowBulk <- "SRR834927"
file <- "SNPs_from_GATK.table"

#Choose which chromosomes will be included in the analysis (i.e. exclude smaller contigs)
Chroms <- paste0(rep("Chr", 12), 1:12)

#Import SNP data from file
df <-
    importFromGATK(
        file = file,
        highBulk = HighBulk,
        lowBulk = LowBulk,
        chromList = Chroms
    )

#Filter SNPs based on some criteria
df_filt <-
    filterSNPs(
        SNPset = df,
        refAlleleFreq = 0.20,
        minTotalDepth = 100,
        maxTotalDepth = 400,
        minSampleDepth = 40,
        minGQ = 99
    )

#Run G' analysis
df_filt <- runGprimeAnalysis(SNPset = df_filt,
                             windowSize = 1e6,
                             outlierFilter = "deltaSNP")

#Run QTLseq analysis
df_filt <- runQTLseqAnalysis(
    SNPset = df_filt,
    windowSize = 1e6,
    popStruc = "F2",
    bulkSize = c(300, 450),
    replications = 10000,
    intervals = c(95, 99)
)

#Plot
plotQTLStats(
    SNPset = df_filt,
    var = "Gprime",
    plotThreshold = TRUE,
    q = 0.01
)

plotQTLStats(
    SNPset = df_filt,
    var = "deltaSNP",
    plotIntervals = TRUE)

#export summary CSV
getQTLTable(
    SNPset = df_filt,
    alpha = 0.01,
    export = TRUE,
    fileName = "my_BSA_QTL.csv"
)

Standard workflow

Installation

Let's install and load the QTLseqr package:

#Install step if you have not done so yet:
#install.packages("devtools")

devtools::install_github("bmansfeld/QTLseqr")
library("QTLseqr")

Great! We now need to load some data. QTLseqr has with a sister package Yang2013data (derived from Yang et al. (2013)], that has some trial data you can play with.

Input data

QTLseqr supports importing data either from table file exported from the VariantsToTable function built in to GATK or a delimited file containing allele read depths for each bulk. The two available functions are importFromGATK and importFromTable.

Both functions import the SNP data and perform some preliminary calculations to find the following:

$$Reference\ allele\ frequency = \frac{Ref\ allele\ depth_{HighBulk} + Ref\ allele\ depth_{LowBulk}}{Total\ read\ depth\ for\ both\ bulks}$$

$$SNP\text{-}index_{per\ bulk} = \frac{Alternate\ allele\ depth}{Total\ read\ depth}$$

$$\Delta (SNP\text{-}index) = SNP \text{-} index_{High Bulk} - SNP\text{-}index_{Low Bulk}$$

To demonstrate the use of the import functions we will first load the Yang et al. (2013) data file. We first need to download the package that contains the data from github.

#download and load data package (~50Mb)
devtools::install_github("bmansfeld/Yang2013data")
library("Yang2013data")

#Import the data
rawData <- system.file(
    "extdata", 
    "Yang_et_al_2013.table", 
    package = "Yang2013data", 
    mustWork = TRUE)

If you have your own data you can simply refer to it directly:

rawData <- "C:/PATH/TO/MY/DIR/My_BSA_data.table"

We define the sample name for each of the bulks. We also can define a vector of the chromosomes to be included in the analysis (i.e. exclude smaller contigs), In this case, Chr1, Chr2 ... Chr12.

HighBulk <- "SRR834931"
LowBulk <- "SRR834927"
Chroms <- paste0(rep("Chr", 12), 1:12)

Importing SNPs from GATK

Working directly with the GATK best practices guide for whole genome sequence should result in a VCF that is compatible with QTLseqr. In general the workflow suggested by GATK is per-sample variant calling followed by joint genotyping across samples. This will produce a VCF file that includes BOTH bulks, each with a different sample name (here SRR834927 and SRR834931), one SNP for example:

library(kableExtra)
x <- data.frame(CHROM = "Chr1", POS = 31071, ID = ".", REF = "A", ALT = "G", QUAL = 1390.44, FILTER = "PASS", INFO = "..\\*...", FORMAT = "GT:AD:DP:GQ:PL", SRR834927 = "0/1:34,36:70:99:897,0,855", SRR834931 = "0/1:26,22:48:99:522,0,698")
kable_styling(knitr::kable(x = x, format = "latex", booktabs = TRUE), latex_options = "scale_down")

*info column removed for brevity

GATK have provided a fast VCF parser, the VariantsToTable tool, that extracts the necessary fields for easy use in downstream analysis.

We highly recommend reading What is a VCF and how should I interpret it? for more information on GATK VCF Fields and Genotype Fields

Though the use of GATK's VariantsToTable function is out of the scope of this vignette, the syntax for use with QTLseqr should look something like this:

```{bash, eval=FALSE} java -jar GenomeAnalysisTK.jar \ -T VariantsToTable \ -R ${REF} \ -V ${NAME} \ -F CHROM -F POS -F REF -F ALT \ -GF AD -GF DP -GF GQ -GF PL \ -o ${NAME}.table

Where `${REF}` is the reference genome file and `${NAME}` is VCF file you wish to parse.

To run QTLseqr successfully, the required VCF fields `(-F)` are CHROM (Chromosome) and POS (Position). the required Genotype fields `(-GF)` are AD (Allele Depth), DP (Depth). Recommended fields are REF (Reference allele) and ALT (Alternative allele) Recommended Genotype fields are PL (Phred-scaled likelihoods) and  GQ (Genotype Quality).

#### The `ImportFromGATK` function

The `importFromGATK` function imports SNP data from the output of the VariantsToTable function in GATK. After importing the data, the function then calculates total reference allele frequency for both bulks together, the SNP index for each bulk, and the $\Delta (SNP\text{-}index)$.

We then use the `importFromGATK` function to import the raw data. After importing the data, the function then calculates total reference allele frequency for both bulks together, the $SNP\text{-}index$ for each SNP in each bulk and the $\Delta (SNP\text{-}index)$ and returns a data frame.

Let's import:
```r
#import data
df <-
    importFromGATK(
        file = rawData,
        highBulk = HighBulk,
        lowBulk = LowBulk,
        chromList = Chroms
     )

Importing from a delimited file

You can also import SNPs from a csv, tsv or any other delimited file. Your file must include some necessary columns: CHROM (Chromosome names) and POS (the SNP position) as well as the reference and alternate allele depths (number of reads supporting each allele). The allele depths should be in columns named in this format: AD_\<ALT/REF>.\<sampleName>. For example, the column for alternate allele depth for a high bulk sample named "sample1", should be "AD_ALT.sample1". Any other columns describing the SNPs are allowed, ie the actual allele calls, or a quality score. If the column is Bulk specific, It should be named columnName.sampleName, i.e "QUAL.sample1".

Here is the header of an example file:

x <-
    data.frame(
    CHROM = rep("Chr1", 5),
    POS = c(31071, 31478, 33667, 34057, 35239),
    REF = c("A", "C", "A", "C", "A"),
    ALT = c("G", "T", "G", "T", "C"),
    AD_REF.SRR834927 = c(34, 34, 20, 38, 25),
    AD_ALT.SRR834927 = c(36, 52, 48, 40, 36),
    GQ.SRR834927 = rep(99, 5),
    AD_REF.SRR834931 = c(26, 40, 24, 29, 40),
    AD_ALT.SRR834931 = c(22, 34, 29, 26, 60),
    GQ.SRR834931 = rep(99, 5)
    )

kableExtra::kable_styling(knitr::kable(x = x, format = "latex", booktabs = TRUE), latex_options = "scale_down")

The importFromTable function

To load the data from a delimited file we use the importFromTable function. in our case it is a csv so it is comma delimited. The columns are denoted by the sample names for the bulks (SRR834931 and SRR834927). These names will be renamed to "HIGH" and "LOW". If you want you can set the column names as such in advance as the defaults for highBulk and lowBulk arguments are "HIGH" and "LOW", respectively. As in the importFromGATK function the chromList argument should be a string vector of the chromosomes to be used in the analysis. Useful for filtering out unwanted contigs etc.

df <- importFromTable(file = "Yang2013.csv",
                      highBulk = HighBulk,
                      lowBulk = LowBulk,
                      chromList = Chroms)

Loaded data frame

The loaded data frame should look like this. This data frame has other information about the genotype calls from GATK:

head(df)

Let's review the column headers:

Filtering SNPs

Now that we have loaded the data into R we can start cleaning it up by filtering some of the low confidence SNPs. While GATK has its own filtering tools, QTLseqr offers some options for filtering that may help reduce noise and improve results. Filtering is mainly based on read depth for each SNP, such that we can try to eliminate SNPs with low confidence, due to low coverage, and SNPs that may be in repetitive regions and thus have inflated read depth. You can also filter based on the absolute difference in read depth between the bulks. If you have your own quality columns you can always use R's base subsetting functions to filter out any SNPs you don't want.

Read depth histograms

One way to assess filtering thresholds and check the quality of our data is by plotting histograms of the read depths. We can get an idea of where to draw our thresholds. We'll use the ggplot2 package for this purpose, but you could use base R to plot as well.

Lets look at total read depth for example:

library("ggplot2")
ggplot(data = df) + 
    geom_histogram(aes(x = DP.HIGH + DP.LOW)) + 
    xlim(0,1000)

...or look at total reference allele frequency:

ggplot(data = df) +
    geom_histogram(aes(x = REF_FRQ))

We can plot our per-bulk SNP-index to check if our data is good.We expect to find two small peaks on each end and most of the SNPs should be approximiately normally distributed arround 0.5 in an F2 population. Here is the HIGH bulk for example:

ggplot(data = df) +
    geom_histogram(aes(x = SNPindex.HIGH))

Using the filterSNPs function

Now that we have an idea about our read depth distribution we can filter out low confidence SNPS. In general we recommend filtering extremely low and high coverage SNPs, either in both bulks (minTotalDepth/maxTotalDepth) and/or in each bulk separately (minSampleDepth). We have the option to filter based on reference allele frequency (refAlleleFreq), this removes SNPs that for some reason are over- or under-represented in BOTH bulks. We can also filter SNPs that have large discrepancies in read depth between the bulks(i.e. one bulk has a depth of 500 and the other has 5). Such discrepancies can throw off the G statistic. We can also use the GATK GQ score (Genotype Quality) to filter out low confidence SNPs. If the verbose parameter is set to TRUE (default) the function will report the numbers of SNPs filtered in each step.

df_filt <-
    filterSNPs(
        SNPset = df,
        refAlleleFreq = 0.20,
        minTotalDepth = 100,
        maxTotalDepth = 400, 
        depthDifference = 100,
        minSampleDepth = 40,
        minGQ = 99,
        verbose = TRUE
    )
df_filt <-
    filterSNPs(
        SNPset = df,
        refAlleleFreq = 0.20,
        minTotalDepth = 100,
        maxTotalDepth = 400,
        depthDifference = 100,
        minSampleDepth = 40,
        minGQ = 99,
        verbose = TRUE
    )

This step is quick and we can go back and plot some histograms to see if we are happy with the results, and we can quickly re-run the filtering step if not.

Running the analysis

The analysis in QTLseqr is an implementation of both pipelines for bulk segregant analysis, $G'$ and $\Delta (SNP\text{-}index)$, described by Magwene et al. (2011) and Takagi et al. (2013), respectively. We recommend reading both papers to fully understand the considerations and math behind the analysis.

There are two main analysis functions:

  1. runGprimeAnalysis - performs Magwene et al type $G'$ analysis
  2. runQTLseqAnalysis - performs Takagi et al type QTLseq analysis

A note about window sizes

QTLseqr utilizes a Nadaraya-Watson smoothing kernel to produce tricube-smoothed statistics for analysis. These smoothed statistics function as a weighted moving average across neighboring SNPs that accounts for linkage disequilibrium (LD), while minimizing noise attributed to SNP calling errors (Magwene et al., 2011). In a tricube-weighted window, SNPs that are close to the focal SNP have a high weighting value, while SNPs closer to the edge of the window have low weights. The tricube-smoothed values are predicted by constant local regression within each chromosome. The calculations are performed using the locfit function from the locfit package using a user defined window size and the degree of the polynomial set to zero.

As this window is using a tricube-smoothing kernel, the window size can be much larger than you might expect. However, in the rice examples below, we choose a window size of 1Mb for the sliding window analysis to replicate the orignal results published in Yang et al., (2013). For a discussion about window size, we recommend reading Magwene et al. (2011). In general, larger windows will produce smoother data. The functions making these calculations are rather fast, so we recommend testing several window sizes for your data, and then deciding on the optimal size.

When running either analysis functions above, some users will get an "newsplit: out of vertex space" error, coming from the locfit function. This usually happens when either the window size is set too small (it is a memory allocation error), or the organsim genome is very large and thus the default window size becomes too small. In such cases, the user may pass a higher maxk value to either analysis function (the default in locfit.raw is 100). However, if you are getting that warning, you should seriously consider increasing your window size, as it is probably too small to effectively manage the noise in your data. Please read the literature and make educated choices about your selected window size.

QTLseq analysis

Takagi et al. (2013) developed the method for QTLseq type NGS-BSA. The analysis is based on calculating the allele frequency differences, or $\Delta (SNP\text{-}index)$, from the allele depths at each SNP. To determine regions of the genome that significantly differ from the expected $\Delta (SNP\text{-}index)$ of 0, a simulation approach is used. Briefly, at each read depth, simulated SNP frequencies are bootstrapped, and the extreme quantiles are used as simulated confidence intervals. The true data are averaged over a sliding window and regions that surpass the CI are putative QTL.

When the analysis is run the following steps are performed:

  1. First the number of SNPs within the sliding window are counted.

  2. A tricube-smoothed $\Delta (SNP\text{-}index)$ is calculated within the set window size.

  3. The minimum read depth at each position is calculated and the tricube-smoothed depth is calculated for the window.

  4. The simulation is performed for data derived read depths (can be set by the user): Alternate allele frequency is calculated per bulk based on the population type and size (F2 or RIL) $\Delta (SNP\text{-}index)$ is simulated over several replications (default = 10000) for each bulk. The quantiles from the simulations are used to estimate the confidence intervals. Say for example the 99th quantile of 10000 $\Delta (SNP\text{-}index)$ simulations represents the 99% confidence interval for the true data.

  5. Confidence intervals are matched with the relevant window depth at each SNP.

Here is an example for running the analysis for an F2 population. In Yang et al. (2013), the bulks are of different sizes (385 and 430 for high and low bulk respectively), so we set bulkSize = c(385, 430). If your bulks are the same size you can simply set one value, i.e. bulkSize = 25. The simulation is bootstrapped 10000 times and the two-sided 95 and 99% confidence intervals are calculated:

df_filt <- runQTLseqAnalysis(df_filt,
    windowSize = 1e6,
    popStruc = "F2", 
    bulkSize = c(385, 430), 
    replications = 10000, 
    intervals = c(95, 99)
    )
df_filt <- runQTLseqAnalysis(
    df_filt,
    windowSize = 1e6,
    popStruc = "F2", 
    bulkSize = c(385, 430),
    replications = 10000, 
    intervals = c(95, 99)
    )

G' analysis

An alternate approach to determine statistical significance of QTL from NGS-BSA was proposed by Magwene et al. (2011) – calculating a modified G statistic for each SNP based on the observed and expected allele depths and smoothing this value using a tricube smoothing kernel. Using the smoothed G statistic, or G’, Magwene et al. allow for noise reduction while also addressing linkage disequilibrium between SNPs. Furthermore, as G’ is close to being log normally distributed, p-values can be estimated for each SNP using non-parametric estimation of the null distribution of G’. This provides a clear and easy-to-interpret result as well as the option for multiple testing corrections.

Here, we will briefly summarize the steps performed by the main analysis function, runGprimeAnalysis.

The following steps are performed:

  1. First the number of SNPs within the sliding window are counted.

  2. A tricube-smoothed $\Delta (SNP\text{-}index)$ is calculated within the set window size.

  3. Genome-wide G statistics are calculated by getG. $G$ is defined by the equation:

    $$G = 2 * \sum n_i * ln(\frac{obs(n_i)}{exp(n_i)})$$

    Where for each SNP, $n_i$ from i = 1 to 4 corresponds to the reference and alternate allele depths for each bulk, as described in the following table:

    |Allele|High Bulk|Low Bulk| |------|---------|--------| |Reference| $n_1$ | $n_2$ | |Alternate| $n_3$ | $n_4$ |

    ...and $obs(n_i)$ are the observed allele depths as described in the data frame. getG calculates the G statistic using expected values assuming read depth is equal for all alleles in both bulks: $$ exp(n_1) = \frac{(n_1 + n_2)(n_1 + n_3)}{(n_1 + n_2 + n_3 + n_4)} $$ $$ exp(n_2) = \frac{(n_2 + n_1)(n_2 + n_4)}{(n_1 + n_2 + n_3 + n_4)} $$ $$ exp(n_3) = \frac{(n_3 + n_1)(n_3 + n_4)}{(n_1 + n_2 + n_3 + n_4)} $$ $$ exp(n_4) = \frac{(n_4 + n_2)(n_4 + n_3)}{(n_1 + n_2 + n_3 + n_4)} $$

  4. G' - A tricube-smoothed G statistic is predicted by constant local regression within each chromosome using the tricubeStat function. This works as a weighted average across neighboring SNPs that accounts for Linkage disequilibrium (LD) while minimizing noise attributed to SNP calling errors. G values for neighboring SNPs within the window are weighted by physical distance from the focal SNP.

  5. P-values are estimated based using the non-parametric method described by Magwene et al. 2011 with the function getPvals. Briefly, using the natural log of $G'$ a median absolute deviation (MAD) is calculated. The $G'$ set is trimmed to exclude outlier regions (i.e. QTL) based on Hampel's rule. An alternate method for filtering out QTL that we propose is using absolute $\Delta (SNP\text{-}index)$ values greater than a set threshold (default = 0.1) to filter out potential QTL. An estimation of the mode of the trimmed set is calculated using the mlv function from the package modeest. Finally, the mean and variance of the set are estimated using the median and mode and p-values are estimated from a log normal distribution.

  6. Negative Log10- and Benjamini-Hochberg adjusted p-values are calculated using p.adjust.

Let's run the function:

df_filt <- runGprimeAnalysis(df_filt,
    windowSize = 1e6,
    outlierFilter = "deltaSNP",
    filterThreshold = 0.1)
df_filt <- runGprimeAnalysis(df_filt,
    windowSize = 1e6,
    outlierFilter = "deltaSNP",
    filterThreshold = 0.1)

Some additional columns are added to the filtered data frame:

head(df_filt)

Plotting the data

QTLseqr offers two main plotting functions to check the validity of the $G'$ analysis and to plot genome-wide or chromosome specific QTL analysis plots.

G' distribution plots

Due to the fact that p-values are estimated from the null distribution of $G'$, an important check is to see if the null distribution of $G'$ values is close to log normally distributed. For this purpose we use the plotGprimeDist function, which plots the $G'$ histograms of both raw and filtered $G'$ sets (see P-value calculation above) alongside the log-normal null distribution (which is reported in the legend). We can also use this to test which filtering method (Hampel or DeltaSNP) estimates a more accurate null distribution. If you use the "deltaSNP" method plotting $G'$ distributions with different filter thresholds might also help reveal a better $G'$ null distribution.

plotGprimeDist(SNPset = df_filt, outlierFilter = "Hampel")
plotGprimeDist(SNPset =df_filt, outlierFilter = "deltaSNP", filterThreshold = 0.1)

QTL analysis plots

Now that we are happy with our filtered data and it seems that the $G'$ distribution is close to log-normal, we can finally plot some genome-wide figures and try to identify QTL.

Let's start by plotting the SNP/window distribution:

p1 <- plotQTLStats(SNPset = df_filt, var = "nSNPs")
p1

This is informative as we can assess if there are regions with extremely low SNP density.

More importantly lets identify some QTL by plotting the smoothed $\Delta (SNP\text{-}index)$ and $G'$ values. If we've performed QTLseq analysis we can also set plotIntervals to TRUE and plot the confidence intervals to identify QTL using that method.

p2 <- plotQTLStats(SNPset = df_filt, var = "deltaSNP", plotIntervals = TRUE)
p2

We can see that there are some regions that have $\Delta (SNP\text{-}index)$ that pass the confidence interval thresholds, and are putative QTL. The directionality of the $\Delta (SNP\text{-}index)$ is also important for $G'$ analysis. If the allele contributing to the trait is from the reference parent the $\Delta (SNP\text{-}index)$ should be less than 0. However, if the $\Delta (SNP\text{-}index) > 0$ then the contributing parent is the one with the alternate alleles.

Let's look at the $G'$ values to see if these regions are significant and pass the FDR (q) of 0.01.

p3 <- plotQTLStats(SNPset = df_filt, var = "Gprime", plotThreshold = TRUE, q = 0.01)
p3

Great! It looks like there are QTL identified on Chromosomes 1, 2, 5, 8 and 10. Based on the $\Delta (SNP\text{-}index)$ and $G'$ plots the QTL from Chr1 originates from the reference parent (Nipponbare rice, in this case) and the QTL on Chr8 was contributed by the other parent, for example.

We can also use the plotQTLStats function to the $-log_{10}(p\text{-}value)$. While this number is a direct derivative of $G'$ it can be more self explanatory for some. We can use the subset parameter to plot one or a few of the chromosomes, say for a close up figure of a QTL of interest. Here we look at the $-log_{10}(p\text{-}value)$ plots of Chromosomes 1 and 8:

QTLplots <- plotQTLStats(
    SNPset = df_filt, 
    var = "negLog10Pval", 
    plotThreshold = TRUE, 
    q = 0.01, 
    subset = c("Chr1", "Chr8")
    )
QTLplots

Extracting QTL data

Now that we've plotted and identified some putative QTL we can extract the data using two functions getSigRegions and getQTLTable.

Extracting significant regions

The getSigRegions function will produce a list in which each element represents a QTL region. The elements are subsets from the original data frame you supplied. Any contiguous region above with an adjusted p-value above the set alpha will be returned. If there is a dip below the alpha this region will be split to two elements.

Let's examine the head of the first QTL:

QTL <- getSigRegions(SNPset = df_filt, alpha = 0.01)
head(QTL[[1]])

Output QTL summary

While getSigRegions is useful for examining every SNP within each QTL and perhaps for some downstream analysis, the getQTLTable will summarize those results and can output a CSV by setting export = TRUE and fileName = "MyQTLsummary.csv". We can set method as either "Gprime" or "QTLseq" depending on the type of analysis; "Gprime" will use alpha as FDR threshold and "QTLseq" will use the interval parameter, which should match one of the intervals calculated above.

Here is the summary for significant regions with a FDR of 0.01:

results <- getQTLTable(SNPset = df_filt, method = "Gprime",alpha = 0.01, export = FALSE)
results

The columns are:

Summary

We've reviewed how to load SNP data from GATK and filter the data to contain high confidence SNPs. We then performed $\Delta (SNP\text{-}index)$ and $G'$ analysis and calculate p-values and q-values based on the tricube-smoothed $G'$ values. The QTL regions that pass our defined threshold can be stored as a list for further analysis or summarized as a table for publication.

Session info

sessionInfo()


bmansfeld/QTLseqr documentation built on Jan. 24, 2020, 3:56 p.m.