knitr::opts_chunk$set( out.width = '80%', dpi = 300, collapse = TRUE, comment = "#>" )
An implementation of Approximate Bayesian Computation (ABC) methods applied to pooled sequencing (Pool-seq) data is available in R language in the package poolABC
. The purpose of this vignette is to provide an in-depth overview of the capabilities of the package, highlighting the usage of its main functions.
library(poolABC)
The initial sections of this vignette detail how to import pooled sequencing data from files in the _rc
format. We also show how to randomly select multiple subsets of the observed data, compute summary statistics for those subsets and use those summary statistics as target for parameter estimation and model selection with ABC.
This vignette also teaches users how to simulate Pool-seq data under pre-defined models. Note that the simulation of Pool-seq data requires functions included in the poolHelper
package. We then exemplify how the imported data and the simulated data can be used to perform parameter inference. The poolABC
package allows the use of genome-wide multilocus data for ABC by using multiple subsets of simulated and observed loci.
Briefly, we obtain one set of summary statistics for each set of simulated loci and for each random subset of observed data. Each set of summary statistics computed from a unique subset of observed data is used as an independent target for parameter estimation. Thus, with the poolABC
package, users obtain one posterior distribution, for each parameter of interest and for each subset of observed loci. Then, our package allows users to combine those multiple posteriors to obtain a single estimate per parameter. This merging is performed with the Epanechnikov kernel and weighting according to the distance between the mean summary statistics of a subset of loci and the mean across the genome, giving more weight to subsets of loci with a mean closer to the overall mean. Finally, the poolABC
package also includes functions to compute point estimates and produce plots of those merged posterior distributions.
This package uses pooled sequencing data stored on _rc
format files. These _rc
files are created by running the SNP-frequency-diff.pl
function of popoolation2
. Briefly, this is an example of a typical _rc
file with only two populations:
data.frame(chr = c("NC297", "NC297"), pos =c(3530, 5450), rc = c("A", "T"), allele_count = 2, allele_states = c("A/G", "T/A"), deletion_sum = 0, snp_type = "pop", major_alleles = c("AA", "TT"), minor_alleles = c("GG", "AN"), maa_1 = c("54/55", "51/54"), maa_2 = c("76/78", "96/96"), mia_1 = c("1/55", "3/54"), mia_2 = c("2/78", "0/96"))
More information about these files can be found at: https://sourceforge.net/p/popoolation2/wiki/Main/
If you have your data in _rc
files in a folder of your computer, you can simply use the importContigs
function.
# load multiple files and organize information by contig files <- importContigs(path = "/home/myawesomedata", pops = c(1, 4, 7, 10))
The path
input of this function indicates the path to the folder where the _rc
files are located. By default, the importContigs
function will import all files present in the folder that include the _rc
pattern in their name. The index of populations to import is defined by the pops
input argument. For instance, the input pops = c(1, 4, 7, 10)
will import the major and minor allele for the first, fourth, seventh and tenth population in the _rc
files.
The importContigs
function also includes several optional input arguments. The files
input argument allows you to specify the index of the files to import. For instance, by setting files = 1:5
, only the first five files listed in the output of list.files(path = path)
will be imported. Additionally, specific contigs can be removed from the data by adding their names to the remove
input argument.
The min.minor
input argument allows you to filter the data by the number of minor allele reads. For instance, if min.minor = 2
all sites where the total number of minor allele reads across all populations of the pops
input argument is below 2, will be removed from the data. Alternatively, by setting filter = TRUE
, you can filter the data by the frequency of the minor allele. When filter = TRUE
, the user can define a threshold
for the minimum allowed frequency of the minor allele. If no threshold
is defined, the importContigs
function will assume that at least one minor allele read per site should exist. Finally, it is possible to include an header when importing the data. This header can be created with the createHeader
function.
Random windows of a given size (in base pairs) can be selected from the imported data with the pickWindows
function. The data imported with the importContigs
function is a list with all the elements required for the pickWindows
function. You can assign each of those list elements to an individual object or use them directly as input argument of the pickWindows
function.
# randomly select blocks of a given size from several contigs blocks <- pickWindows(freqs, positions, range, rMajor, rMinor, coverage, window = 1000, nLoci = 100)
With this function, users can randomly select a subset of the complete pooled sequencing data at their disposal. More specifically, the pickWindows
function allows users to randomly select nLoci
blocks of window
size (in base pairs) from the data imported in the previous section. In other words, this function will randomly select select nLoci
contigs and then select one random block with a user defined size (defined by the window
input) per contig.
The next step is the computation of a set of summary statistics from the observed data. To compute summary statistics from the observed data, we can use the statsContig
function. This function will compute the same set of summary statistics used in the simulations from the multiple random subset of loci obtained in the previous step.
# compute a set of observed summary statitstics obs <- statsContig(randomWindows = blocks, nPops = 4, stat.names = stat.names)
Note that we are using the blocks
object created with the pickWindows
function as the randomWindows
input argument. The statsContig
function will compute summary statistics from those randomly selected blocks of observed data. Also, the use of names for the summary statistics is strongly recommended. To ensure that the set of observed summary statistics is named, we should obtain the name of the simulated summary statistics and include those in the stat.names
input argument of the statsContig
function.
With this package, you also have the ability to simulate pooled sequencing data under three different models by using the poolSim
function. The model
input argument allows the user to define which model to simulate. At the moment, this package includes three different models: an isolation with migration model with two populations, a model representing a single origin of two divergent ecotypes and a third model representing a parallel origin of those ecotypes.
To simulate data using the two populations model, you have to define the mean depth of coverage and the variance of the coverage for those two populations. You also need to create a list with the number of individuals per pool and per population. In the next chunk, you can see how to simulate data using this model:
# set the mean and variance of the coverage sMean <- c(84.34, 66.76); sVars <- c(1437.22, 912.43) # create a list containing the information of the pool sizes by population size <- rep(list(rep(10, 10)), 2) # run simulation for a two-populations model sims <- poolSim(model="2pops", nDip=200, nPops=2, nLoci=100, nSites=2000, mutrate=2e-8, size=size, mean=sMean, variance=sVars, minimum=15, maximum=180, min.minor=1, Nref=c(25000, 25000), ratio=c(0.1, 3), pool=c(5, 180), seq=c(0.0001, 0.001), split=c(0, 3), CW=c(1e-13, 1e-3), WC=c(1e-13, 1e-3), bT=c(0, 0.5))
The poolSim
function requires several input arguments, that are explained in detail in the help page of the function. However, note that most of those input arguments define the minimum and maximum values for a variety of relevant parameters. To simulate data using a four populations model:
# set the mean and variance of the coverage sMean <- c(84.34, 66.76, 65.69, 68.83); sVars <- c(1437.22, 912.43, 848.02, 1028.23) # create a list containing the information of the pool sizes by population size <- rep(list(rep(5, 20)), 4) # run simulation for a four-populations model sims <- poolSim(model="Single", nDip=400, nPops=4, nLoci=100, nSites=2000, mutrate=2e-8, size=size, mean=sMean, variance=sVars, minimum=15, maximum=180, min.minor=2, Nref=c(25000, 25000), ratio=c(0.1, 3), pool=c(5, 180), seq=c(0.0001, 0.001), split=c(0, 3), CW=c(1e-13, 1e-3), WC=c(1e-13, 1e-3), CC=c(1e-13, 1e-3), WW=c(1e-13, 1e-3), ANC=c(1e-13, 1e-3), bT=c(0, 0.5), bCW=c(0, 0.5), bWC=c(0, 0.5))
The poolSim
function can be used to perform a single simulation. However, most of the times, you will want to perform thousands of simulations. One way to accomplish this is to use replicate
function together with our poolSim
function. We recommend that you do the following:
# set the mean and variance of the coverage sMean <- c(84.34, 66.76); sVars <- c(1437.22, 912.43) # create a list containing the information of the pool sizes by population size <- rep(list(rep(5, 20)), 2) # define how many simulations to run nSims <- 10 # run one batch of simulations sims <- t(replicate(n = nSims, unlist(poolSim(model="2pops", nDip=200, nPops=2, nLoci=100, nSites=1000, mutrate=2e-8, size=size, mean=sMean, variance=sVars, minimum=20, maximum=185, min.minor=2, Nref=c(25000, 25000), ratio=c(0.1, 3), pool=c(5, 180), seq=c(0.0001, 0.001), split=c(0, 3), CW=c(1e-13, 1e-3), WC=c(1e-13, 1e-3), bT=c(0, 0.5)))))
By using the replicate
function, you can perform multiple simulations. By unlisting and then transposing the output of those simulations, you obtain a matrix where each row corresponds to a different simulation and each column is a different parameter or summary statistic.
The observed summary statistics computed in the previous sections and the simulations performed in the previous one can then be used to perform parameter estimation with Approximate Bayesian Computation (ABC).
We included with this package a small dataset simulated under the two populations model. This includes one matrix (sumstats
) with the summary statistics computed from the simulated data, one matrix (params
) with the simulated parameter values and a final matrix (limits
) with the minimum and maximum value of the prior distribution for each parameter.
# load the data included in the package data("sumstats"); data("params"); data("limits")
The poolABC
package aims at streamlining the process of parameter inference with Pool-seq data. One of the key components of that design is the ABC
function.
By using this function, users can simultaneously perform parameter estimation with ABC for multiple targets. The ABC
function requires the data, imported with the importContigs
function and then uses both the pickWindows
and statsContig
functions to select multiple random subset of loci from the observed data and compute a set of observed summary statistics for each of those subsets. Thus, for each subset of loci we obtain a vector of summary statistics and each vector acts as an independent target for parameter estimation. The ABC
function can be used by doing:
# parameter estimation with ÃBC function myabc <- ABC(nPops = 2, ntrials = 10, freqs = freqs, positions = positions, range = range, rMajor = rMajor, rMinor = rMinor, coverage = coverage, window = 1000, nLoci = 100, limits = limits, params = params, sumstats = sumstats, tol = 0.01, method = "regression")
The ntrials
input argument defines the number of independent targets for parameter estimation. In this example, we are performing parameter inference for 10 different targets. Each of those targets was obtained by computing summary statistics from windows of 1000 base pairs (window = 1000
) from 100 (nLoci = 100
) randomly selected contigs of the observed data.
Note that you should also define the method
and tolerance rate, tol
, to use. The tol
is defined as the percentage of accepted simulation. You should strive to keep a low tolerance rate, to avoid accepting simulations that are too distant from the observed data, but it is also important to avoid very stringent tolerance rates that may lead to few accepted values. A typical value of tol = 0.01
or tol = 0.05
is recommended but you should test different tol
values in the cross-validation analysis (see more about this in subsequent sections).
This package implements two ABC algorithms for constructing the posterior distribution from the accepted simulations: a rejection method and a regression-based correction using a local linear regression. When method
is "rejection", simulations are accepted if the Euclidean distance between the set of summary statistics computed from the simulated data and the target is sufficiently small and these accepted simulations are considered a sample from the posterior distribution. When method
is "regression", an additional step is used to correct for the imperfect match between the summary statistics computed from the simulated data and the summary statistics computed from the observed data. For this reason, we recommend that you select the regression method because it will, most often than not, lead to more precise parameter estimates.
After using the ABC
function to perform parameter estimation with Approximate Bayesian Computation for several targets, we need to merge the multiple posteriors obtained (one for each target) into a single posterior per parameter.
This can be performed with the mergepost
function. One of the required input arguments of this function is the global
input. This input should be a vector with the observed summary statistics computed from the entire dataset. We recommend that you use the pickWindows
function to select a large number of loci and then use that random selection as the input argument of the statsContig
function.
# load multiple files and organize information by contig blocks <- pickWindows(freqs = freqs, positions = positions, range = range, rMajor = rMajor, rMinor = rMinor, coverage = coverage, window = 1000, nLoci = 800) # compute a set of summary statistics from the observed data global <- statsContig(randomWindows = blocks, nPops = 2, stat.names = stat.names)
The global vector can then be used in the mergepost
function. The remaining required input arguments are the matrix with the target
for the parameter inference and the list containing the posteriors (post
) for each target and parameter. It is also possible to include the regression weights in the wtreg
option.
# merge posterior distributions myabc <- mergepost(target = myabc$target, global = global, post = myabc$adjusted, wtreg = myabc$weights)
Briefly, this function will merge the different posteriors into a single one, using different weighting methods for the merging. Details about the various elements of the mergepost
output can be found in the help page of the function. Note that the merge
, weighted
, merge_reg
and weighted_reg
entries contain, for each parameter, a locfit
object, obtained after merging the multiple posteriors using the corresponding method. The merged_stat
, weighted_stat
, merge_reg_stat
and weighted_reg_stat
are the posterior point estimates for the corresponding merging method.
Users can then plot the resulting merged posterior distribution with the plot_weighted
function. You should include your matrix with the simulated parameter values in the prior
input argument of this function to also plot the prior distribution of the chosen parameter. This allows for a comparison, in the same plot, of the prior and posterior shape.
You should include the output of the mergepost
function as the merged_posterior
input argument and a matrix with the limits
of the prior distribution for each parameter. Then, you just need to define which parameter to plot with the index
input argument.
# plot the density estimation of a parameter plot_weighted(prior = params, merged_posterior = myabc, limits = limits, index = 2)
The plot_weighted
function plots the posterior density of the chosen parameter, together with the prior distribution of the same plot.
The poolABC
package also allows users to perform model selection by estimating the posterior model probabilities, comparing two scenarios of ecotype formation: the single and the parallel origin scenario. The modelSelect
function can be used to perform model selection with ABC.
One of the required input arguments of the modelSelect
function is the index
. This is a vector of model indices that should have the same length
as nrow(sumstats)
to indicate to which model a particular row of sumstats
belongs. The remaining input arguments are explained in the help page of the function. As before, you should also define the tolerance rate (tol
) and the method
to use. A tolerance of 0.01 and the "regression" method are recommended.
# create a vecto of model indices index <- c(rep("model1", nrow(sumstats)/2), rep("model2", nrow(sumstats)/2)) # select a random simulation to act as target target <- sumstats[10, ] # perform model selection with ABC mysel <- modelSelect(target = target, index = index, sumstats = sumstats, tol = 0.1, method = "regression") # display the structure of the mysel object str(mysel)
The output of the modelSelect
function is a list with six entries. To quickly view the results of the model selection, you can use the summary_modelSelect
function. This function will provide an easy to read display of the posterior model probabilities and Bayes factors. The only required input argument of the summary_modelSelect
function is the object created with the the modelSelect
function.
# check results of model selection msel <- summary_modelSelect(object = mysel)
As you can see, by running the summary.modelSelect
function we get an output with the proportion of accepted simulation for each model under a rejection method and posterior model probabilities under the regression method. If we print the object itself
# print results of model selection
msel
we can also see the Bayes factors between pairs of models for both the rejection and the regression methods.
A fundamental part of any ABC analysis is the validation of the results obtained in the parameter estimation and model selection steps. The poolABC
package includes tools to perform cross validation for both analysis, computing prediction errors for both parameter inference and model selection.
One important component of this validation process is the calculation of prediction errors for each parameter. This allows us to evaluate the confidence of the estimates and the effect of various point estimates and/or tolerance rates.
To perform a leave-one-out cross validation for ABC, you can use the simulationABC
function. This functions requires the simulated parameter values, params
, the simulated summary statistics, sumstats
and a matrix with the limits
of the prior distributions. You should also define the size of the cross-validation sample, nval
, the tolerance rate, tol
, and the type of ABC algorithm to be applied in the method
input.
# perform an Approximate Bayesian Computation simulation study mycv <- simulationABC(params = params, sumstats = sumstats, limits = limits, nval = 100, tol = 0.01, method = "regression") # display the structure of the mycv object str(mycv, max.level = 2)
The output of the simulationABC
function is a list with three elements. Details about each list element are available in the help page of the function. A quick way to visualize the results of the leave-one-out cross validation is to plot the the cross-validation results.
The poolABC
package includes the plot_errorABC
function to allow this visual evaluation of the quality of the estimation. This function requires as input the output of the simulationABC
function. Additionally, you need to define the ABC algorithm (either "reg" for regression or "rej" for rejection) and which point estimate ("mode", "median" or "mean") to plot. You should also define which parameter to plot, by selecting the corresponding index
.
# plot the prediction errors plot_errorABC(x = mycv, method = "reg", statistic = "median", index = 1)
As you can see, this produces a plot with the true parameter value in x-axis and the estimate value of the parameter in the y-axis. Thus, the closer the points are to the diagonal line, the higher is the quality of the estimation.
It is also possible to evaluate how much confidence we should place in the model selection results by performing a leave-one-out cross validation for model selection with ABC via subsequent calls to the function modelSelect
. Briefly, several simulations from each model are selected to act as validation simulations, while the remaining simulations are used as training simulations. For each validation simulation, the function modelSelect
is called to estimate the posterior model probabilities.
# perform a leave-one-out cross validation for model selection modelSim <- sim_modelSel(index = index, sumstats = sumstats, nval = 100, tol = 0.1) # display the structure of the modelSim object str(modelSim)
The output of this leave-one-out cross validation for model selection is a list with 5 different elements that can be used in the error_modelSel
function to compute the confusion matrix and the mean misclassification probabilities of models.
Users can also define a threshold
for the posterior model probabilities. This threshold
corresponds to the minimum posterior probability of assignment. Thus, a simulation where the posterior probability of any model is below the threshold will not be assigned to a model and will instead be classified as unclear
.
# compute the mean misclassification probabilities mselError <- error_modelSel(object = modelSim)
The error_modelSel
function outputs the confusion matrix and the mean model posterior probabilities obtained with the regression method. It will also output other useful information such as the mean posterior probability of correctly assigned models and the mean posterior probability when each model is not correctly assigned.
For a more visual interpretation of these results, it is also possible to display a barplot of the model misclassification. By using the plot_msel
function we can plot the confusion matrix, either in colour (if color = TRUE
) or in grey (if color = FALSE
).
# display a barplot of model misclassification plot_msel(object = mselError)
Using the output of the error_modelSel
function as the input of the plot_msel
function, we can produce this barplot showing the proportion of simulations classified to any of the models.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.