# no options set yet
Estimating multiple infections from variant data obtained using massively parallel
sequencing experiments can be done in several ways using moimix
.
The core function in moimix
is binomix
which is an interface to
a binomial mixture model estimation procedure using the R package flexmix
. The input
to this function is a matrix of read counts supporting the reference and alternate alleles
for all the variants in data.
In this vignette, we will introduce some key functions in moimix
and how
to set up your variant data to estimate some measures of multiplicity of infection.
moimix
Function | Description ---------|------------- binommix | estimate MOI using binomial mixture model alleleCounts | extract a matrix of read counts getBAF | estimate B-allele frequencies from read counts plotBAF | plot the B-allele frequency spectra genome-wide getMAF | esimate minor allele frequencies from read counts getFws | estimate within isolate diversity according to Manske et. al, 2012 callMajor | call major haplotypes using B-allele frequencies extractCOIL | Output a COIL barcode file extractPED | Create haploid PLINK PED file
moimix
assumes that variants have been called either from whole-genome/exome or RNA
sequencing and the variant calling algorithm used outputs a variant call format (VCF) version 4.1
file. We also assume that each sample in the VCF file has been annotated with a AD tag
at the variant sites for both the alternate and reference alleles. Currently, we have tested the functions using output from GATK UnifiedGenotyper and HaplotypeCaller algorithms.
moimix
requires that the VCF file is then converted to the Genomic Data Storage (GDS) format. The GDS format provides a fast and convienent way of manipulating large VCF files in R. We make use of the SeqArray
package
for both VCF conversion and manipulating GDS files.
If you are just interested in running the binomix procedure, then you do not need
to have a GDS file. For each sample construct a matrix with 2 columns
correpsonding the read counts for the alternate and referene alleles. Then pass this matrix
to the binomix
function.
To convert a tabixed gzipped VCF file to the GDS format use the SeqArray
package.
library(SeqArray) seqVCF2GDS("/path/to/my/file.vcf.gz", "/path/to/my/file.gds")
To read the GDS file into R, use the seqOpen
function and to check that its
output matches the orginal VCF file use seqSummary.
my_vcf <-seqOpen("/path/to/my/file.gds") seqSummary(my_vcf)
Here we present an example analysis from data obtained from the Malaria Genomics Consoritum community project. The data consists of 3 Plasmodium falciparum isolates from The Gambia with approximately 60,000 high-quality biallelic SNVs on 14 nuclear chromosomes and the apicoplast.
For details on how the data was processed please see the README file located at ftp://ngs.sanger.ac.uk/production/pf3k/release_4/
First, we use SeqArray
to load our GDS file and look at it.
library(SeqArray) gds_file <- system.file("extdata", "gambian_isolates.gds", package = "moimix") isolates <- seqOpen(gds_file) seqSummary(isolates) # save sample identifiers sample.id <- seqGetData(isolates, "sample.id")
We can look at the number of variants on each of the main nuclear chromosomes.
library(moimix) # get genomic coordinates of all variants coords <- getCoordinates(isolates) head(coords) table(coords$chromosome)
Furthermore, we can extract high quality ('true positive') SNPs using the seqGetData
function and obtaining filters based on the GATK best practices.
Here we look at the distributions of the QD and SOR tags in the original VCF file.
# extract tags from GATK Haplotype Caller using convienence funciton qd <- seqGetData(isolates, "annotation/info/QD") sor <- seqGetData(isolates, "annotation/info/SOR") hist(qd) hist(sor)
The first step in our clonality analysis is to estiamte the B-allele frequencies
for each isolate directly from the sequencing depth.
We construct a bafMatrix
object which stores the estimates of the B-allele
frequencies in a matrix over the entire cohort and the genomic coordinates
of the variants.
isolate_baf <- bafMatrix(isolates) class(isolate_baf) str(isolate_baf)
You can plot the genome-wide B-allele frequencies by passing the bafMatrix
object
and a sample identifer to plot
.
Here's an example for an isolate that is from a single-clone infection.
plot(isolate_baf, sample.id = "PA0022-C")
And an example of multiple-clone infection.
plot(isolate_baf, sample.id = "PA0021-C")
And a slighly more complicated example.
plot(isolate_baf, sample.id = "PA0008-C")
binommix
To estimate MOI we use the binomix
function. MOI is estimated on a per-isolate
basis using a binomial mixture model implemented in the R package flexmix
.
In order to fit a model for an isolate, we need a matrix of read counts and to
set the value of $k=2c$ where $c$ is the number clonal infections.
We fit a mixture model assuming there are 2 infections on "PA0021-C" and report the results.
set.seed(2002) # first generate matrix of read counts for ref and alt alleles counts <- alleleCounts(isolates) # fit a model on sample PA0021-C, multiple clone infection m1 <- binommix(counts, sample.id = "PA0021-C", k = 2) # to inspect the model and view parameter estimates param_estimates <- getTheta(m1$fits)
The estimated clonal proportions under the binomial mixture model are:
param_estimates
We can visualise the model output as follows:
plot(isolate_baf, "PA0021-C") abline(h = param_estimates$mu.hat, col = "blue", lty = 2)
Or by plotting a histogram.
bf <- isolate_baf$baf_matrix["PA0021-C",] hist(bf, breaks = 20, main = "MOI = 2 infection", xlab = "SNV frequency") abline(v = param_estimates$mu.hat, col = "blue", lty = 2)
We can also examine the more complicated mixed infection, which appears to be a mixture of 3 or 4 clones. We can use the Bayesian Information Criterion to select the number of components.
m2 <- binommix(counts, sample.id = "PA0008-C", k = 2:4) plot(isolate_baf, "PA0008-C") abline(h = getTheta(m2$fits, criterion = "BIC")$mu.hat, col = "blue", lty = 2) hist(m2$baf, breaks = 20, main = "Complex Infection", xlab = "SNV frequency" ) abline(v = getTheta(m2$fits, criterion = "BIC")$mu.hat, col = "blue", lty = 2)
In this case we end up with 4 components in our model - with estimated clonal proportions as
r paste0("(", paste(round(getTheta(m2$fits, criterion = "BIC")$pi.hat, 2), collapse = ","), ")")
.
An alternative way of assessing clonality is to estimte the $F_{ws}$ statistic. This is useful if you are only interested in identifying clonal infections for filtering in downstream analyses. An $F_{ws} < 0.99$ is indicative of a clonal infection.
fws_all <- getFws(isolates) fws_all seqClose(isolates)
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.