In sa-lee/moimix: Estimating mulitplicity of infection from high-throughput sequencing data

# no options set yet

Estimating multiple infections from variant data obtained using massively parallel sequencing experiments can be done in several ways using moimix.

The core function in moimix is binomix which is an interface to a binomial mixture model estimation procedure using the R package flexmix. The input to this function is a matrix of read counts supporting the reference and alternate alleles for all the variants in data.

In this vignette, we will introduce some key functions in moimix and how to set up your variant data to estimate some measures of multiplicity of infection.

Key functions in `moimix`

Function | Description ---------|------------- binommix | estimate MOI using binomial mixture model alleleCounts | extract a matrix of read counts getBAF | estimate B-allele frequencies from read counts plotBAF | plot the B-allele frequency spectra genome-wide getMAF | esimate minor allele frequencies from read counts getFws | estimate within isolate diversity according to Manske et. al, 2012 callMajor | call major haplotypes using B-allele frequencies extractCOIL | Output a COIL barcode file extractPED | Create haploid PLINK PED file

Data Formats

moimix assumes that variants have been called either from whole-genome/exome or RNA sequencing and the variant calling algorithm used outputs a variant call format (VCF) version 4.1 file. We also assume that each sample in the VCF file has been annotated with a AD tag at the variant sites for both the alternate and reference alleles. Currently, we have tested the functions using output from GATK UnifiedGenotyper and HaplotypeCaller algorithms.

moimix requires that the VCF file is then converted to the Genomic Data Storage (GDS) format. The GDS format provides a fast and convienent way of manipulating large VCF files in R. We make use of the SeqArray package for both VCF conversion and manipulating GDS files.

If you are just interested in running the binomix procedure, then you do not need to have a GDS file. For each sample construct a matrix with 2 columns correpsonding the read counts for the alternate and referene alleles. Then pass this matrix to the binomix function.

Converting a VCF to GDS format

To convert a tabixed gzipped VCF file to the GDS format use the SeqArray package.

library(SeqArray)
seqVCF2GDS("/path/to/my/file.vcf.gz", "/path/to/my/file.gds")

To read the GDS file into R, use the seqOpen function and to check that its output matches the orginal VCF file use seqSummary.

my_vcf <-seqOpen("/path/to/my/file.gds")
seqSummary(my_vcf)

Case study: Plasmodium falciparum isolates from The Gambia.

Here we present an example analysis from data obtained from the Malaria Genomics Consoritum community project. The data consists of 3 Plasmodium falciparum isolates from The Gambia with approximately 60,000 high-quality biallelic SNVs on 14 nuclear chromosomes and the apicoplast.

For details on how the data was processed please see the README file located at ftp://ngs.sanger.ac.uk/production/pf3k/release_4/

First, we use SeqArray to load our GDS file and look at it.

library(SeqArray)
gds_file <- system.file("extdata", "gambian_isolates.gds", package = "moimix")
isolates <- seqOpen(gds_file)



seqSummary(isolates)
# save sample identifiers
sample.id <- seqGetData(isolates, "sample.id")

We can look at the number of variants on each of the main nuclear chromosomes.

library(moimix)
# get genomic coordinates of all variants
coords <- getCoordinates(isolates)
head(coords)

table(coords$chromosome)

Furthermore, we can extract high quality ('true positive') SNPs using the seqGetData function and obtaining filters based on the GATK best practices. Here we look at the distributions of the QD and SOR tags in the original VCF file.

# extract tags from GATK Haplotype Caller using convienence funciton
qd <- seqGetData(isolates, "annotation/info/QD")
sor <- seqGetData(isolates, "annotation/info/SOR") 

hist(qd)
hist(sor)

Estimating the BAF matrix

The first step in our clonality analysis is to estiamte the B-allele frequencies for each isolate directly from the sequencing depth. We construct a bafMatrix object which stores the estimates of the B-allele frequencies in a matrix over the entire cohort and the genomic coordinates of the variants.

isolate_baf <- bafMatrix(isolates)
class(isolate_baf)
str(isolate_baf)

You can plot the genome-wide B-allele frequencies by passing the bafMatrix object and a sample identifer to plot.

Here's an example for an isolate that is from a single-clone infection.

plot(isolate_baf, sample.id = "PA0022-C")

And an example of multiple-clone infection.

plot(isolate_baf,  sample.id = "PA0021-C")

And a slighly more complicated example.

plot(isolate_baf,  sample.id = "PA0008-C")

Estimating MOI with `binommix`

To estimate MOI we use the binomix function. MOI is estimated on a per-isolate basis using a binomial mixture model implemented in the R package flexmix. In order to fit a model for an isolate, we need a matrix of read counts and to set the value of $k=2c$ where $c$ is the number clonal infections.

We fit a mixture model assuming there are 2 infections on "PA0021-C" and report the results.

set.seed(2002)
# first generate matrix of read counts for ref and alt alleles
counts <- alleleCounts(isolates)

# fit a model on sample PA0021-C, multiple clone infection
m1 <- binommix(counts, sample.id = "PA0021-C", k = 2)
# to inspect the model and view parameter estimates
param_estimates <- getTheta(m1$fits)

The estimated clonal proportions under the binomial mixture model are:

param_estimates

We can visualise the model output as follows:

plot(isolate_baf, "PA0021-C")
abline(h = param_estimates$mu.hat, col = "blue", lty = 2)

Or by plotting a histogram.

bf <- isolate_baf$baf_matrix["PA0021-C",]
hist(bf, breaks = 20, main = "MOI = 2 infection", xlab = "SNV frequency")
abline(v = param_estimates$mu.hat, col = "blue", lty = 2)

We can also examine the more complicated mixed infection, which appears to be a mixture of 3 or 4 clones. We can use the Bayesian Information Criterion to select the number of components.

m2 <- binommix(counts, sample.id = "PA0008-C", k = 2:4)

plot(isolate_baf, "PA0008-C")
abline(h = getTheta(m2$fits, criterion = "BIC")$mu.hat, col = "blue", lty = 2)

hist(m2$baf, breaks = 20, main = "Complex Infection", xlab = "SNV frequency" )
abline(v = getTheta(m2$fits, criterion = "BIC")$mu.hat, col = "blue", lty = 2)

In this case we end up with 4 components in our model - with estimated clonal proportions as r paste0("(", paste(round(getTheta(m2$fits, criterion = "BIC")$pi.hat, 2), collapse = ","), ")").

Estimating MOI with $F_{ws}$

An alternative way of assessing clonality is to estimte the $F_{ws}$ statistic. This is useful if you are only interested in identifying clonal infections for filtering in downstream analyses. An $F_{ws} < 0.99$ is indicative of a clonal infection.

fws_all <- getFws(isolates)

fws_all

seqClose(isolates)

References

flexmix package - https://cran.r-project.org/web/packages/flexmix/index.html
SeqArray package - http://bioconductor.org/packages/release/bioc/html/SeqArray.html