Implementation of Bayesian mixture models for copy number estimation

Introduction

CNPBayes models multi-modal densities via a hierarchical Bayesian Gaussian mixture model. The major application of this model is the estimation of copy number at copy number polymorphic loci (CNPs). As described in the overview, four versions of the mixture model are implemented: SB, MB, SBP, and MBP. In each case, the posterior is approximated by Markov chain Monte Carlo (MCMC) implemented in C++ using the Rcpp package [@Rcpp].

For an EM implementation of Gaussian mixture models for CNPs, see the Bioconductor package CNVtools [@Barnes2008]. A Bayesian extension of this model by some of the same authors was developed to automate the analysis of the Wellcome Trust Case Control Consortium (WTCCC) genotype data [@cardin] and implemented in the R package CNVCALL (http://niallcardin.com/CNVCALL).

The key distinction of this package is an emphasis on exploring models that account for batch effects -- differences between groups of samples that were processed at different times and/or by different labs. For smaller studies, the more standard `SingleBatchModel` may be reasonable. This package provides an infrastructure for fitting all of these models and then comparing them by marginal likelihoods or Bayes factors. We envision these models being applied primarily in studies of germline diseases involving thousands of samples and several hundred CNPs. We suggest parallelizing by CNP (e.g., devoting a single core to each CNP), though we leave the details of parallelization to the user.

library(CNPBayes)
library(ggplot2)

Implementation

CNPBayes uses several S4 classes to encapsulate key parameters of the MCMC simulation, reducing the need for functions with a large number of arguments and providing an explicit contract for the arguments passed to the C++ back-end. The three core classes are `McmcParams`, `Hyperparameters`, and the mixture model classes themselves (e.g., the copy number models described below). An `McmcParams` object specifies the number of independent chains (`nStarts`), the number of burnin iterations, the number of iterations saved after burnin, and the thinning interval:

mp <- McmcParams(iter=1000, burnin=100, thin=1, nStarts=4)
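Each of these settings can be inspected with an accessor of the same name. A quick check, assuming the `iter`, `burnin`, `thin`, and `nStarts` accessors exported by the package:

iter(mp)    ## iterations saved after burnin
burnin(mp)  ## burnin iterations
thin(mp)    ## thinning interval
nStarts(mp) ## number of independently initialized chains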

Hyperparameters

Hyperparameters for the mixture model are specified as an instance of class `Hyperparameters`. These include the number of mixture components (`k`) and the parameters indexing the priors on the component means, variances, and mixture proportions (see the graphical overview below).

Instances of the hyperparameters class are created by a constructor with the same name as the class. For example,

  Hyperparameters()
  HyperparametersMultiBatch()

hpList is a convenience function for creating a list of hyperparameter objects appropriate for each of the four types of models:

hp.list <- hpList()
names(hp.list)
hp.list[["SB"]] ## hyperparameters for SB model
hp.list <- hpList(k=3) ## each object will have hyperparameters for a 3-component mixture model
k(hp.list[["SB"]])

A graphical overview of the model is shown below.

library(grid)
bayesianGrob <- function(x, r=unit(0.25, "inches")){
  tg <- textGrob(x)
  cg <- circleGrob(r=r)
  ## return the label grob nested within a circle
  gTree(children=gList(tg, cg))
}

grobY <- bayesianGrob(expression(y[i]))
grobThetas <- bayesianGrob(expression(theta[h]))
grobSigma2 <- bayesianGrob(expression(sigma[h]^2))
grobH <- bayesianGrob(expression(z[i]))
grobNu0 <- bayesianGrob(expression(nu[0]))
grobSigma02 <- bayesianGrob(expression(sigma[0]^2))
grobPi <- bayesianGrob(expression(pi[h]))
grobMu <- bayesianGrob(expression(mu[h]))
grobTau2 <- bayesianGrob(expression(tau[h]^2))

h <- unit(0.25, "inches")
e <- unit(0.05, "inches")

ar <- arrow(ends="last", length=unit(0.075, "inches"), type="closed")
grid.newpage()
y.x <- 0.5; y.y <- 0.1
h.x <- 0.1; h.y <- 0.3
theta.x <- 0.5; theta.y <- 0.3
sigma2.x <- 0.7; sigma2.y <- 0.3
pi.x <- 0.1; pi.y <- 0.5
mu.x <- 0.45; mu.y <- 0.5
tau2.x <- 0.55; tau2.y <- 0.5
nu0.x <- 0.65; nu0.y <- 0.5
s20.x <- 0.75; s20.y <- 0.5
## draw the observed-data node twice: first with a gray fill (the circle
## covers the label), then with a transparent fill so the label shows
grid.draw(editGrob(grobY, vp=viewport(y.x, y.y), gp=gpar(fill="gray")))
grid.draw(editGrob(grobY, vp=viewport(y.x, y.y), gp=gpar(fill="transparent")))
grid.draw(editGrob(grobH, vp=viewport(h.x, h.y)))

grid.draw(editGrob(grobThetas, vp=viewport(theta.x, theta.y)))
grid.draw(editGrob(grobSigma2, vp=viewport(sigma2.x, sigma2.y)))

## theta -> y
grid.move.to(unit(theta.x, "npc"), unit(theta.y, "npc") - h)
grid.line.to(unit(y.x, "npc"), unit(y.y, "npc") + h, arrow=ar,
             gp=gpar(fill="black"))
## sigma2 -> y
grid.move.to(unit(sigma2.x, "npc") - e, unit(sigma2.y, "npc") - h)
grid.line.to(unit(y.x, "npc") + h, unit(y.y, "npc") + h, arrow=ar,
             gp=gpar(fill="black"))

## z -> y
grid.move.to(unit(h.x, "npc") + h, unit(h.y, "npc") - h)
grid.line.to(unit(y.x, "npc") - h, unit(y.y, "npc") + h,  arrow=ar,
             gp=gpar(fill="black"))

## pi_h
grid.draw(editGrob(grobPi, vp=viewport(pi.x, pi.y)))
## pi_h -> z
grid.move.to(x=unit(pi.x, "npc"), y=unit(pi.y, "npc") - h)
grid.line.to(x=unit(h.x, "npc"),
             y=unit(h.y, "npc")+h, arrow=ar,
             gp=gpar(fill="black"))


## mu_h
grid.draw(editGrob(grobMu, vp=viewport(mu.x, mu.y)))
## mu_h -> theta_h
grid.move.to(x=unit(mu.x, "npc")+e, y=unit(mu.y, "npc") - h)
grid.line.to(x=unit(theta.x, "npc")-e, y=unit(theta.y, "npc")+h, arrow=ar,
             gp=gpar(fill="black"))


## tau2_h
grid.draw(editGrob(grobTau2, vp=viewport(tau2.x, tau2.y)))
## tau2_h -> theta_h
grid.move.to(x=unit(tau2.x, "npc")-e, y=unit(tau2.y, "npc") - h)
grid.line.to(x=unit(theta.x, "npc")+e, y=unit(theta.y, "npc")+h, arrow=ar,
             gp=gpar(fill="black"))

grid.draw(editGrob(grobNu0, vp=viewport(nu0.x, nu0.y)))
## nu_0 -> sigma2_h
grid.move.to(x=unit(nu0.x, "npc")+e, y=unit(nu0.y, "npc") - h)
grid.line.to(x=unit(sigma2.x, "npc")-e, y=unit(sigma2.y, "npc")+h, arrow=ar,
             gp=gpar(fill="black"))

grid.draw(editGrob(grobSigma02, vp=viewport(s20.x, s20.y)))
## sigma2_0 -> sigma2_h
grid.move.to(x=unit(s20.x, "npc")-e, y=unit(s20.y, "npc") - h)
grid.line.to(x=unit(sigma2.x, "npc")+e, y=unit(sigma2.y, "npc")+h, arrow=ar,
             gp=gpar(fill="black"))

Simple example

CNPBayes allows simulation of test data. The number of observations, the mixture proportions, and the mean and standard deviation of each mixture component must be specified.

set.seed(123)
sim.data <- simulateData(N=1000, p=rep(1/3, 3),
                         theta=c(-1, 0, 1),
                         sds=rep(0.1, 3))
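
Before fitting a model, it can be useful to inspect the simulated data directly; the accessor y (also used below when calling gibbs) returns the one-dimensional summaries:

## the three modes are evident in a histogram of the simulated values
hist(y(sim.data), breaks=100, main="", xlab="simulated one-dimensional summaries")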

The workhorse function in this package is gibbs. This function allows the user to specify one or all of the four possible mixture model types and, for each type, to fit models with the numbers of mixture components given by the argument k_range. Below, we run 4 chains of 600 iterations each (following a burnin of 100 iterations) for only the k=3 SB model. (In practice, we would fit multiple models and specify a range for k -- e.g., k_range=c(1, 5).) As described in the convergence vignette, gibbs attempts to adapt the thinning and burnin parameters to attain convergence. The models evaluated by gibbs are returned in a list in which all chains have been combined and the models are sorted by decreasing marginal likelihood. To properly assess convergence, this function requires that at least 2 independent chains be run.

## Create McmcParams for the SB model
mp <- McmcParams(iter=600, burnin=100, thin=1, nStarts=4)
model.list <- gibbs(model="SB", dat=y(sim.data),
                    k_range=c(3, 3), mp=mp)

We use Chib's estimator [@chib], running a reduced Gibbs sampler to estimate the marginal likelihood of each model. Alternative methods for selecting k are also included. For example, the bic method calculates the Bayesian Information Criterion. As the data were simulated from a three-component distribution, SB3 should be chosen.

round(sapply(model.list, marginal_lik), 1)
names(model.list)[1]

For comparing models of the same type, one could also use the BIC.

round(sapply(model.list, bic), 1)

Copy number

The mixture components do not necessarily correspond to distinct copy number states. For example, if the number of probes in a CNP region is small, the average log R ratio calculated for each sample is more likely to be skewed. In practice, one would fit many models (e.g., k = 1-4), select the best model by the marginal likelihood, and then map components from the best model to distinct copy number states. Mapping mixture components to copy number states can be done manually using the mapping<- replacement function. Alternatively, the constructor CopyNumberModel will automatically create a mapping based on the overlap of the component densities. By mapping mixture components to copy number states, posterior probabilities can be computed for the copy number states rather than for the mixture components.

model <- model.list[[1]]
cn.model <- CopyNumberModel(model)
mapping(cn.model)

The above mapping (1 2 3) proposes a one-to-one mapping between the 3 mixture components and the copy number states. Visualizing the model-based density estimates overlaid on the data can be helpful for assessing the adequacy of this particular model:

ggMixture(cn.model)

As our interest is in the posterior probability of the copy number state (not the posterior probability of the mixture components), these probabilities can be computed from a copy number model by the function probCopyNumber.

head(probCopyNumber(cn.model))

One can explicitly set the mapping using the mapping<- replacement method. For example, here we map the first 2 components to the same state, and the probability matrix now has only two columns:

cn.model2 <- cn.model
mapping(cn.model2) <- c("1", "1", "2")
head(probCopyNumber(cn.model2))

Again, ggMixture can be helpful for assessing the model. Here, mapping the first 2 components to the same state would only be reasonable if the differences between the first 2 components could be explained by batch effects. To evaluate this, one would want to identify the batches (next section) and fit the MB or MBP models in the gibbs function illustrated above.

ggMixture(cn.model2)

Batch Estimation

Differences in the one-dimensional summaries between groups of samples can arise from technical sources of variation referred to as batch effects. For example, date, temperature, machine calibration, and lab technician are all potentially important technical sources of variation [@batch-effect]. Batch effects can vary by locus, with some loci more susceptible to technical factors that may affect measurement. The 96-well chemistry plate on which the samples were processed is often a useful surrogate for batch effects, since it tends to capture samples that were processed (e.g., PCR amplification) and scanned at approximately the same time. However, in large studies involving potentially hundreds of chemistry plates, it is not computationally feasible to fit fully Bayesian plate-specific mixture models. Because chemistry plates are often processed at similar times and may be comparable in terms of the distribution of a statistic of interest, it can be more useful (and computationally scalable) to group the chemistry plates into batches. Here, we implement a simple two-sample Kolmogorov-Smirnov (KS) test for each pairwise combination of chemistry plates in the function collapseBatch. The KS test is applied recursively until no two groups of plates can be combined. We illustrate with a trivial example.

k <- 3
nbatch <- 3
means <- matrix(c(-1.2, -1.0, -1.0,
                -0.2, 0, 0,
                0.8, 1, 1), nbatch, k, byrow=FALSE)
sds <- matrix(0.1, nbatch, k)
N <- 500
sim.data <- simulateBatchData(N=N,
                              batch=rep(letters[1:3], length.out=N),
                              theta=means,
                              sds=sds,
                              p=c(1/5, 1/3, 1-1/3-1/5))
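
Given the simulated plate labels, collapseBatch groups plates whose empirical distributions do not differ by the pairwise KS test. A minimal sketch, assuming collapseBatch accepts the numeric summaries and a vector of plate labels; plates b and c were simulated with identical component means and should therefore collapse into a single batch:

## recursively apply pairwise KS tests to group plates into batches
batches <- collapseBatch(y(sim.data), plate=as.character(batch(sim.data)))
table(batches)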

Again, one could apply the gibbs function to these data. Since we know the truth (the same variance for each of the 3 mixture components, and multiple batches), we could simply fit the pooled-variance multi-batch model (MBP) with 3 components. Note that the multi-batch models also require the batch labels:

gibbs(model="MBP", dat=y(sim.data), k_range=c(3, 3))

In practice, we would evaluate many models and select the best model by the marginal likelihood or the Bayes factor:

model.list <- gibbs(model=c("SBP", "MBP"), k_range=c(1, 5),
                    dat=y(sim.data),
                    batches=batch(sim.data), mp=mp)
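
As before, the returned list is sorted by decreasing marginal likelihood, so the best-supported model appears first; with these simulated data we would expect a 3-component multi-batch model at the top of the list:

round(sapply(model.list, marginal_lik), 1)
names(model.list)[1]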

References


