R/01-likelihood.R In DeepCNV: Exploiting Normal Contamination to Infer Copy Number from Deep Sequencing

# Copyright (C) Kevin R. Coombes 2014-2017

###################################
### To understand what's going on, we need to briefly explain the whole model.
###
### The observed data is considered to be a collection of triples of the form
###     D_i = (K_i, N_i, V_i)
### where for the observed variant i, K_i is the number of variant reads,
### N_i is the total number of reads, and V_i is the "observed variant type",
### which is a binary factor with values in the set {SNP, Mutation}.
### We write D = (D_i) for the collection of all the data.
###
### The two global parameters that we care about are Theta = (nu, S),
### where nu in [0, 1] is the fraction of normal cells and S in {N, D, G}
### is the copy number state. (We simple-mindedly assume that there are
### no subclones to worry about, and we never gain or delete more than one
### copy of anything.)
###
### There are also a collection of nuisance parameters (G_i, T_i) where
### G_i is the true genotype-pair (one for normal, one for tumor) at a variant
### and T_i in {SNP, Mutation, LOH, BackMut} is the true variant type.
###
### By Bayes Rule, the posterior will be given by
###     P(Theta | D) = P(D | Theta) P(Theta) / P(D)
### We can compute the denominator as
###     P(D) = \sum_theta P(D | Theta = theta) P(Theta = theta)
### To implement this step, we will (eventually) just normalize everything
### by brute force.  Before that, however, we have to specify the likelihood
### model and explain how to compute it. We have
###     P(D | Theta) = \prod_i P(D_i | Theta)
### This formula follows from the assumption that the variants are independent
### sources of information. Now we can drop the subscript "i" and just look at
### the likelihood for the data froma single variant. We have
###     P(D | Theta) = \sum_t P(D, T=t | Theta) = \sum_t P(D | T=t, Theta) P(T=t | Theta).
### Note that this formula comes from integrating over the (distinct) possible
### values for the true variant state. We can also integrate over the possible
### genotypes to get
###     P(D | Theta) = \sum_t \sum_g P(D, G=g | T=t, Theta) P(T=t | Theta).
###         = \sum_t \sum_g P(D | G=g, T=t, Theta) P(G=g | T=t, Theta) P(T=t | Theta).
### Now we look at the three pieces we need to compute for that last part.
### First, we have
###     P(t=t | Theta) = P(T=t)
### since the global parameters are irrelevant to the true variant type. Since T is
### a discrete random variable, we just need to specify the distribution used for this
### mmodel. (SNPs are common; mutations are rare.)
### Second, we have
###     P(G=g, | T=t, Theta) = P(G | T, nu, S) = P(G | T, S)
### since the normal fraction nu tells us nothing about the true genotypes.
### This is another discrete distribution that we can specify, but it is more
### compiicated since there are four variant-types (T), three copy number states (S),
### and lots of possible genotypes (maybe as many as 24, but at least 18).
### Finally, we have
###     P(D | G, T, Theta) = P(D | G, T, nu, S) = P(K, N, V | G, T, nu, S)
###         = P(K, N | G, nu) P(V | T)
### since we model the read-counts (K, N) as independent of the observed variant
### type V. The term P(V | T) is yet another discrete distribution. The final piece,
### hoever, is modeled by a biomial distribution
###     P(K, N | G, nu) = Binom(K, N, phi)
### where phi = Phi(G, nu) is completely computable from the genoptype and the
### normal fraction.
###
### The prior on the global  parameters can be specified as
###     P(Theta) = P(nu) P(S)
### where
###     P(nu) is a beta distribution and
###     P(S) is a discrete distribuiton over the three copy number states.

###################################
### These are the normal-tumor genotype pairs that we consider
makeGenotypes <- function() {
### create the normal-tumor genotype pairs
genotypes <- function(M, N) {
nl <- sapply(0:M, function(m) {
paste(rep(c("R", "V"), times=c(M-m, m)), collapse="")
})
tu <- sapply(0:N, function(m) {
paste(rep(c("R", "V"), times=c(N-m, m)), collapse="")
})
data.frame(Normal=rep(nl, each=M), Tumor=rep(tu, times=N))
}
### only consider gains or deletions of one copy
g22 <- genotypes(2,2)
g21 <- genotypes(2,1)
g23 <- genotypes(2,3)
gt <- rbind(g22, g21, g23)
### we can compute the copy number state from the genotype
gt$S <- factor(c("Deleted", "Normal", "Gained")[sapply(as.character(gt$Tumor), nchar)])
### with work, we can also compute the variant type
### start by coiunting the total number of refernce (R) and variant (V) alleles
av <- sapply(strsplit(paste(as.character(gt$Normal), as.character(gt$Tumor), sep=""), ""),
function(x) sum(x=="V"))
ar <- sapply(strsplit(paste(as.character(gt$Normal), as.character(gt$Tumor), sep=""), ""),
function(x) sum(x=="R"))
### if all the reads are the same at this base, we never see it...
strange <- gt[av*ar==0,]
### now have at least one referencde and one variant allele
gt <- gt[av > 0 & ar > 0,]
### count the number R's and V'2s in both nromal (N) and tumor (T)
NR <- sapply(strsplit(as.character(gt$Normal), ""), function(x) sum(x=="R")) NV <- sapply(strsplit(as.character(gt$Normal), ""), function(x) sum(x=="V"))
TR <- sapply(strsplit(as.character(gt$Tumor), ""), function(x) sum(x=="R")) TV <- sapply(strsplit(as.character(gt$Tumor), ""), function(x) sum(x=="V"))
### define te four variant types
mut <- NV == 0 & TV > 0
bak <- NR == 0 & TR > 0
loh <- !mut & !bak & (TR==0 | TV==0) & gt$S != "Deleted" snp <- !mut & !bak & !loh ### assemble them into a four-valued factor foo <- rep("BackMut", nrow(gt)) foo[mut] <- "Mutation" foo[loh] <- "LOH" foo[snp] <- "SNP" gt$V <- factor(foo)
rownames(gt) <- paste(as.character(gt$Normal), as.character(gt$Tumor), sep=".")
gt
}

basecase <- makeGenotypes()

cnSet <- levels(basecase$S) # copy number status vtSet <- levels(basecase$V)                  # true variant type

###################################
### Internal models
### KRC: Should we allow users to modify these (pT or pVgT or pGgTS)?

### P(T=t)
### SNPs are common; mutations are rare.
pT <- c("BackMut"=0.001, "LOH"=0.01, "Mutation"=0.1, "SNP"=0.889)

### P(V | T)
### Most of the time, the reported variant type is likely to be correct.
### We're more likely to make the mistake of calling a Mutation a SNP than
### the other way around because of corruption or misunderstandings in the
### dbSNP database.
pVgT <- matrix(c(0.1, 0.9, 0.02, 0.98, 0.99, 0.01, 0.02, 0.98), nrow=2, ncol=4)
rownames(pVgT) <- c("Mutation", "SNP")
colnames(pVgT) <- c("BackMut", "LOH", "Mutation", "SNP")

### P(G | T, S)
### We compute this out of the 'basecase' object. We make the assumption that
### all genotype combinations that exist in 'basecase' are conditionally equally
### likely.
makeDiscretePrior <- function() {
pGgTS <- array(0,
dim=c(nrow(basecase), length(cnSet), length(vtSet)),
dimnames = list(NULL, cnSet, vtSet))
for ( cn in cnSet ) {
for (vt in vtSet ) {
w <- which(basecase$S == cn & basecase$V == vt)
L <- length(w)
if (L > 0) {
pGgTS[w, cn,vt] <- 1/L
}
}
}
pGgTS
}
pGgTS <- makeDiscretePrior()

###################################
### key part of the likelihood model is to compute the expected fraction of
### variant reads as a function of the genotype and the normal fraction, nu.
### We do this by producing functions of from the given genotypes.
CNVariant <- function(gNormal, gTumor) {
nl <- strsplit(gNormal, "")[[1]]
rn <- sum(nl == "R")
vn <- sum(nl == "V")
tu <- strsplit(gTumor, "")[[1]]
rt <- sum(tu == "R")
vt <- sum(tu == "V")
function(nu) {
(vn*nu + vt*(1 - nu)) / ((rn + vn)*nu + (rt + vt)*(1 - nu))
}
}

phis <- sapply(1:nrow(basecase), function(i) {
gN <- as.character(basecase$Normal[i]) gT <- as.character(basecase$Tumor[i])
CNVariant(gN, gT)
})
names(phis) <- rownames(basecase)

###################################
### actually compute the log-likelihood
cnvLikelihood <- function(nus, obs) {
if (any(nus < 0)) stop("Normal fractions must be non-negative.")
if (any(nus > 1)) stop("Normal fractions must be between 0 and 1.")
# need a three-dimensional array
#  1. grid points over the beta distribution on nu
#  2. copy number choices * variant type choices (basecase)
#  3. number of data rows in 'obs'
hiddenloglike <- array(NA,
dim=c(length(nus), nrow(basecase), nrow(obs)),
dimnames=list(NULL, rownames(basecase), rownames(obs)))
pd <- basecase # hidden discrete genotype states
for (i in 1:nrow(obs)) {   # each data row
K <- obs$K[i] # variant reads N <- obs$N[i] # total reads
for (pj in 1:nrow(pd)) { # each hidden discrete state
phi <- phis[[pj]](nus) # expected success as function of nu
hiddenloglike[,pj,i] <- dbinom(K, N, phi, log=TRUE)
}
}
hiddenloglike
}

###################################
simReads <- function(nMut, nVar, nu, cnstate=cnSet, depth=100, sdepth=25, ...) {
cnstate <- match.arg(cnstate)
bc <- basecase[basecase$S == cnstate,] wv <- c(sample(which(!(bc$V %in% c("BackMut", "Mutation"))), nVar, replace=TRUE),
sample(which(bc$V == "Mutation"), nMut, replace=TRUE)) design <- bc[wv,] rownames(design) <- NULL design$nu <- nu
design$N <- round(rnorm(nrow(design), depth, sdepth)) nVars <- apply(design, 1, function(arow) { phi <- CNVariant(as.character(arow[1]), as.character(arow[2])) rbinom(1, as.numeric(arow[6]), phi(nu)) }) design$K <- nVars
v <- as.character(design$V) v[v!="Mutation"] <- "SNP" design$Valt <- design$V design$V <- factor(v)
design
}


Try the DeepCNV package in your browser

Any scripts or data that you put into this service are public.

DeepCNV documentation built on May 2, 2019, 5:23 p.m.