R/01-likelihood.R
In DeepCNV: Exploiting Normal Contamination to Infer Copy Number from Deep Sequencing

Documented in CNVariant cnvLikelihood simReads

# Copyright (C) Kevin R. Coombes 2014-2017

###################################
### To understand what's going on, we need to briefly explain the whole model.
###
### The observed data is considered to be a collection of triples of the form
###     D_i = (K_i, N_i, V_i)
### where for the observed variant i, K_i is the number of variant reads,
### N_i is the total number of reads, and V_i is the "observed variant type",
### which is a binary factor with values in the set {SNP, Mutation}.
### We write D = (D_i) for the collection of all the data.
###
### The two global parameters that we care about are Theta = (nu, S),
### where nu in [0, 1] is the fraction of normal cells and S in {N, D, G}
### is the copy number state. (We simple-mindedly assume that there are
### no subclones to worry about, and we never gain or delete more than one
### copy of anything.)
###
### There are also a collection of nuisance parameters (G_i, T_i) where
### G_i is the true genotype-pair (one for normal, one for tumor) at a variant
### and T_i in {SNP, Mutation, LOH, BackMut} is the true variant type.
###
### By Bayes Rule, the posterior will be given by
###     P(Theta | D) = P(D | Theta) P(Theta) / P(D)
### We can compute the denominator as
###     P(D) = \sum_theta P(D | Theta = theta) P(Theta = theta)
### To implement this step, we will (eventually) just normalize everything
### by brute force.  Before that, however, we have to specify the likelihood
### model and explain how to compute it. We have
###     P(D | Theta) = \prod_i P(D_i | Theta)
### This formula follows from the assumption that the variants are independent
### sources of information. Now we can drop the subscript "i" and just look at
### the likelihood for the data froma single variant. We have
###     P(D | Theta) = \sum_t P(D, T=t | Theta) = \sum_t P(D | T=t, Theta) P(T=t | Theta).
### Note that this formula comes from integrating over the (distinct) possible
### values for the true variant state. We can also integrate over the possible
### genotypes to get
###     P(D | Theta) = \sum_t \sum_g P(D, G=g | T=t, Theta) P(T=t | Theta).
###         = \sum_t \sum_g P(D | G=g, T=t, Theta) P(G=g | T=t, Theta) P(T=t | Theta).
### Now we look at the three pieces we need to compute for that last part.
### First, we have
###     P(t=t | Theta) = P(T=t)
### since the global parameters are irrelevant to the true variant type. Since T is
### a discrete random variable, we just need to specify the distribution used for this
### mmodel. (SNPs are common; mutations are rare.)
### Second, we have
###     P(G=g, | T=t, Theta) = P(G | T, nu, S) = P(G | T, S)
### since the normal fraction nu tells us nothing about the true genotypes.
### This is another discrete distribution that we can specify, but it is more
### compiicated since there are four variant-types (T), three copy number states (S),
### and lots of possible genotypes (maybe as many as 24, but at least 18).
### Finally, we have
###     P(D | G, T, Theta) = P(D | G, T, nu, S) = P(K, N, V | G, T, nu, S)
###         = P(K, N | G, nu) P(V | T)
### since we model the read-counts (K, N) as independent of the observed variant
### type V. The term P(V | T) is yet another discrete distribution. The final piece,
### hoever, is modeled by a biomial distribution
###     P(K, N | G, nu) = Binom(K, N, phi)
### where phi = Phi(G, nu) is completely computable from the genoptype and the
### normal fraction.
###
### The prior on the global  parameters can be specified as
###     P(Theta) = P(nu) P(S)
### where
###     P(nu) is a beta distribution and
###     P(S) is a discrete distribuiton over the three copy number states.

###################################
### These are the normal-tumor genotype pairs that we consider
makeGenotypes <- function() {
  ### create the normal-tumor genotype pairs
  genotypes <- function(M, N) {
    nl <- sapply(0:M, function(m) {
      paste(rep(c("R", "V"), times=c(M-m, m)), collapse="")
    })
    tu <- sapply(0:N, function(m) {
      paste(rep(c("R", "V"), times=c(N-m, m)), collapse="")
    })
    data.frame(Normal=rep(nl, each=M), Tumor=rep(tu, times=N))
  }
  ### only consider gains or deletions of one copy
  g22 <- genotypes(2,2)
  g21 <- genotypes(2,1)
  g23 <- genotypes(2,3)
  gt <- rbind(g22, g21, g23)
  ### we can compute the copy number state from the genotype
  gt$S <- factor(c("Deleted", "Normal", "Gained")[sapply(as.character(gt$Tumor), nchar)])
  ### with work, we can also compute the variant type
  ### start by coiunting the total number of refernce (R) and variant (V) alleles
  av <- sapply(strsplit(paste(as.character(gt$Normal),
                              as.character(gt$Tumor), sep=""), ""),
               function(x) sum(x=="V"))
  ar <- sapply(strsplit(paste(as.character(gt$Normal),
                              as.character(gt$Tumor), sep=""), ""),
               function(x) sum(x=="R"))
  ### if all the reads are the same at this base, we never see it...
  strange <- gt[av*ar==0,]
  ### now have at least one referencde and one variant allele
  gt <- gt[av > 0 & ar > 0,]
  ### count the number R's and V'2s in both nromal (N) and tumor (T)
  NR <- sapply(strsplit(as.character(gt$Normal), ""), function(x) sum(x=="R"))
  NV <- sapply(strsplit(as.character(gt$Normal), ""), function(x) sum(x=="V"))
  TR <- sapply(strsplit(as.character(gt$Tumor), ""), function(x) sum(x=="R"))
  TV <- sapply(strsplit(as.character(gt$Tumor), ""), function(x) sum(x=="V"))
  ### define te four variant types
  mut <- NV == 0 & TV > 0
  bak <- NR == 0 & TR > 0
  loh <- !mut & !bak & (TR==0 | TV==0) & gt$S != "Deleted"
  snp <- !mut & !bak & !loh
  ### assemble them into a four-valued factor
  foo <- rep("BackMut", nrow(gt))
  foo[mut] <- "Mutation"
  foo[loh] <- "LOH"
  foo[snp] <- "SNP"
  gt$V <- factor(foo)
  rownames(gt) <- paste(as.character(gt$Normal), as.character(gt$Tumor), sep=".")
  gt
}

basecase <- makeGenotypes()

cnSet <- levels(basecase$S)                  # copy number status
vtSet <- levels(basecase$V)                  # true variant type

###################################
### Internal models
### KRC: Should we allow users to modify these (pT or pVgT or pGgTS)?

### P(T=t)
### SNPs are common; mutations are rare.
pT <- c("BackMut"=0.001, "LOH"=0.01, "Mutation"=0.1, "SNP"=0.889)

### P(V | T)
### Most of the time, the reported variant type is likely to be correct.
### We're more likely to make the mistake of calling a Mutation a SNP than
### the other way around because of corruption or misunderstandings in the
### dbSNP database.
pVgT <- matrix(c(0.1, 0.9, 0.02, 0.98, 0.99, 0.01, 0.02, 0.98), nrow=2, ncol=4)
rownames(pVgT) <- c("Mutation", "SNP")
colnames(pVgT) <- c("BackMut", "LOH", "Mutation", "SNP")

### P(G | T, S)
### We compute this out of the 'basecase' object. We make the assumption that
### all genotype combinations that exist in 'basecase' are conditionally equally
### likely.
makeDiscretePrior <- function() {
  pGgTS <- array(0,
                 dim=c(nrow(basecase), length(cnSet), length(vtSet)),
                 dimnames = list(NULL, cnSet, vtSet))
  for ( cn in cnSet ) {
    for (vt in vtSet ) {
      w <- which(basecase$S == cn & basecase$V == vt)
      L <- length(w)
      if (L > 0) {
        pGgTS[w, cn,vt] <- 1/L
    }
    }
  }
  pGgTS
}
pGgTS <- makeDiscretePrior()

###################################
### key part of the likelihood model is to compute the expected fraction of
### variant reads as a function of the genotype and the normal fraction, nu.
### We do this by producing functions of from the given genotypes.
CNVariant <- function(gNormal, gTumor) {
  nl <- strsplit(gNormal, "")[[1]]
  rn <- sum(nl == "R")
  vn <- sum(nl == "V")
  tu <- strsplit(gTumor, "")[[1]]
  rt <- sum(tu == "R")
  vt <- sum(tu == "V")
  function(nu) {
    (vn*nu + vt*(1 - nu)) / ((rn + vn)*nu + (rt + vt)*(1 - nu))
  }
}

phis <- sapply(1:nrow(basecase), function(i) {
  gN <- as.character(basecase$Normal[i])
  gT <- as.character(basecase$Tumor[i])
  CNVariant(gN, gT)
})
names(phis) <- rownames(basecase)

###################################
### actually compute the log-likelihood
cnvLikelihood <- function(nus, obs) {
  if (any(nus < 0)) stop("Normal fractions must be non-negative.")
  if (any(nus > 1)) stop("Normal fractions must be between 0 and 1.")
  # need a three-dimensional array
  #  1. grid points over the beta distribution on nu
  #  2. copy number choices * variant type choices (basecase)
  #  3. number of data rows in 'obs'
  hiddenloglike <- array(NA,
                         dim=c(length(nus), nrow(basecase), nrow(obs)),
                         dimnames=list(NULL, rownames(basecase), rownames(obs)))
  pd <- basecase # hidden discrete genotype states
  for (i in 1:nrow(obs)) {   # each data row
    K <- obs$K[i] # variant reads
    N <- obs$N[i] # total reads
    for (pj in 1:nrow(pd)) { # each hidden discrete state
      phi <- phis[[pj]](nus) # expected success as function of nu
      hiddenloglike[,pj,i] <- dbinom(K, N, phi, log=TRUE)
    }
  }
  hiddenloglike
}

###################################
simReads <- function(nMut, nVar, nu, cnstate=cnSet, depth=100, sdepth=25, ...) {
  cnstate <- match.arg(cnstate)
  bc <- basecase[basecase$S == cnstate,]
  wv <- c(sample(which(!(bc$V %in% c("BackMut", "Mutation"))), nVar, replace=TRUE),
          sample(which(bc$V == "Mutation"), nMut, replace=TRUE))
  design <- bc[wv,]
  rownames(design) <- NULL
  design$nu <- nu
  design$N <- round(rnorm(nrow(design), depth, sdepth))
  nVars <- apply(design, 1, function(arow) {
    phi <- CNVariant(as.character(arow[1]), as.character(arow[2]))
    rbinom(1, as.numeric(arow[6]), phi(nu))
  })
  design$K <- nVars
  v <- as.character(design$V)
  v[v!="Mutation"] <- "SNP"
  design$Valt <- design$V
  design$V <- factor(v)
  design
}

Any scripts or data that you put into this service are public.

DeepCNV documentation built on May 2, 2019, 5:23 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

DeepCNV
Exploiting Normal Contamination to Infer Copy Number from Deep Sequencing

R/01-likelihood.R
In DeepCNV: Exploiting Normal Contamination to Infer Copy Number from Deep Sequencing

Defines functions makeGenotypes makeDiscretePrior CNVariant cnvLikelihood simReads

Documented in CNVariant cnvLikelihood simReads

Try the DeepCNV package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

DeepCNV Exploiting Normal Contamination to Infer Copy Number from Deep Sequencing

R/01-likelihood.R In DeepCNV: Exploiting Normal Contamination to Infer Copy Number from Deep Sequencing

Defines functions makeGenotypes makeDiscretePrior CNVariant cnvLikelihood simReads

Documented in CNVariant cnvLikelihood simReads

Try the DeepCNV package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

DeepCNV
Exploiting Normal Contamination to Infer Copy Number from Deep Sequencing

R/01-likelihood.R
In DeepCNV: Exploiting Normal Contamination to Infer Copy Number from Deep Sequencing