R/01-likelihood.R

Defines functions simReads cnvLikelihood CNVariant makeDiscretePrior makeGenotypes

Documented in CNVariant cnvLikelihood simReads

# Copyright (C) Kevin R. Coombes 2014-2017

###################################
### To understand what's going on, we need to briefly explain the whole model.
###
### The observed data is considered to be a collection of triples of the form
###     D_i = (K_i, N_i, V_i)
### where for the observed variant i, K_i is the number of variant reads,
### N_i is the total number of reads, and V_i is the "observed variant type",
### which is a binary factor with values in the set {SNP, Mutation}.
### We write D = (D_i) for the collection of all the data.
###
### The two global parameters that we care about are Theta = (nu, S),
### where nu in [0, 1] is the fraction of normal cells and S in {N, D, G}
### is the copy number state. (We simple-mindedly assume that there are
### no subclones to worry about, and we never gain or delete more than one
### copy of anything.)
###
### There are also a collection of nuisance parameters (G_i, T_i) where
### G_i is the true genotype-pair (one for normal, one for tumor) at a variant
### and T_i in {SNP, Mutation, LOH, BackMut} is the true variant type.
###
### By Bayes Rule, the posterior will be given by
###     P(Theta | D) = P(D | Theta) P(Theta) / P(D)
### We can compute the denominator as
###     P(D) = \sum_theta P(D | Theta = theta) P(Theta = theta)
### To implement this step, we will (eventually) just normalize everything
### by brute force.  Before that, however, we have to specify the likelihood
### model and explain how to compute it. We have
###     P(D | Theta) = \prod_i P(D_i | Theta)
### This formula follows from the assumption that the variants are independent
### sources of information. Now we can drop the subscript "i" and just look at
### the likelihood for the data froma single variant. We have
###     P(D | Theta) = \sum_t P(D, T=t | Theta) = \sum_t P(D | T=t, Theta) P(T=t | Theta).
### Note that this formula comes from integrating over the (distinct) possible
### values for the true variant state. We can also integrate over the possible
### genotypes to get
###     P(D | Theta) = \sum_t \sum_g P(D, G=g | T=t, Theta) P(T=t | Theta).
###         = \sum_t \sum_g P(D | G=g, T=t, Theta) P(G=g | T=t, Theta) P(T=t | Theta).
### Now we look at the three pieces we need to compute for that last part.
### First, we have
###     P(t=t | Theta) = P(T=t)
### since the global parameters are irrelevant to the true variant type. Since T is
### a discrete random variable, we just need to specify the distribution used for this
### mmodel. (SNPs are common; mutations are rare.)
### Second, we have
###     P(G=g, | T=t, Theta) = P(G | T, nu, S) = P(G | T, S)
### since the normal fraction nu tells us nothing about the true genotypes.
### This is another discrete distribution that we can specify, but it is more
### compiicated since there are four variant-types (T), three copy number states (S),
### and lots of possible genotypes (maybe as many as 24, but at least 18).
### Finally, we have
###     P(D | G, T, Theta) = P(D | G, T, nu, S) = P(K, N, V | G, T, nu, S)
###         = P(K, N | G, nu) P(V | T)
### since we model the read-counts (K, N) as independent of the observed variant
### type V. The term P(V | T) is yet another discrete distribution. The final piece,
### hoever, is modeled by a biomial distribution
###     P(K, N | G, nu) = Binom(K, N, phi)
### where phi = Phi(G, nu) is completely computable from the genoptype and the
### normal fraction.
###
### The prior on the global  parameters can be specified as
###     P(Theta) = P(nu) P(S)
### where
###     P(nu) is a beta distribution and
###     P(S) is a discrete distribuiton over the three copy number states.

###################################
### These are the normal-tumor genotype pairs that we consider
makeGenotypes <- function() {
  ### create the normal-tumor genotype pairs
  genotypes <- function(M, N) {
    nl <- sapply(0:M, function(m) {
      paste(rep(c("R", "V"), times=c(M-m, m)), collapse="")
    })
    tu <- sapply(0:N, function(m) {
      paste(rep(c("R", "V"), times=c(N-m, m)), collapse="")
    })
    data.frame(Normal=rep(nl, each=M), Tumor=rep(tu, times=N))
  }
  ### only consider gains or deletions of one copy
  g22 <- genotypes(2,2)
  g21 <- genotypes(2,1)
  g23 <- genotypes(2,3)
  gt <- rbind(g22, g21, g23)
  ### we can compute the copy number state from the genotype
  gt$S <- factor(c("Deleted", "Normal", "Gained")[sapply(as.character(gt$Tumor), nchar)])
  ### with work, we can also compute the variant type
  ### start by coiunting the total number of refernce (R) and variant (V) alleles
  av <- sapply(strsplit(paste(as.character(gt$Normal),
                              as.character(gt$Tumor), sep=""), ""),
               function(x) sum(x=="V"))
  ar <- sapply(strsplit(paste(as.character(gt$Normal),
                              as.character(gt$Tumor), sep=""), ""),
               function(x) sum(x=="R"))
  ### if all the reads are the same at this base, we never see it...
  strange <- gt[av*ar==0,]
  ### now have at least one referencde and one variant allele
  gt <- gt[av > 0 & ar > 0,]
  ### count the number R's and V'2s in both nromal (N) and tumor (T)
  NR <- sapply(strsplit(as.character(gt$Normal), ""), function(x) sum(x=="R"))
  NV <- sapply(strsplit(as.character(gt$Normal), ""), function(x) sum(x=="V"))
  TR <- sapply(strsplit(as.character(gt$Tumor), ""), function(x) sum(x=="R"))
  TV <- sapply(strsplit(as.character(gt$Tumor), ""), function(x) sum(x=="V"))
  ### define te four variant types
  mut <- NV == 0 & TV > 0
  bak <- NR == 0 & TR > 0
  loh <- !mut & !bak & (TR==0 | TV==0) & gt$S != "Deleted"
  snp <- !mut & !bak & !loh
  ### assemble them into a four-valued factor
  foo <- rep("BackMut", nrow(gt))
  foo[mut] <- "Mutation"
  foo[loh] <- "LOH"
  foo[snp] <- "SNP"
  gt$V <- factor(foo)
  rownames(gt) <- paste(as.character(gt$Normal), as.character(gt$Tumor), sep=".")
  gt
}

basecase <- makeGenotypes()

cnSet <- levels(basecase$S)                  # copy number status
vtSet <- levels(basecase$V)                  # true variant type

###################################
### Internal models
### KRC: Should we allow users to modify these (pT or pVgT or pGgTS)?

### P(T=t)
### SNPs are common; mutations are rare.
pT <- c("BackMut"=0.001, "LOH"=0.01, "Mutation"=0.1, "SNP"=0.889)

### P(V | T)
### Most of the time, the reported variant type is likely to be correct.
### We're more likely to make the mistake of calling a Mutation a SNP than
### the other way around because of corruption or misunderstandings in the
### dbSNP database.
pVgT <- matrix(c(0.1, 0.9, 0.02, 0.98, 0.99, 0.01, 0.02, 0.98), nrow=2, ncol=4)
rownames(pVgT) <- c("Mutation", "SNP")
colnames(pVgT) <- c("BackMut", "LOH", "Mutation", "SNP")

### P(G | T, S)
### We compute this out of the 'basecase' object. We make the assumption that
### all genotype combinations that exist in 'basecase' are conditionally equally
### likely.
makeDiscretePrior <- function() {
  pGgTS <- array(0,
                 dim=c(nrow(basecase), length(cnSet), length(vtSet)),
                 dimnames = list(NULL, cnSet, vtSet))
  for ( cn in cnSet ) {
    for (vt in vtSet ) {
      w <- which(basecase$S == cn & basecase$V == vt)
      L <- length(w)
      if (L > 0) {
        pGgTS[w, cn,vt] <- 1/L
    }
    }
  }
  pGgTS
}
pGgTS <- makeDiscretePrior()

###################################
### key part of the likelihood model is to compute the expected fraction of
### variant reads as a function of the genotype and the normal fraction, nu.
### We do this by producing functions of from the given genotypes.
CNVariant <- function(gNormal, gTumor) {
  nl <- strsplit(gNormal, "")[[1]]
  rn <- sum(nl == "R")
  vn <- sum(nl == "V")
  tu <- strsplit(gTumor, "")[[1]]
  rt <- sum(tu == "R")
  vt <- sum(tu == "V")
  function(nu) {
    (vn*nu + vt*(1 - nu)) / ((rn + vn)*nu + (rt + vt)*(1 - nu))
  }
}

phis <- sapply(1:nrow(basecase), function(i) {
  gN <- as.character(basecase$Normal[i])
  gT <- as.character(basecase$Tumor[i])
  CNVariant(gN, gT)
})
names(phis) <- rownames(basecase)

###################################
### actually compute the log-likelihood
cnvLikelihood <- function(nus, obs) {
  if (any(nus < 0)) stop("Normal fractions must be non-negative.")
  if (any(nus > 1)) stop("Normal fractions must be between 0 and 1.")
  # need a three-dimensional array
  #  1. grid points over the beta distribution on nu
  #  2. copy number choices * variant type choices (basecase)
  #  3. number of data rows in 'obs'
  hiddenloglike <- array(NA,
                         dim=c(length(nus), nrow(basecase), nrow(obs)),
                         dimnames=list(NULL, rownames(basecase), rownames(obs)))
  pd <- basecase # hidden discrete genotype states
  for (i in 1:nrow(obs)) {   # each data row
    K <- obs$K[i] # variant reads
    N <- obs$N[i] # total reads
    for (pj in 1:nrow(pd)) { # each hidden discrete state
      phi <- phis[[pj]](nus) # expected success as function of nu
      hiddenloglike[,pj,i] <- dbinom(K, N, phi, log=TRUE)
    }
  }
  hiddenloglike
}

###################################
simReads <- function(nMut, nVar, nu, cnstate=cnSet, depth=100, sdepth=25, ...) {
  cnstate <- match.arg(cnstate)
  bc <- basecase[basecase$S == cnstate,]
  wv <- c(sample(which(!(bc$V %in% c("BackMut", "Mutation"))), nVar, replace=TRUE),
          sample(which(bc$V == "Mutation"), nMut, replace=TRUE))
  design <- bc[wv,]
  rownames(design) <- NULL
  design$nu <- nu
  design$N <- round(rnorm(nrow(design), depth, sdepth))
  nVars <- apply(design, 1, function(arow) {
    phi <- CNVariant(as.character(arow[1]), as.character(arow[2]))
    rbinom(1, as.numeric(arow[6]), phi(nu))
  })
  design$K <- nVars
  v <- as.character(design$V)
  v[v!="Mutation"] <- "SNP"
  design$Valt <- design$V
  design$V <- factor(v)
  design
}

Try the DeepCNV package in your browser

Any scripts or data that you put into this service are public.

DeepCNV documentation built on May 2, 2019, 5:23 p.m.