runEMclonalCN: Function to run the Expectation Maximization Algorithm in...

Description Usage Arguments Details Value Author(s) References See Also Examples

View source: R/hmmClonal.R

Description

Function to run the Expectation Maximization Algorithm for inference of model parameters: cellular prevalence, normal proportion, tumour ploidy. This is a key function in the TitanCNA package and is the most computationally intense. This function makes calls to a C subroutine that allows the algorithm to be run more efficiently.

Usage

1
2
3
4
5
6
runEMclonalCN(data, params,
              txnExpLen = 1e15, txnZstrength = 5e05, maxiter = 15,
              maxiterUpdate = 1500, pseudoCounts = 1e-300,
              normalEstimateMethod = "map", estimateS = TRUE, 
              estimatePloidy = TRUE, useOutlierState = FALSE, 
              likChangeThreshold = 0.001, verbose = TRUE)

Arguments

data

list object that contains the components for the data to be analyzed. chr, posn, ref, and tumDepth that can be obtained using loadAlleleCounts, and logR that can be obtained using correctReadDepth and getPositionOverlap (see Example).

params

list object that contains major parameters: list object containing copy number and allelic ratio genotype parameters. list object containing the normal contamination parameters. list object containing the tumour ploidy parameters. list object containing the subclonality (cellular prevalence and clonal cluster) parameters. params can be obtained from loadDefaultParameters.

txnExpLen

Influences prior probability of genotype transitions in the HMM. Smaller value have lower tendency to change state; however, too small and it produces underflow problems. 1e-9 works well for up to 3 million total positions.

txnZstrength

Influences prior probability of clonal cluster transitions in the HMM. Smaller value have lower tendency to change clonal cluster state. 1e-9 works well for up to 3 million total positions.

pseudoCounts

Small, machine precision values to add to probabilities to avoid underflow. For example, .Machine$double.eps.

maxiter

Maximum number of expectation-maximization iterations allowed. In practice, for TitanCNA, it will usually not exceed 20.

maxiterUpdate

Maximum number of coordinate descent iterations during the M-step (of EM algorithm) when parameters are estimated.

normalEstimateMethod

Specifies how to handle normal proportion estimation. Using map will use the maximum a posteriori estimation. Using fixed will not estimate the normal proportion; the normal proportion will be fixed to whatever is specified in params$normalParams$n_0. See Details.

estimateS

Logical indicating whether to account for clonality and estimate subclonal events. See Details.

estimatePloidy

Logical indicating whether to estimate and account for tumour ploidy.

useOutlierState

Logical indicating whether an additional outlier state should be used. In practice, this is usually not necessary.

likChangeThreshold

EM convergence criteria - stop EM when complete log likelihood changes less than the proportion specified by this argument.

verbose

Set to FALSE to suppress program messages.

Details

This function is implemented with the "foreach" package and therefore supports parallelization. See "doMC" or "doMPI" for some parallelization packages.

The forwards-backwards algorithm is used for the E-step in the EM algorithm. This is done using a call to a C subroutine for each chromosome. The maximization step uses maximum a posteriori (MAP) for estimation of parameters.

If the sample has absolutely no normal contamination, then assign nParams$n_0 <- 0 and use argument normalEstimateMethod="fixed".

estimateS should always be set to TRUE. If no subclonality is expected, then use loadDefaultParameters(numberClonalClusters=1). Using estimateS=FALSE and loadDefaultParameters(numberClonalClusters=0) is gives more or less the same results.

Value

list with components for results returned from the EM algorithm, including converged parameters, posterior marginal responsibilities, log likelihood, and original parameter settings.

n

Converged estimate for normal contamination parameter. numeric array containing estimates at each EM iteration.

s

Converged estimate(s) for cellular prevalence parameter(s). This value is defined as the proportion of tumour sample that does not contain the aberrant genotype. This will contrast what is output in outputTitanResults. numeric array containing estimates at each EM iteration. If more than one cluster is specified, then s is a numeric matrix.

var

Converged estimates for variance parameter of the Gaussian mixtures used to model the log ratio data. numeric matrix containing estimates at each EM iteration.

phi

Converged estimate for tumour ploidy parameter. numeric array containing estimates at each EM iteration.

piG

Converged estimate for initial genotype state distribution. numeric matrix containing estimates at each EM iteration.

piZ

Converged estimate for initial clonal cluster state distribution. numeric matrix containing estimates at each EM iteration.

muR

Mean of binomial mixtures computed as a function of s and n. numeric matrix containing estimates at each EM iteration. See References for mathematical details.

muC

Mean of Gaussian mixtures computed as a function of s, n, and phi. numeric matrix containing estimates at each EM iteration. See References for mathematical details.

loglik

Posterior Log-likelihood that includes data likelihood and the priors. numeric array containing estimates at each EM iteration.

rhoG

Posterior marginal probabilities for the genotype states computed during the E-step. Only the final iteration is returned as a numeric matrix.

rhoZ

Posterior marginal probabilities for the clonal cluster states computed during the E-step. Only the final iteration is returned as a numeric matrix.

genotypeParams

Original genotype parameters. See loadDefaultParameters.

ploidyParams

Original tumour ploidy parameters. See loadDefaultParameters.

normalParams

Original normal contamination parameters. See loadDefaultParameters.

clonalParams

Original subclonal parameters. See loadDefaultParameters.

txnExpLen

Original genotype transition expected length. See loadDefaultParameters.

txnZstrength

Original clonal cluster transition expected length. See loadDefaultParameters.

useOutlierState

Original setting indicating usage of outlier state. See loadDefaultParameters.

Author(s)

Gavin Ha <gavinha@gmail.com>

References

Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)

See Also

"foreach", "doMC", "doMPI", loadAlleleCounts, loadDefaultParameters, viterbiClonalCN

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
message('Running TITAN ...')
#### LOAD DATA ####
infile <- system.file("extdata", "test_alleleCounts_chr2.txt", 
              package = "TitanCNA")
data <- loadAlleleCounts(infile)

#### LOAD PARAMETERS ####
message('titan: Loading default parameters')
numClusters <- 2
params <- loadDefaultParameters(copyNumber = 5, 
              numberClonalClusters = numClusters, skew = 0.1)

#### READ COPY NUMBER FROM HMMCOPY FILE ####
message('titan: Correcting GC content and mappability biases...')
tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA")
normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA")
gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA")
map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA")
cnData <- correctReadDepth(tumWig, normWig, gc, map)
logR <- getPositionOverlap(data$chr, data$posn, cnData)
data$logR <- log(2^logR) #transform to natural log

#### FILTER DATA FOR DEPTH, MAPPABILITY, NA, etc ####
data <- filterData(data, 1:24, minDepth = 10, maxDepth = 200, map = NULL)

#### EM (FWD-BACK) TO TRAIN PARAMETERS ####
#### Can use parallelization packages ####
K <- length(params$genotypeParams$alphaKHyper)
params$genotypeParams$alphaKHyper <- rep(500, K)
params$ploidyParams$phi_0 <- 1.5
convergeParams <- runEMclonalCN(data, params,
                                maxiter = 3, maxiterUpdate = 500, 
                                txnExpLen = 1e15, txnZstrength = 5e5, 
                                useOutlierState = FALSE, 
                                normalEstimateMethod = "map", 
                                estimateS = TRUE, estimatePloidy = TRUE)

gavinha/TitanCNA documentation built on April 22, 2021, 9:38 a.m.