runEMclonalCN: Function to run the Expectation Maximization Algorithm in...
In gavinha/TitanCNA: Subclonal copy number and LOH prediction from whole genome sequencing of tumours

Description Usage Arguments Details Value Author(s) References See Also Examples

Function to run the Expectation Maximization Algorithm for inference of model parameters: cellular prevalence, normal proportion, tumour ploidy. This is a key function in the TitanCNA package and is the most computationally intense. This function makes calls to a C subroutine that allows the algorithm to be run more efficiently.

runEMclonalCN(data, params,
              txnExpLen = 1e15, txnZstrength = 5e05, maxiter = 15,
              maxiterUpdate = 1500, pseudoCounts = 1e-300,
              normalEstimateMethod = "map", estimateS = TRUE, 
              estimatePloidy = TRUE, useOutlierState = FALSE, 
              likChangeThreshold = 0.001, verbose = TRUE)

`data`	`list` object that contains the components for the data to be analyzed. `chr`, `posn`, `ref`, and `tumDepth` that can be obtained using `loadAlleleCounts`, and `logR` that can be obtained using `correctReadDepth` and `getPositionOverlap` (see Example).
`params`	`list` object that contains major parameters: `list` object containing copy number and allelic ratio genotype parameters. `list` object containing the normal contamination parameters. `list` object containing the tumour ploidy parameters. `list` object containing the subclonality (cellular prevalence and clonal cluster) parameters. `params` can be obtained from `loadDefaultParameters`.
`txnExpLen`	Influences prior probability of genotype transitions in the HMM. Smaller value have lower tendency to change state; however, too small and it produces underflow problems. `1e-9` works well for up to 3 million total positions.
`txnZstrength`	Influences prior probability of clonal cluster transitions in the HMM. Smaller value have lower tendency to change clonal cluster state. `1e-9` works well for up to 3 million total positions.
`pseudoCounts`	Small, machine precision values to add to probabilities to avoid underflow. For example, `.Machine$double.eps`.
`maxiter`	Maximum number of expectation-maximization iterations allowed. In practice, for TitanCNA, it will usually not exceed 20.
`maxiterUpdate`	Maximum number of coordinate descent iterations during the M-step (of EM algorithm) when parameters are estimated.
`normalEstimateMethod`	Specifies how to handle normal proportion estimation. Using `map` will use the maximum a posteriori estimation. Using `fixed` will not estimate the normal proportion; the normal proportion will be fixed to whatever is specified in `params$normalParams$n_0`. See Details.
`estimateS`	Logical indicating whether to account for clonality and estimate subclonal events. See Details.
`estimatePloidy`	Logical indicating whether to estimate and account for tumour ploidy.
`useOutlierState`	Logical indicating whether an additional outlier state should be used. In practice, this is usually not necessary.
`likChangeThreshold`	EM convergence criteria - stop EM when complete log likelihood changes less than the proportion specified by this argument.
`verbose`	Set to FALSE to suppress program messages.

This function is implemented with the "foreach" package and therefore supports parallelization. See "doMC" or "doMPI" for some parallelization packages.

The forwards-backwards algorithm is used for the E-step in the EM algorithm. This is done using a call to a C subroutine for each chromosome. The maximization step uses maximum a posteriori (MAP) for estimation of parameters.

If the sample has absolutely no normal contamination, then assign nParams$n_0 <- 0 and use argument normalEstimateMethod="fixed".

estimateS should always be set to TRUE. If no subclonality is expected, then use loadDefaultParameters(numberClonalClusters=1). Using estimateS=FALSE and loadDefaultParameters(numberClonalClusters=0) is gives more or less the same results.

list with components for results returned from the EM algorithm, including converged parameters, posterior marginal responsibilities, log likelihood, and original parameter settings.

`n`	Converged estimate for normal contamination parameter. `numeric array` containing estimates at each EM iteration.
`s`	Converged estimate(s) for cellular prevalence parameter(s). This value is defined as the proportion of tumour sample that does not contain the aberrant genotype. This will contrast what is output in `outputTitanResults`. `numeric array` containing estimates at each EM iteration. If more than one cluster is specified, then `s` is a `numeric matrix`.
`var`	Converged estimates for variance parameter of the Gaussian mixtures used to model the log ratio data. `numeric matrix` containing estimates at each EM iteration.
`phi`	Converged estimate for tumour ploidy parameter. `numeric array` containing estimates at each EM iteration.
`piG`	Converged estimate for initial genotype state distribution. `numeric matrix` containing estimates at each EM iteration.
`piZ`	Converged estimate for initial clonal cluster state distribution. `numeric matrix` containing estimates at each EM iteration.
`muR`	Mean of binomial mixtures computed as a function of `s` and `n`. `numeric matrix` containing estimates at each EM iteration. See References for mathematical details.
`muC`	Mean of Gaussian mixtures computed as a function of `s`, `n`, and `phi`. `numeric matrix` containing estimates at each EM iteration. See References for mathematical details.
`loglik`	Posterior Log-likelihood that includes data likelihood and the priors. `numeric array` containing estimates at each EM iteration.
`rhoG`	Posterior marginal probabilities for the genotype states computed during the E-step. Only the final iteration is returned as a `numeric matrix`.
`rhoZ`	Posterior marginal probabilities for the clonal cluster states computed during the E-step. Only the final iteration is returned as a `numeric matrix`.
`genotypeParams`	Original genotype parameters. See `loadDefaultParameters`.
`ploidyParams`	Original tumour ploidy parameters. See `loadDefaultParameters`.
`normalParams`	Original normal contamination parameters. See `loadDefaultParameters`.
`clonalParams`	Original subclonal parameters. See `loadDefaultParameters`.
`txnExpLen`	Original genotype transition expected length. See `loadDefaultParameters`.
`txnZstrength`	Original clonal cluster transition expected length. See `loadDefaultParameters`.
`useOutlierState`	Original setting indicating usage of outlier state. See `loadDefaultParameters`.

Gavin Ha <gavinha@gmail.com>

Ha, G., Roth, A., Khattra, J., Ho, J., Yap, D., Prentice, L. M., Melnyk, N., McPherson, A., Bashashati, A., Laks, E., Biele, J., Ding, J., Le, A., Rosner, J., Shumansky, K., Marra, M. A., Huntsman, D. G., McAlpine, J. N., Aparicio, S. A. J. R., and Shah, S. P. (2014). TITAN: Inference of copy number architectures in clonal cell populations from tumour whole genome sequence data. Genome Research, 24: 1881-1893. (PMID: 25060187)

"foreach", "doMC", "doMPI", loadAlleleCounts, loadDefaultParameters, viterbiClonalCN

message('Running TITAN ...')
#### LOAD DATA ####
infile <- system.file("extdata", "test_alleleCounts_chr2.txt", 
              package = "TitanCNA")
data <- loadAlleleCounts(infile)

#### LOAD PARAMETERS ####
message('titan: Loading default parameters')
numClusters <- 2
params <- loadDefaultParameters(copyNumber = 5, 
              numberClonalClusters = numClusters, skew = 0.1)

#### READ COPY NUMBER FROM HMMCOPY FILE ####
message('titan: Correcting GC content and mappability biases...')
tumWig <- system.file("extdata", "test_tum_chr2.wig", package = "TitanCNA")
normWig <- system.file("extdata", "test_norm_chr2.wig", package = "TitanCNA")
gc <- system.file("extdata", "gc_chr2.wig", package = "TitanCNA")
map <- system.file("extdata", "map_chr2.wig", package = "TitanCNA")
cnData <- correctReadDepth(tumWig, normWig, gc, map)
logR <- getPositionOverlap(data$chr, data$posn, cnData)
data$logR <- log(2^logR) #transform to natural log

#### FILTER DATA FOR DEPTH, MAPPABILITY, NA, etc ####
data <- filterData(data, 1:24, minDepth = 10, maxDepth = 200, map = NULL)

#### EM (FWD-BACK) TO TRAIN PARAMETERS ####
#### Can use parallelization packages ####
K <- length(params$genotypeParams$alphaKHyper)
params$genotypeParams$alphaKHyper <- rep(500, K)
params$ploidyParams$phi_0 <- 1.5
convergeParams <- runEMclonalCN(data, params,
                                maxiter = 3, maxiterUpdate = 500, 
                                txnExpLen = 1e15, txnZstrength = 5e5, 
                                useOutlierState = FALSE, 
                                normalEstimateMethod = "map", 
                                estimateS = TRUE, estimatePloidy = TRUE)