# Init: Parameter initialization for a Poisson mixture model (HTSCluster: Clustering High-Throughput Transcriptome Sequencing (HTS) Data)

## Description

These functions implement a variety of initialization methods for the parameters of a Poisson mixture model: the Small EM initialization strategy (emInit) described in Rau et al. (2011), a K-means initialization strategy (kmeanInit) that is itself used to initialize the small EM strategy, the splitting small-EM initialization strategy (splitEMInit) based on that described in Papastamoulis et al. (2014), and a function to initialize a small-EM strategy using the posterior probabilities (probaPostInit) obtained from a previous run with one fewer cluster following the splitting strategy.

## Usage

```r
emInit(y, g, conds, norm, alg.type = "EM", init.runs, init.iter,
       fixed.lambda, equal.proportions, verbose)

kmeanInit(y, g, conds, norm, fixed.lambda, equal.proportions)

splitEMInit(y, g, conds, norm, alg.type, fixed.lambda, equal.proportions,
            prev.labels, prev.probaPost, init.runs, init.iter, verbose)

probaPostInit(y, g, conds, norm, alg.type = "EM", fixed.lambda,
              equal.proportions, probaPost.init, init.iter, verbose)
```

## Arguments

- `y`: (n x q) matrix of observed counts for n observations and q variables.
- `g`: Number of clusters. If `fixed.lambda` contains a list of lambda values to be fixed, `g` corresponds to the number of clusters in addition to those fixed.
- `conds`: Vector of length q defining the condition (treatment group) for each variable (column) of `y`.
- `norm`: The type of estimator used to normalize for differences in library size: "TC" for total count, "UQ" for upper quartile, "Med" for median, "DESeq" for the normalization method of the DESeq package (Anders and Huber, 2010), or "TMM" for the TMM normalization method (Robinson and Oshlack, 2010). Can also be a vector of length q containing pre-estimated library size estimates for each sample.
- `alg.type`: Algorithm used for parameter estimation ("EM" or "CEM" for the EM or CEM algorithm, respectively).
- `init.runs`: For the Small-EM algorithm, the number of independent runs to be performed. For the splitting Small-EM algorithm, the number of cluster splits to be performed.
- `init.iter`: Number of iterations to run within each Small-EM algorithm.
- `fixed.lambda`: If one or more clusters with fixed values of lambda is desired, a list containing vectors of length d (the number of conditions). Note that the chosen values of lambda must satisfy the constraint noted in the technical report.
- `equal.proportions`: If TRUE, the cluster proportions are set to be equal for all clusters. Default is FALSE (unequal cluster proportions).
- `prev.labels`: Vector of length n of cluster labels obtained from the previous run (with g-1 clusters).
- `prev.probaPost`: n x (g-1) matrix of the conditional probabilities of each observation belonging to each of the g-1 clusters from the previous run.
- `probaPost.init`: n x g matrix of the conditional probabilities of each observation belonging to each of the g clusters following the splitting strategy in the `splitEMInit` function.
- `verbose`: If TRUE, include verbose output.

## Details

In practice, the user will not directly call the initialization functions described here; they are called indirectly for a single number of clusters through the `PoisMixClus` function (via `init.type`), or for a sequence of cluster numbers through the `PoisMixClusWrapper` function (via `gmin.init.type` and `split.init`).
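For orientation, indirect use might look like the following commented sketch. The argument names (`init.type`, `gmin.init.type`, `split.init`) are those mentioned above; the remaining arguments and values shown here are illustrative, so consult the `PoisMixClus` and `PoisMixClusWrapper` help pages for the exact signatures:

```r
## Initialization chosen indirectly via PoisMixClus for a single g:
## run <- PoisMixClus(y, g = 4, conds = conds, norm = "TC",
##                    init.type = "small-em")
##
## ... or via PoisMixClusWrapper for a sequence of cluster numbers,
## with the splitting small-EM strategy enabled:
## runs <- PoisMixClusWrapper(y, gmin = 1, gmax = 8, conds = conds,
##                            norm = "TC", gmin.init.type = "small-em",
##                            split.init = TRUE)
```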

To initialize parameter values for the EM and CEM algorithms, for the Small-EM strategy (Biernacki et al., 2003) we use the emInit function as follows. For a given number of independent runs (given by init.runs), the following procedure is used to obtain parameter values: first, a K-means algorithm (MacQueen, 1967) is run to partition the data into g clusters (\hat{z}^(0)). Second, initial parameter values π^(0) and λ^(0) are calculated (see Rau et al. (2011) for details). Third, a given number of iterations of an EM algorithm are run (defined by init.iter), using π^(0) and λ^(0) as initial values. Finally, among the init.runs sets of parameter values, we use \hat{λ} and \hat{π} corresponding to the highest log likelihood or completed log likelihood to initialize the subsequent full EM or CEM algorithms, respectively.
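The selection step of this procedure can be sketched in base R as follows. This is a minimal illustration of the idea (K-means partition, initial parameter values from the partition, score by log likelihood, keep the best of several runs), not the HTSCluster implementation; the helper name `small_em_run` and the simulated data are invented for the example, and the short refining EM iterations of step three are omitted.

```r
set.seed(1)
## Simulated counts: rows 1-100 ~ Poisson(2), rows 101-200 ~ Poisson(10)
y <- matrix(rpois(200 * 4, lambda = rep(c(2, 10), each = 100)), ncol = 4)

small_em_run <- function(y, g) {
  ## Step 1: K-means partition of the observations into g clusters
  km <- kmeans(y, centers = g, nstart = 1)
  z <- km$cluster
  ## Step 2: initial parameter values computed from the partition
  pi0 <- as.numeric(table(factor(z, levels = 1:g))) / nrow(y)
  lambda0 <- t(sapply(1:g, function(k) colMeans(y[z == k, , drop = FALSE])))
  ## Step 3 (omitted here): a few EM iterations would refine pi0 and lambda0.
  ## Score the candidate parameters by the Poisson mixture log likelihood:
  ll <- sum(sapply(1:nrow(y), function(i)
    log(sum(pi0 * sapply(1:g, function(k)
      prod(dpois(y[i, ], lambda0[k, ])))))))
  list(pi = pi0, lambda = lambda0, log.like = ll)
}

## Step 4: among init.runs independent short runs, keep the parameters
## with the highest log likelihood
runs <- replicate(5, small_em_run(y, g = 2), simplify = FALSE)
best <- runs[[which.max(sapply(runs, `[[`, "log.like"))]]
```

The returned `best$pi` and `best$lambda` play the role of the initial values passed on to the full EM or CEM algorithm.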

For the splitting small EM initialization strategy, we implement an approach similar to that described in Papastamoulis et al. (2014), where the cluster from the previous run (with g-1 clusters) with the largest entropy is chosen to be split into two new clusters, followed by a small EM run as described above.
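The entropy criterion for choosing the cluster to split can be illustrated in base R: for each cluster k, compute the entropy of its posterior probabilities over observations, and pick the cluster with the largest entropy (i.e., the most ambiguous assignments). The posterior matrix below is simulated for the example; it stands in for `prev.probaPost`.

```r
## Simulated posterior probabilities from a previous run with 2 clusters:
## rows 1-3 are confidently assigned, rows 4-6 are ambiguous
probaPost <- matrix(c(0.99, 0.01,
                      0.98, 0.02,
                      0.97, 0.03,
                      0.55, 0.45,
                      0.60, 0.40,
                      0.52, 0.48), ncol = 2, byrow = TRUE)

## Per-cluster entropy -sum_i t_ik log(t_ik), with 0 * log(0) taken as 0
cluster_entropy <- function(tik) {
  apply(tik, 2, function(p) -sum(ifelse(p > 0, p * log(p), 0)))
}

ent <- cluster_entropy(probaPost)
to.split <- which.max(ent)   # cluster chosen to be split into two
```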

## Value

- `pi.init`: Vector of length g containing the estimate of \hat{π} corresponding to the highest log likelihood (or completed log likelihood) from the chosen initialization strategy.
- `lambda.init`: (d x g) matrix containing the estimate of \hat{λ} corresponding to the highest log likelihood (or completed log likelihood) from the chosen initialization strategy, where d is the number of conditions and g is the number of clusters.
- `lambda`: (d x g) matrix containing the estimate of \hat{λ} arising from the splitting initialization and small EM run for a single split, where d is the number of conditions and g is the number of clusters.
- `pi`: Vector of length g containing the estimate of \hat{π} arising from the splitting initialization and small EM run for a single split, where g is the number of clusters.
- `log.like`: Log likelihood arising from the splitting initialization and small EM run for a single split.

## Author(s)

Andrea Rau <andrea.rau@jouy.inra.fr>

## References

Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology, 11(R106), 1-28.

Biernacki, C., Celeux, G., Govaert, G. (2003) Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. Computational Statistics and Data Analysis, 41(1), 561-575.

MacQueen, J. B. (1967) Some methods for classification and analysis of multivariate observations. In Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability, number 1, pages 281-297. Berkeley, University of California Press.

Papastamoulis, P., Martin-Magniette, M.-L., and Maugis-Rabusseau, C. (2014). On the estimation of mixtures of Poisson regression models with large number of components. Computational Statistics and Data Analysis: 3rd special Issue on Advances in Mixture Models, DOI: 10.1016/j.csda.2014.07.005.

Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at http://hal.inria.fr/inria-00638082.

Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):1420-1427.

Robinson, M. D. and Oshlack, A. (2010) A scaling normalization method for differential expression analysis of RNA-seq data. Genome Biology, 11(R25).

## See Also

`PoisMixClus` for Poisson mixture model estimation for a given number of clusters; `PoisMixClusWrapper` for Poisson mixture model estimation and model selection over a sequence of cluster numbers.
## Examples

```r
set.seed(12345)

## Simulate data as shown in Rau et al. (2011)
## Library size setting "A", high cluster separation
## n = 500 observations
simulate <- PoisMixSim(n = 500, libsize = "A", separation = "high")
y <- simulate$y
conds <- simulate$conditions

## Calculate initial values for lambda and pi using the Small-EM
## initialization (4 classes, PMM-II model with "TC" library size)
##
## init.values <- emInit(y, g = 4, conds,
##     norm = "TC", alg.type = "EM",
##     init.runs = 50, init.iter = 10, fixed.lambda = NA,
##     equal.proportions = FALSE, verbose = FALSE)
## pi.init <- init.values$pi.init
## lambda.init <- init.values$lambda.init
```