These functions implement a variety of initialization methods for the parameters of a Poisson mixture model: the Small-EM initialization strategy (`emInit`) described in Rau et al. (2011), a K-means initialization strategy (`kmeanInit`) that is itself used to initialize the Small-EM strategy, the splitting Small-EM initialization strategy (`splitEMInit`) based on that described in Papastamoulis et al. (2014), and a function to initialize a Small-EM strategy using the posterior probabilities (`probaPostInit`) obtained from a previous run with one fewer cluster following the splitting strategy.

```
emInit(y, g, conds, norm, alg.type = "EM",
       init.runs, init.iter, fixed.lambda, equal.proportions, verbose)

kmeanInit(y, g, conds, norm, fixed.lambda,
          equal.proportions)

splitEMInit(y, g, conds, norm, alg.type, fixed.lambda,
            equal.proportions, prev.labels, prev.probaPost, init.runs,
            init.iter, verbose)

probaPostInit(y, g, conds, norm, alg.type = "EM",
              fixed.lambda, equal.proportions, probaPost.init, init.iter,
              verbose)
```

`y` |
( |

`g` |
Number of clusters. If |

`conds` |
Vector of length |

`norm` |
The type of estimator to be used to normalize for differences in library size: (“ |

`alg.type` |
Algorithm to be used for parameter estimation (“ |

`init.runs` |
In the case of the Small-EM algorithm, the number of independent runs to be performed. In the case of the splitting Small-EM algorithm, the number of cluster splits to be performed in the splitting small-EM initialization. |

`init.iter` |
The number of iterations to run within each Small-EM algorithm |

`fixed.lambda` |
If one (or more) clusters with fixed values of lambda are desired, a list containing vectors of length |

`equal.proportions` |
If |

`prev.labels` |
A vector of length |

`prev.probaPost` |
An |

`probaPost.init` |
An |

`verbose` |
If |

In practice, the user will not directly call the initialization functions described here; they are called indirectly for a single number of clusters through the `PoisMixClus` function (via `init.type`) or for a sequence of cluster numbers through the `PoisMixClusWrapper` function (via `gmin.init.type` and `split.init`).

To initialize parameter values for the EM and CEM algorithms under the Small-EM strategy (Biernacki et al., 2003), the `emInit` function proceeds as follows. Each of the `init.runs` independent runs consists of three steps. First, a K-means algorithm (MacQueen, 1967) is run to partition the data into `g` clusters (*\hat{z}^(0)*). Second, initial parameter values *π^(0)* and *λ^(0)* are calculated from this partition (see Rau et al. (2011) for details). Third, `init.iter` iterations of an EM algorithm are run, using *π^(0)* and *λ^(0)* as initial values. Finally, among the `init.runs` resulting sets of parameter values, the *\hat{λ}* and *\hat{π}* with the highest log likelihood (for EM) or completed log likelihood (for CEM) are used to initialize the subsequent full algorithm.
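The Small-EM procedure can be illustrated with a minimal base-R sketch. This is not the package's implementation: it assumes a simplified Poisson mixture in which `y[i, j]` has mean `lambda[j, k]` for an observation in cluster `k`, with no library-size normalization, no `conds` grouping, and no `fixed.lambda` or `equal.proportions` handling. The helper names (`log_joint`, `log_sum_exp_rows`, `m_step`, `small_em_init`) are hypothetical.

```r
# Log of pi_k * f_k(y_i) for each observation i (rows) and cluster k (columns),
# under the simplified model y[i, j] ~ Poisson(lambda[j, k]).
log_joint <- function(y, pi, lambda) {
  ll <- y %*% log(lambda)                             # n x g
  ll <- sweep(ll, 2, log(pi) - colSums(lambda), "+")  # per-cluster terms
  ll - rowSums(lgamma(y + 1))                         # per-observation term
}

# Numerically stable log(sum(exp(row))) for each row of m.
log_sum_exp_rows <- function(m) {
  mx <- apply(m, 1, max)
  mx + log(rowSums(exp(m - mx)))
}

# M-step: update proportions and rates from posterior probabilities tau (n x g).
m_step <- function(y, tau) {
  lambda <- sweep(crossprod(y, tau), 2, colSums(tau), "/")  # q x g
  list(pi = colMeans(tau), lambda = pmax(lambda, 1e-8))     # avoid log(0)
}

small_em_init <- function(y, g, init.runs = 5, init.iter = 10) {
  storage.mode(y) <- "double"
  best <- NULL
  for (run in seq_len(init.runs)) {
    # Step 1: K-means partition, stored as an n x g indicator matrix
    labels <- kmeans(y, centers = g)$cluster
    tau <- diag(g)[labels, , drop = FALSE]
    # Step 2: initial parameter values from the partition
    est <- m_step(y, tau)
    # Step 3: a few EM iterations
    for (it in seq_len(init.iter)) {
      lj <- log_joint(y, est$pi, est$lambda)
      tau <- exp(lj - log_sum_exp_rows(lj))            # E-step
      est <- m_step(y, tau)                            # M-step
    }
    ll <- sum(log_sum_exp_rows(log_joint(y, est$pi, est$lambda)))
    # Keep the run with the highest log likelihood
    if (is.null(best) || ll > best$log.like) {
      best <- c(est, list(log.like = ll))
    }
  }
  best
}

# Toy data: 200 observations, q = 3 variables, 2 well-separated clusters
set.seed(1)
lambda_true <- cbind(c(2, 20, 5), c(15, 3, 10))
z <- sample(1:2, 200, replace = TRUE)
y_toy <- matrix(rpois(200 * 3, lambda = t(lambda_true[, z])), nrow = 200)
init <- small_em_init(y_toy, g = 2)
```

The returned list holds the proportions (`pi`), the rate matrix (`lambda`), and the log likelihood of the best run. For the CEM variant, the selection criterion would be the completed log likelihood instead.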

For the splitting small EM initialization strategy, we implement an approach similar to that described in Papastamoulis et al. (2014),
where the cluster from the previous run (with *g*-1 clusters) with the largest entropy is chosen to be split into two new clusters,
followed by a small EM run as described above.
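One way to picture the choice of the cluster to split is sketched below. It assumes that a cluster's entropy is the sum, over the observations assigned to it by maximum posterior probability, of the per-observation entropy of the posterior probabilities; the exact criterion used by `splitEMInit` may differ. The function name `choose_split` is hypothetical.

```r
# tau: n x (g - 1) matrix of posterior probabilities from the previous run
choose_split <- function(tau) {
  # Maximum a posteriori label for each observation
  labels <- max.col(tau, ties.method = "first")
  # Per-observation entropy, treating 0 * log(0) as 0
  h <- -rowSums(ifelse(tau > 0, tau * log(tau), 0))
  # Sum the entropies within each cluster (empty clusters get 0)
  per_cluster <- tapply(h, factor(labels, levels = seq_len(ncol(tau))), sum)
  per_cluster[is.na(per_cluster)] <- 0
  list(entropy = as.numeric(per_cluster),
       split = unname(which.max(per_cluster)))
}

# Cluster 1 is well separated; cluster 2 has ambiguous memberships
tau_prev <- rbind(c(0.99, 0.01),
                  c(0.98, 0.02),
                  c(0.40, 0.60),
                  c(0.45, 0.55))
res <- choose_split(tau_prev)
res$split  # cluster 2 has the larger entropy
```

The selected cluster would then be divided in two, and the resulting `g`-cluster configuration refined by a small EM run.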

`pi.init ` |
Vector of length |

`lambda.init ` |
( |

`lambda ` |
( |

`pi ` |
Vector of length |

`log.like ` |
Log likelihood arising from the splitting initialization and small EM run for a single split. |

Andrea Rau <andrea.rau@jouy.inra.fr>

Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. *Genome Biology*, **11**(R106), 1-28.

Biernacki, C., Celeux, G., Govaert, G. (2003) Choosing starting values for the EM algorithm for getting the highest likelihood in multivariate Gaussian mixture models. *Computational Statistics and Data Analysis*, **41**(1), 561-575.

MacQueen, J. B. (1967) Some methods for classification and analysis of multivariate observations. In *Proceedings of the 5th Berkeley Symposium on Mathematical Statistics and Probability*, number 1, pages 281-297. Berkeley, University of California Press.

Papastamoulis, P., Martin-Magniette, M.-L., and Maugis-Rabusseau, C. (2014). On the estimation of mixtures of Poisson regression models with large number of components. *Computational Statistics and Data Analysis*: 3rd special Issue on Advances in Mixture Models, DOI: 10.1016/j.csda.2014.07.005.

Rau, A., Celeux, G., Martin-Magniette, M.-L., Maugis-Rabusseau, C. (2011). Clustering high-throughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at http://hal.inria.fr/inria-00638082.

Rau, A., Maugis-Rabusseau, C., Martin-Magniette, M.-L., Celeux, G. (2015). Co-expression analysis of high-throughput transcriptome sequencing data with Poisson mixture models. *Bioinformatics*, **31**(9), 1420-1427.

Robinson, M. D. and Oshlack, A. (2010) A scaling normalization method for differential expression analysis of RNA-seq data. *Genome Biology*, **11**(R25).

`PoisMixClus` for Poisson mixture model estimation for a given number of clusters, `PoisMixClusWrapper` for Poisson mixture model estimation and model selection for a sequence of cluster numbers.

```
library(HTSCluster)  # provides PoisMixSim and emInit

set.seed(12345)

## Simulate data as shown in Rau et al. (2011)
## Library size setting "A", high cluster separation
## n = 500 observations
simulate <- PoisMixSim(n = 500, libsize = "A", separation = "high")
y <- simulate$y
conds <- simulate$conditions

## Calculate initial values for lambda and pi using the Small-EM
## initialization (4 classes, PMM-II model with "TC" library size)
##
## init.values <- emInit(y, g = 4, conds,
##                       norm = "TC", alg.type = "EM",
##                       init.runs = 50, init.iter = 10, fixed.lambda = NA,
##                       equal.proportions = FALSE, verbose = FALSE)
## pi.init <- init.values$pi.init
## lambda.init <- init.values$lambda.init
```
