Poisson mixture model estimation and model selection
Description
These functions implement the EM and CEM algorithms for parameter estimation in a Poisson mixture model for clustering high throughput sequencing observations (e.g., genes) for a single number of clusters (PoisMixClus
) or a sequence of cluster numbers (PoisMixClusWrapper
). Parameters are initialized using a SmallEM strategy as described in Rau et al. (2011) or the splitting smallEM strategy described in Papastamoulis et al. (2014), and model selection is performed using the ICL criteria. Note that these functions implement the PMMI and PMMII models described in Rau et al. (2011).
Usage
1 2 3 4 5 6 7 8 9 10 11 12 13  PoisMixClus(y, g, conds, norm = "TMM",
init.type = "smallem", init.runs = 1, init.iter = 10,
alg.type = "EM", cutoff = 10e6, iter = 1000, fixed.lambda = NA,
equal.proportions = FALSE, prev.labels = NA,
prev.probaPost = NA, verbose = FALSE, interpretation = "sum",
EM.verbose = FALSE, wrapper = FALSE, subset.index = NA)
PoisMixClusWrapper(y, gmin = 1, gmax, conds,
norm = "TMM", gmin.init.type = "smallem",
init.runs = 1, init.iter = 10, split.init = TRUE, alg.type = "EM",
cutoff = 10e6, iter = 1000, fixed.lambda = NA,
equal.proportions = FALSE, verbose = FALSE, interpretation = "sum",
EM.verbose = FALSE, subset.index = NA)

Arguments
y 
(n x q) matrix of observed counts for n observations and q variables 
g 
Number of clusters (a single value). If 
gmin 
The minimum number of clusters in a sequence to be tested. In cases where clusters are included with a fixed value
of lambda, 
gmax 
The maximum number of clusters in a sequence to be tested. In cases where clusters are included with a fixed value
of lambda, 
conds 
Vector of length q defining the condition (treatment group) for each variable (column) in 
norm 
The type of estimator to be used to normalize for differences in library size: (“ 
init.type 
Type of initialization strategy to be used (“ 
gmin.init.type 
Type of initialization strategy to be used for the minimum number of clusters in a sequence ( 
init.runs 
Number of runs to be used for the SmallEM strategy described in Rau et al. (2011), with a default value of 1 
init.iter 
Number of iterations to be used within each run for the SmallEM strategry, with a default value of 10 
split.init 
If 
alg.type 
Algorithm to be used for parameter estimation (“ 
cutoff 
Cutoff to declare algorithm convergence (in terms of differences in log likelihoods from one iteration to the next) 
iter 
Maximum number of iterations to be run for the chosen algorithm 
fixed.lambda 
If one (or more) clusters with fixed values of lambda is desired, a list containing vectors of length d (the number of conditions). specifying the fixed values of lambda for each fixed cluster. 
equal.proportions 
If 
prev.labels 
A vector of length n of cluster labels obtained from the previous run (g1 clusters) to be used with the splitting smallEM strategy described in described in Papastamoulis et al. (2014). For other initialization strategies, this parameter takes the value NA 
prev.probaPost 
An n x (g1) matrix of the conditional probabilities of each observation belonging to each of the g1 clusters from the previous run, to be used with the splitting smallEM strategy of described in Papastamoulis et al. (2012). For other initialization strategies, this parameter takes the value NA 
verbose 
If 
interpretation 
If 
EM.verbose 
If 
subset.index 
Optional vector providing the indices of a subset of genes that should be used for the coexpression analysis (i.e., row indices
of the data matrix 
wrapper 

Details
Output of PoisMixClus
is an S3 object of class HTSCluster
, and output of PoisMixClusWrapper
is an S3 object
of class HTSClusterWrapper
.
In a Poisson mixture model, the data y are assumed to come from g distinct subpopulations (clusters), each of which is modeled separately; the overall population is thus a mixture of these subpopulations. In the case of a Poisson mixture model with g components, the model may be written as
f(y;g,ψ_g) = ∏_{i=1}^n ∑_{k=1}^g π_k ∏_{j=1}^{d}∏_{l=1}^{r_j} P(y_{ijl} ; θ_k)
for i = 1, …, n observations in l = 1, …, r_j replicates of j = 1, …, d conditions (treatment groups), where P(\cdot) is the standard Poisson density, ψ_g = (π_1,…,π_{g1}, θ^\prime), θ^\prime contains all of the parameters in θ_1,…,θ_g assumed to be distinct, and π = (π_1,…,π_g)^\prime are the mixing proportions such that π_k is in (0,1) for all k and ∑_k π_k = 1.
We consider the following parameterization for the mean θ = (mu_{ijlk}). We consider
μ_{ijlk} = w_i s_{jl} λ_{jk}
where w_i corresponds to the expression level of observation i, λ_k = (λ_{1k},…,λ_{dk}) corresponds to the clustering parameters that define the profiles of the genes in cluster k across all variables, and s_{jl} is the normalized library size (a fixed constant) for replicate l of condition j.
There are two approaches to estimating the parameters of a finite mixture model and obtaining a clustering of the data: the estimation approach (via the EM algorithm) and the clustering approach (via the CEM algorithm). Parameter initialization is done using a SmallEM strategy as described in Rau et al. (2011) via the emInit
function. Model selection may be performed using the BIC or ICL criteria, or the slope heuristics.
Value
lambda 
(d x g) matrix containing the estimate of \hat{λ} 
pi 
Vector of length g containing the estimate of \hat{π} 
labels 
Vector of length n containing the cluster assignments of the n observations 
probaPost 
Matrix containing the conditional probabilities of belonging to each cluster for all observations 
log.like 
Value of log likelihood 
BIC 
Value of BIC criterion 
ICL 
Value of ICL criterion 
alg.type 
Estimation algorithm used; matches the argument 
norm 
Library size normalization factors used 
conds 
Conditions specified by user 
iterations 
Number of iterations run 
logLikeDiff 
Difference in loglikelihood between the last and penultimate iterations of the algorithm 
subset.index 
If provided by the user, the indices of subset of genes used for coexpression analyses 
loglike.all 
Log likelihoods calculated for each of the fitted models for cluster sizes 
capushe 
Results of capushe model selection, an object of class 
ICL.all 
ICL values calculated for each of the fitted models for cluster sizes 
ICL.results 
Object of class 
BIC.results 
Object of class 
DDSE.results 
Object of class 
Djump.results 
Object of class 
all.results 
List of objects of class 
model.selection 
Type of criteria used for model selection, equal to 
Note
Note that the fixed.lambda
argument is primarily intended to be used in the case when a single cluster is fixed to
have equal clustering parameters lambda across all conditions (i.e., λ_{j1}=λ_{1}=1); this is particularly useful
when identifying genes with nondifferential expression across all conditions (see the HTSDiff
R package for more details).
Alternatively, this argument could be used to specify a cluster for which genes are only expressed in a single condition
(e.g., λ_{11} = 1 and λ_{j1} = 0 for all j > 1). Other possibilities could be considered,
but note that the fixed values of lambda must satisfy the constraint ∑_j λ_{jk}s_{j.} = 1 for all k
imposed in the model; if this is not the case, a warning message will be printed.
Author(s)
Andrea Rau <andrea.rau@jouy.inra.fr>
References
Anders, S. and Huber, W. (2010) Differential expression analysis for sequence count data. Genome Biology, 11(R106), 128.
Papastamoulis, P., MartinMagniette, M.L., and MaugisRabusseau, C. (2014). On the estimation of mixtures of Poisson regression models with large number of components. Computational Statistics and Data Analysis: 3rd special Issue on Advances in Mixture Models, DOI: 10.1016/j.csda.2014.07.005.
Rau, A., MaugisRabusseau, C., MartinMagniette, M.L., Celeux G. (2015). Coexpression analysis of highthroughput transcriptome sequencing data with Poisson mixture models. Bioinformatics, 31(9):14201427.
Rau, A., Celeux, G., MartinMagniette, M.L., MaugisRabusseau, C (2011). Clustering highthroughput sequencing data with Poisson mixture models. Inria Research Report 7786. Available at http://hal.inria.fr/inria00638082.
See Also
probaPost
for the calculation of the conditional probability of belonging to a cluster;
PoisMixMean
for the calculation of the percluster conditional mean of each observation;
logLikePoisMixDiff
for the calculation of the log likelihood of a Poisson mixture model;
emInit
and kmeanInit
for the SmallEM parameter initialization strategy
Examples
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 36 37 38 39 40 41 42 43 44 45 46 47 48 49 50 51 52  set.seed(12345)
## Simulate data as shown in Rau et al. (2011)
## Library size setting "A", high cluster separation
## n = 200 observations
simulate < PoisMixSim(n = 200, libsize = "A", separation = "high")
y < simulate$y
conds < simulate$conditions
## Run the PMM model for g = 3
## "TC" library size estimate, EM algorithm
run < PoisMixClus(y, g = 3, conds = conds, norm = "TC")
## Estimates of pi and lambda for the selected model
pi.est < run$pi
lambda.est < run$lambda
## Not run: PMM for 4 total clusters, with one fixed class
## "TC" library size estimate, EM algorithm
##
## run < PoisMixClus(y, g = 3, norm = "TC", conds = conds,
## fixed.lambda = list(c(1,1,1)))
##
##
## Not run: PMM model for 4 clusters, with equal proportions
## "TC" library size estimate, EM algorithm
##
## run < PoisMixClus(y, g = 4, norm = "TC", conds = conds,
## equal.proportions = TRUE)
##
##
## Not run: PMM model for g = 1, ..., 10 clusters, Split SmallEM init
##
## run1.10 < PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds,
## norm = "TC")
##
##
## Not run: PMM model for g = 1, ..., 10 clusters, SmallEM init
##
## run1.10bis < < PoisMixClusWrapper(y, gmin = 1, gmax = 10, conds = conds,
## norm = "TC", split.init = FALSE)
##
##
## Not run: previous model equivalent to the following
##
## for(K in 1:10) {
## run < PoisMixClus(y, g = K, conds = conds, norm = "TC")
## }
