ClustMMDD-package: 'ClustMMDD' : Clustering by Mixture Models for Discrete Data.
In ClustMMDD: Variable Selection in Clustering by Mixture Models for Discrete Data


ClustMMDD stands for "Clustering by Mixture Models for Discrete Data". The package addresses the two-fold problem of variable selection and model-based unsupervised classification (clustering) for discrete data. Both problems are solved simultaneously through a model selection procedure based on penalized criteria: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Integrated Completed Likelihood (ICL), or a more general criterion whose penalty function is calibrated from the data.

Package:	ClustMMDD
Type:	Package
Version:	1.0.1
Date:	2015-05-18
License:	GPL (>= 2)

In this package, K denotes the number of clusters and S the subset of variables that are relevant for clustering. We assume that a clustering variable has different probability distributions in at least two clusters, whereas a non-clustering variable has the same distribution in all clusters. We consider a general setting in which the data are described by P random variables X^l, l=1,\cdots,P, where each variable X^l is an unordered set \{X^{l,1},\cdots,X^{l,ploidy}\} of ploidy categorical variables. For each l, the random variables X^{l,1},\cdots,X^{l,ploidy} take their values in the same set of levels. A typical example of such data comes from population genetics, where each genotype of a diploid individual consists of ploidy = 2 unordered alleles.
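To make the data layout concrete, here is a minimal sketch of how such data might be stored and how the "unordered" property can be handled. The toy matrix, the helper name `canonical`, and the assumption that the ploidy columns of a given variable sit next to each other are illustrative choices, not taken from the package.

```r
ploidy <- 2

# Hypothetical toy data: 3 individuals, P = 2 variables, 2 alleles per variable.
# Assumption: columns 1-2 hold variable 1 and columns 3-4 hold variable 2.
geno <- matrix(c("103", "101", "105", "105",
                 "101", "103", "102", "105",
                 "104", "104", "105", "102"),
               nrow = 3, byrow = TRUE)
P <- ncol(geno) / ploidy

# Canonical (sorted) form of variable l, so that allele order does not matter
canonical <- function(geno, l, ploidy) {
  cols <- ((l - 1) * ploidy + 1):(l * ploidy)
  apply(geno[, cols, drop = FALSE], 1,
        function(a) paste(sort(a), collapse = "/"))
}

canonical(geno, 1, ploidy)
```

With this representation, individuals 1 and 2 carry the same genotype at the first variable even though their alleles are listed in a different order.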

The two-fold problem of clustering and variable selection is treated as a model selection problem. A collection of competing models, indexed by the values of (K, S), is defined, and these models are compared using penalized criteria. The penalized criteria are of the form

crit(K,S) = γ_n(K,S) + pen(K,S),

where

γ_n(K,S) is minus the maximum log-likelihood of the model (K,S), so that smaller values of the criterion are better,
and pen(K,S) is the penalty function.

The penalty functions used in this package are the following, where dim(K,S) is the dimension (number of free parameters) of the model defined by (K,S):

Akaike Information Criterion (AIC):

pen(K,S) = dim(K,S)
Bayesian Information Criterion (BIC):

pen(K,S) = 0.5*\log(n)*dim(K,S),

where n is the number of observations.
Integrated Completed Likelihood (ICL):

pen(K,S) = 0.5*\log(n)*dim(K,S) + entropy(K,S),

where

entropy(K,S) = -∑_{i=1}^n ∑_{k=1}^K τ_{i,k} \log(τ_{i,k})

and τ_{i,k} = P(i \in \mathcal{C}_k) is the probability that observation i belongs to cluster \mathcal{C}_k.
More general penalty function:

pen(K,S) = α*λ*dim(K,S),

where
- λ is a multiplicative parameter to be calibrated,
- α is a coefficient in [1.5,2] chosen by the user.
To calibrate λ, we propose a data-driven procedure based on the dimension jump version of the so-called "slope heuristics" (see Dominique Bontemps and Wilson Toussile (2013) and references therein).
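As a rough illustration of how these penalties combine with the likelihood, the sketch below computes the criteria for a single model. It assumes the criterion is minus the log-likelihood plus the penalty (to be minimized), which is what the values in the example output of this page suggest. The function name `penalized_criteria` and its arguments are hypothetical and not part of the package.

```r
# Hypothetical helper: penalized criteria for one model (K, S).
# Assumption: crit = -logLik + pen, smaller is better.
penalized_criteria <- function(logLik, dim, entropy, n,
                               alpha = NULL, lambda = NULL) {
  bic_pen <- 0.5 * log(n) * dim
  crit <- c(AIC = -logLik + dim,
            BIC = -logLik + bic_pen,
            ICL = -logLik + bic_pen + entropy)
  # General penalty alpha * lambda * dim, only when both constants are given
  if (!is.null(alpha) && !is.null(lambda)) {
    crit <- c(crit, General = -logLik + alpha * lambda * dim)
  }
  crit
}

# Values copied from the row labelled BIC in the example output of this page
crit <- penalized_criteria(logLik = -37995.22, dim = 310,
                           entropy = 205.3947, n = 1000)
round(crit, 2)
```

Up to rounding of the printed log-likelihood, the BIC and ICL values obtained this way agree with the `criteria` column reported for that model (39065.93 and 39271.32).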

The maximum log-likelihood is approximated using the Expectation-Maximization (EM) algorithm. The maximum a posteriori (MAP) classification is then derived from the estimated parameters of the selected model.
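The following is a minimal sketch of EM for a mixture of independent categorical variables, restricted to the simplest setting: ploidy = 1 and every variable treated as a clustering variable. It is meant only to illustrate the E-step, the M-step and the MAP rule; the function `em_latent_class`, its arguments, the random initialization and the small smoothing constant are all assumptions, and the package's own implementation may differ.

```r
# Illustrative EM for a latent class model (not the package's code).
# x: integer matrix (n x P) with level codes 1..M.
em_latent_class <- function(x, K, n_iter = 50, seed = 1) {
  set.seed(seed)
  n <- nrow(x); P <- ncol(x); M <- max(x)
  # Random soft initialization of the posterior probabilities tau (n x K)
  tau <- matrix(runif(n * K), n, K)
  tau <- tau / rowSums(tau)
  for (iter in seq_len(n_iter)) {
    # M-step: mixing proportions and per-cluster level frequencies
    prop <- colMeans(tau)
    theta <- lapply(seq_len(P), function(l) {
      counts <- matrix(vapply(seq_len(M), function(m) {
        colSums(tau * (x[, l] == m))
      }, numeric(K)), nrow = K)
      counts <- counts + 1e-6          # small smoothing to avoid log(0)
      counts / rowSums(counts)         # K x M matrix of probabilities
    })
    # E-step: log joint density of each observation under each cluster
    log_joint <- matrix(log(prop), n, K, byrow = TRUE)
    for (l in seq_len(P)) {
      log_joint <- log_joint + t(log(theta[[l]][, x[, l], drop = FALSE]))
    }
    # Log-sum-exp over clusters for numerical stability
    row_max <- apply(log_joint, 1, max)
    log_mix <- row_max + log(rowSums(exp(log_joint - row_max)))
    tau <- exp(log_joint - log_mix)
  }
  list(logLik = sum(log_mix),
       tau = tau,
       entropy = -sum(tau * log(pmax(tau, 1e-300))),
       map = max.col(tau, ties.method = "first"))  # MAP classification
}

# Simulated toy data: two clusters, four variables with three levels each
set.seed(42)
n <- 200
z <- rep(1:2, each = n / 2)
x <- sapply(1:4, function(l) {
  ifelse(z == 1,
         sample(1:3, n, replace = TRUE, prob = c(0.8, 0.1, 0.1)),
         sample(1:3, n, replace = TRUE, prob = c(0.1, 0.1, 0.8)))
})
fit <- em_latent_class(x, K = 2)
table(true = z, map = fit$map)
```

The returned `tau` matrix holds the posterior membership probabilities used in the ICL entropy term, and `map` assigns each observation to its most probable cluster. Cluster labels are arbitrary, so the cross-table may show the two clusters swapped.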

Wilson Toussile

Maintainer: Wilson Toussile <wilson.toussile@gmail.com>

Dominique Bontemps and Wilson Toussile (2013): Clustering and variable selection for categorical multivariate data. Electronic Journal of Statistics, Volume 7, 2344-2371.
Wilson Toussile and Elisabeth Gassiat (2009): Variable selection in model-based clustering using multilocus genotype data. Advances in Data Analysis and Classification, Volume 3, Number 2, 109-134.

The main functions:

em.cluster.R: Computes an approximation of the maximum likelihood estimates of the parameters using the Expectation-Maximization (EM) algorithm, for a given value of (K,S). The maximum a posteriori classification is then derived.
backward.explorer: Gathers the most competitive models using a backward-stepwise strategy.
dimJump.R: Performs the data-driven calibration of the penalty function by estimating λ. Two candidate values are returned, and a plot is produced to help the user choose between them.
selectK.R: Selects the number K of clusters for a given subset of clustering variables.
model.selection.R: Performs model selection from a collection of competing models.
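For readers unfamiliar with the dimension jump idea behind the calibration step, the sketch below shows the generic principle on synthetic data: for each candidate slope, find the dimension of the model minimizing minus log-likelihood plus slope times dimension, then take the slope at which the selected dimension drops the most as the estimate of λ. This is a textbook-style illustration under stated assumptions; the function `dimension_jump`, the grid, and the synthetic model collection are hypothetical, and `dimJump.R` may proceed differently.

```r
# Illustrative dimension-jump calibration (not the package's code)
dimension_jump <- function(neg_loglik, dims, kappa_grid) {
  # For each candidate slope kappa, record the dimension of the model
  # minimizing neg_loglik + kappa * dims
  selected_dim <- vapply(kappa_grid, function(kappa) {
    as.numeric(dims[which.min(neg_loglik + kappa * dims)])
  }, numeric(1))
  # Drop in selected dimension between consecutive grid points
  drops <- -diff(selected_dim)
  j <- which.max(drops)
  list(lambda_hat = kappa_grid[j + 1], selected_dim = selected_dim)
}

# Synthetic collection: for large models, minus the log-likelihood
# decreases linearly in the dimension with slope 1
dims <- seq(10, 400, by = 10)
neg_loglik <- 40000 + 5000 * exp(-dims / 40) - 1.0 * dims

out <- dimension_jump(neg_loglik, dims, kappa_grid = seq(0, 3, by = 0.05))
out$lambda_hat

# Final selection with pen = alpha * lambda * dim, alpha chosen in [1.5, 2]
alpha <- 2
best <- which.min(neg_loglik + alpha * out$lambda_hat * dims)
dims[best]
```

On this synthetic collection the largest jump occurs just above the true slope of 1, so the estimated λ lands close to it; the final penalty then multiplies this value by the user-chosen α.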

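library(ClustMMDD)

# Example genotype data and the associated collection of explored models
data(genotype2)
head(genotype2)
data(genotype2_ExploredModels)
head(genotype2_ExploredModels)

# Calibration of the penalty function
outDimJump <- dimJump.R(genotype2_ExploredModels, N = 1000, h = 5, header = TRUE)
# Use the first value returned by the calibration
cte1 <- outDimJump[[1]][1]
# Model selection with the calibrated constant
outSelection <- model.selection.R(genotype2_ExploredModels, cte = cte1, header = TRUE)
outSelection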

Loading required package: Rcpp

 ClustMMDD = Clustering by Mixture Models for Discrete Data.
  
 Version 1.0.4
  
 ClustMMDD is the R version of the stand alone c++ package named 'MixMoGenD'
  
   that is available on www.u-psud.fr/math/~toussile.

 initializing ... Loaded 

     [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10] [,11] [,12]
[1,] "109" "103" "108" "105" "107" "107" "109" "110" "107" "101" "110" "105"
[2,] "107" "106" "105" "105" "103" "104" "108" "108" "104" "105" "104" "104"
[3,] "105" "103" "101" "108" "110" "108" "106" "103" "101" "106" "107" "103"
[4,] "101" "107" "107" "107" "108" "101" "102" "105" "107" "110" "110" "101"
[5,] "106" "107" "110" "105" "103" "102" "109" "101" "103" "103" "109" "101"
[6,] "106" "109" "108" "103" "102" "106" "105" "109" "104" "107" "103" "105"
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,] "101" "101" "102" "110" "109" "102" "105" "105"
[2,] "104" "105" "107" "105" "109" "107" "101" "108"
[3,] "105" "106" "108" "103" "103" "109" "105" "109"
[4,] "107" "109" "103" "110" "108" "105" "108" "105"
[5,] "110" "110" "101" "109" "102" "104" "103" "103"
[6,] "109" "101" "110" "107" "105" "104" "103" "110"
     N  P K S1 S2 S3 S4 S5 S6 S7 S8 S9 S10    logLik dim   entropy
1 1000 10 1  0  0  0  0  0  0  0  0  0   0 -39277.97  90   0.00000
2 1000 10 2  1  1  1  1  1  1  1  1  1   1 -38896.99 181  93.50989
3 1000 10 2  0  1  1  1  1  1  1  1  1   1 -38993.02 172 123.88414
4 1000 10 2  1  0  1  1  1  1  1  1  1   1 -38988.61 172 226.60695
5 1000 10 2  1  1  0  1  1  1  1  1  1   1 -38951.55 172 119.54232
6 1000 10 2  1  1  1  0  1  1  1  1  1   1 -38988.33 172 149.27192
          N  P K S1 S2 S3 S4 S5 S6 S7 S8 S9 S10    logLik dim  entropy criteria
BIC    1000 10 5  1  1  1  1  1  1  0  0  0   0 -37995.22 310 205.3947 39065.93
AIC    1000 10 5  1  1  1  1  1  1  1  1  0   0 -37843.71 382 179.5732 38225.71
ICL    1000 10 5  1  1  1  1  1  1  0  0  0   0 -37995.22 310 205.3947 39271.32
CteDim 1000 10 5  1  1  1  1  1  1  1  1  0   0 -37843.71 382 179.5732 38474.01
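The `dim` column in the output above can be reproduced with a simple parameter count. The sketch below assumes K - 1 free mixing proportions, one set of (number of levels - 1) free frequencies per cluster for each clustering variable, and a single shared set for each non-clustering variable. It also assumes 10 levels per variable, as the allele codes 101 to 110 in the printed data suggest. The function `model_dim` is hypothetical and this formula is inferred from the output, not taken from the package source.

```r
# Assumed parameter count (inferred from the example output):
# (K - 1) mixing proportions
# + K * (levels - 1) frequencies for each clustering variable
# + (levels - 1) shared frequencies for each non-clustering variable
model_dim <- function(K, S, n_levels) {
  free <- n_levels - 1
  (K - 1) + K * sum(free[S == 1]) + sum(free[S == 0])
}

n_levels <- rep(10, 10)  # assumption: 10 levels for each of the P = 10 variables

model_dim(1, rep(0, 10), n_levels)               # row 1 of the explored models
model_dim(2, rep(1, 10), n_levels)               # row 2
model_dim(2, c(0, rep(1, 9)), n_levels)          # row 3
model_dim(5, c(rep(1, 6), rep(0, 4)), n_levels)  # model selected by BIC and ICL
model_dim(5, c(rep(1, 8), rep(0, 2)), n_levels)  # model selected by AIC and CteDim
```

These calls return 90, 181, 172, 310 and 382, matching the `dim` values printed for the corresponding models.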