# ClustMMDD-package: Clustering by Mixture Models for Discrete Data

## Description

ClustMMDD stands for "Clustering by Mixture Models for Discrete Data". The package addresses the two-fold problem of variable selection and model-based unsupervised classification (clustering) for discrete data. Variable selection and clustering are solved simultaneously via a model selection procedure using penalized criteria: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Integrated Completed Likelihood (ICL), or a more general criterion whose penalty function is calibrated from the data.

## Details

• Package: ClustMMDD

• Type: Package

• Version: 1.0.1

• Date: 2015-05-18

• License: GPL (>= 2)

In this package, K denotes the number of clusters and S the subset of variables that are relevant for clustering. We assume that a clustering variable has different probability distributions in at least two clusters, whereas a non-clustering variable has the same distribution in all clusters. We consider a general setting where the data are described by P random variables X^l, l = 1, …, P, and each variable X^l is an unordered set {X^{l,1}, …, X^{l,ploidy}} of ploidy categorical variables. For each l, the random variables X^{l,1}, …, X^{l,ploidy} take their values in the same set of levels. A typical example of such data comes from population genetics, where the genotype of a diploid individual at each locus consists of ploidy = 2 unordered alleles.
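As an illustration of this data layout, a genotype-like data set can be sketched as a character matrix with ploidy = 2 columns per variable. The locus names and allele labels below are made up for illustration and are not part of the package:

```r
# Hypothetical sketch of genotype-like data: P = 3 loci, ploidy = 2,
# so each individual (row) carries 2 * P = 6 allele columns, unordered
# within each pair.
set.seed(1)
alleles <- paste0("10", 1:9)  # made-up allele labels "101" .. "109"
geno <- matrix(sample(alleles, 5 * 6, replace = TRUE), nrow = 5, ncol = 6)
colnames(geno) <- paste0("L", rep(1:3, each = 2), c(".a", ".b"))
geno
```

The genotype2 data set used in the Examples section below has the same shape: 10 loci stored as 20 allele columns.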

The two-fold problem of clustering and variable selection is treated as a model selection problem: a collection of competing models indexed by the values of (K, S) is defined, and these models are compared using penalized criteria of the form

crit(K, S) = γ_n(K, S) + pen(K, S),

where

• γ_n(K, S) is minus the maximum log-likelihood (so that the criteria are to be minimized),

• and pen(K, S) is the penalty function.

The penalty functions used in this package are the following, where dim(K, S) is the dimension (number of free parameters) of the model defined by (K, S):

• Akaike Information Criterion (AIC):

pen(K, S) = dim(K, S)

• Bayesian Information Criterion (BIC):

pen(K, S) = 0.5 × log(n) × dim(K, S)

• Integrated Completed Likelihood (ICL):

pen(K, S) = 0.5 × log(n) × dim(K, S) + entropy(K, S),

where

entropy(K, S) = -∑_{i=1}^{N} ∑_{k=1}^{K} τ_{i,k} × log(τ_{i,k})

and τ_{i,k} = P(i ∈ C_k) is the posterior probability that individual i belongs to cluster C_k.

• More general penalty function:

pen(K, S) = α × λ × dim(K, S),

where

• λ is a multiplicative parameter to be calibrated from the data,

• α is a coefficient in [1.5, 2] to be given by the user.
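As a quick sanity check of these formulas, the BIC and ICL values reported in the example output at the end of this page can be reproduced from the selected model's log-likelihood, dimension, and entropy, using crit(K, S) = -logLik + pen(K, S) (the values below are copied from that output):

```r
# Recompute BIC and ICL for the model selected in the example output:
# logLik = -37995.22, dim = 310, entropy = 205.3947, n = 1000 individuals.
logLik  <- -37995.22
dim_KS  <- 310
entropy <- 205.3947
n       <- 1000

bic <- -logLik + 0.5 * log(n) * dim_KS            # ~ 39065.9, BIC row
icl <- -logLik + 0.5 * log(n) * dim_KS + entropy  # ~ 39271.3, ICL row
c(BIC = bic, ICL = icl)
```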

For the calibration of λ, we propose a data-driven procedure based on the dimension jump version of the so-called "slope heuristics" (see Dominique Bontemps and Wilson Toussile (2013) and the references therein).

The maximum log-likelihood is approximated via the Expectation-Maximization (EM) algorithm, and the maximum a posteriori (MAP) classification is derived from the estimated parameters of the selected model.
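A minimal sketch of the MAP rule, assuming a hypothetical posterior membership matrix `tau` with `tau[i, k]` = P(i ∈ C_k) (the numbers are made up; the package computes these probabilities internally):

```r
# Hypothetical posterior probabilities for 3 individuals and K = 2 clusters.
tau <- matrix(c(0.9, 0.1,
                0.2, 0.8,
                0.6, 0.4), nrow = 3, byrow = TRUE)

map_class <- max.col(tau)          # MAP: assign each row to its most probable cluster
entropy   <- -sum(tau * log(tau))  # entropy term appearing in the ICL penalty
map_class  # 1 2 1
```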

## Author(s)

Wilson Toussile

Maintainer: Wilson Toussile <wilson.toussile@gmail.com>

## Main functions

The main functions are:

em.cluster.R

Compute an approximation of the maximum likelihood estimates of the parameters via the EM algorithm for a given value of (K, S), then derive the maximum a posteriori classification.

backward.explorer

Gather the most competitive models using a backward-stepwise strategy.

dimJump.R

Perform the data-driven calibration of the penalty function via an estimation of λ. Two candidate values of λ are proposed, together with a graphic to help the user choose between them.

selectK.R

Perform the selection of the number K of clusters for a given subset of clustering variables.

model.selection.R

Perform a model selection from a collection of competing models.

## Examples

```r
data(genotype2)
head(genotype2)

data(genotype2_ExploredModels)
head(genotype2_ExploredModels)

# Calibration of the penalty function
outDimJump = dimJump.R(genotype2_ExploredModels, N = 1000, h = 5, header = TRUE)
cte1 = outDimJump[[1]][1]
outSelection = model.selection.R(genotype2_ExploredModels, cte = cte1, header = TRUE)
outSelection
```

### Example output

```
Loading required package: Rcpp

ClustMMDD = Clustering by Mixture Models for Discrete Data.
Version 1.0.4
ClustMMDD is the R version of the stand alone c++ package named 'MixMoGenD'
that is available on www.u-psud.fr/math/~toussile.

     [,1]  [,2]  [,3]  [,4]  [,5]  [,6]  [,7]  [,8]  [,9]  [,10] [,11] [,12]
[1,] "109" "103" "108" "105" "107" "107" "109" "110" "107" "101" "110" "105"
[2,] "107" "106" "105" "105" "103" "104" "108" "108" "104" "105" "104" "104"
[3,] "105" "103" "101" "108" "110" "108" "106" "103" "101" "106" "107" "103"
[4,] "101" "107" "107" "107" "108" "101" "102" "105" "107" "110" "110" "101"
[5,] "106" "107" "110" "105" "103" "102" "109" "101" "103" "103" "109" "101"
[6,] "106" "109" "108" "103" "102" "106" "105" "109" "104" "107" "103" "105"
     [,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,] "101" "101" "102" "110" "109" "102" "105" "105"
[2,] "104" "105" "107" "105" "109" "107" "101" "108"
[3,] "105" "106" "108" "103" "103" "109" "105" "109"
[4,] "107" "109" "103" "110" "108" "105" "108" "105"
[5,] "110" "110" "101" "109" "102" "104" "103" "103"
[6,] "109" "101" "110" "107" "105" "104" "103" "110"

     N  P K S1 S2 S3 S4 S5 S6 S7 S8 S9 S10    logLik dim   entropy
1 1000 10 1  0  0  0  0  0  0  0  0  0   0 -39277.97  90   0.00000
2 1000 10 2  1  1  1  1  1  1  1  1  1   1 -38896.99 181  93.50989
3 1000 10 2  0  1  1  1  1  1  1  1  1   1 -38993.02 172 123.88414
4 1000 10 2  1  0  1  1  1  1  1  1  1   1 -38988.61 172 226.60695
5 1000 10 2  1  1  0  1  1  1  1  1  1   1 -38951.55 172 119.54232
6 1000 10 2  1  1  1  0  1  1  1  1  1   1 -38988.33 172 149.27192

          N  P K S1 S2 S3 S4 S5 S6 S7 S8 S9 S10    logLik dim  entropy criteria
BIC    1000 10 5  1  1  1  1  1  1  0  0  0   0 -37995.22 310 205.3947 39065.93
AIC    1000 10 5  1  1  1  1  1  1  1  1  0   0 -37843.71 382 179.5732 38225.71
ICL    1000 10 5  1  1  1  1  1  1  0  0  0   0 -37995.22 310 205.3947 39271.32
CteDim 1000 10 5  1  1  1  1  1  1  1  1  0   0 -37843.71 382 179.5732 38474.01
```


ClustMMDD documentation built on May 2, 2019, 2:44 p.m.