ClustMMDD stands for "Clustering by Mixture Models for Discrete Data". This package deals with the twofold problem of variable selection and model-based unsupervised classification in discrete settings. Variable selection and classification are solved simultaneously via a model selection procedure using penalized criteria: Akaike Information Criterion (AIC), Bayesian Information Criterion (BIC), Integrated Completed Log-likelihood (ICL), or a general criterion whose penalty function is calibrated in a data-driven way.
Package:  ClustMMDD 
Type:  Package 
Version:  1.0.1 
Date:  2015-05-18 
License:  GPL (>= 2) 
In this package, K and S denote respectively the number of clusters and the subset of variables that are relevant for clustering. We assume that a clustering variable has different probability distributions in at least two clusters, whereas a non-clustering variable has the same distribution in all clusters. We consider a general situation in which the data are described by P random variables X^l, l = 1, ..., P, where each variable X^l is an unordered set {X^{l,1}, ..., X^{l,ploidy}} of ploidy categorical variables. For each l, the random variables X^{l,1}, ..., X^{l,ploidy} take their values in the same set of levels. A typical example of such data comes from population genetics, where the genotype of a diploid individual at each locus consists of ploidy = 2 unordered alleles.
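For illustration only, the following snippet builds a toy dataset in the layout the package appears to expect (matching the bundled genotype2 dataset shown in the Examples): each of the P variables occupies ploidy adjacent columns, and the order of the values within a variable carries no information. The object name toy is hypothetical.

# Toy dataset: P = 3 variables (e.g., loci), each stored in ploidy = 2
# adjacent columns of categorical levels ("101", ..., "109"); the order
# of the ploidy values within a variable is irrelevant.
set.seed(1)
n <- 5; P <- 3; ploidy <- 2
toy <- matrix(sample(paste0("10", 1:9), n * P * ploidy, replace = TRUE),
              nrow = n, ncol = P * ploidy)
toy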
The twofold problem of clustering and variable selection is treated as a model selection problem. A collection of competing models associated with the different values of (K, S) is defined, and these models are compared using penalized criteria of the form
crit(K, S) = γ_n(K, S) + pen(K, S),
where γ_n(K, S) is the maximum log-likelihood and pen(K, S) is the penalty function.
The penalty functions used in this package are the following, where dim(K, S) is the dimension (number of free parameters) of the model defined by (K, S):

Akaike Information Criterion (AIC):
pen(K, S) = dim(K, S)

Bayesian Information Criterion (BIC):
pen(K, S) = 0.5 * log(n) * dim(K, S)

Integrated Complete Likelihood (ICL):
pen(K, S) = 0.5 * log(n) * dim(K, S) + entropy(K, S),
where entropy(K, S) = -∑_{i=1}^{n} ∑_{k=1}^{K} τ_{i,k} log(τ_{i,k}), and τ_{i,k} = P(i ∈ C_k) is the posterior probability that individual i belongs to cluster k.
More general penalty function:
pen(K, S) = α * λ * dim(K, S),
where λ is a multiplicative parameter to be calibrated and α is a coefficient in [1.5, 2] chosen by the user (a sketch of all four criteria is given below).
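As a hedged illustration (penalized.criteria below is not a package function), the four criteria can be computed from a model's negative maximized log-likelihood, dimension and entropy. The sign convention assumed here matches the logLik column of the explored-models tables in the Examples, so smaller criterion values are better.

# Sketch of the four penalized criteria; negLogLik is assumed to hold the
# negative maximized log-likelihood, dim the number of free parameters,
# entropy the ICL entropy term and n the sample size.
penalized.criteria <- function(negLogLik, dim, entropy, n, lambda, alpha = 2) {
  c(AIC    = negLogLik + dim,
    BIC    = negLogLik + 0.5 * log(n) * dim,
    ICL    = negLogLik + 0.5 * log(n) * dim + entropy,
    CteDim = negLogLik + alpha * lambda * dim)
}
# With the values of the BIC-selected model in the Examples output
# (n = 1000, dim = 310), this reproduces its BIC and ICL values;
# lambda = 0.5 is an arbitrary placeholder.
penalized.criteria(37995.22, 310, 205.3947, n = 1000, lambda = 0.5)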
We propose a data-driven calibration procedure based on the dimension-jump version of the so-called "slope heuristics" (see Dominique Bontemps and Wilson Toussile (2013) and the references therein).
The maximum log-likelihood is estimated via the Expectation-Maximization (EM) algorithm, and the maximum a posteriori (MAP) classification is derived from the estimated parameters of the selected model.
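For intuition only, given a hypothetical n x K matrix tau of the posterior membership probabilities τ_{i,k}, the MAP classification and the ICL entropy term could be computed as follows (these helpers are illustrative, not package functions):

# tau: hypothetical n x K matrix of posterior membership probabilities.
map.classification <- function(tau) max.col(tau)  # most probable cluster per row
icl.entropy <- function(tau)                      # -sum(tau * log(tau)), with 0 * log(0) = 0
  -sum(ifelse(tau > 0, tau * log(tau), 0))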
Author: Wilson Toussile
Maintainer: Wilson Toussile <wilson.toussile@gmail.com>
Dominique Bontemps and Wilson Toussile (2013): Clustering and variable selection for categorical multivariate data. Electronic Journal of Statistics, Volume 7, 2344-2371.
Wilson Toussile and Elisabeth Gassiat (2009): Variable selection in model-based clustering using multilocus genotype data. Advances in Data Analysis and Classification, Volume 3, Number 2, 109-134.
The main functions:

em.cluster.R
Computes an approximation of the maximum likelihood estimates of the parameters using the EM algorithm, for a given value of (K, S). The maximum a posteriori classification is then derived.

backward.explorer
Gathers the most competitive models using a backward stepwise exploration strategy.

dimJump.R
Performs the data-driven calibration of the penalty function via an estimation of λ. Two values are proposed, and a graphic is displayed to help the user make a choice (a generic sketch of the underlying dimension-jump heuristic is given after this list).

selectK.R
Performs the selection of the number K of clusters for a given subset of clustering variables.

model.selection.R
Performs a model selection from a collection of competing models.
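The following is only a generic sketch of the dimension-jump heuristic underlying dimJump.R, not the package's implementation. For each candidate λ, it records the dimension of the model minimizing γ_n + λ * dim; the calibrated λ is located where the selected dimension drops most sharply, and the final penalty is then α * λ * dim(K, S) with α in [1.5, 2] chosen by the user.

# Generic dimension-jump sketch (illustrative; dim.jump.lambda is not a
# package function). negLogLik and dims are vectors over explored models.
dim.jump.lambda <- function(negLogLik, dims, lambdas = seq(0.01, 10, by = 0.01)) {
  selected.dim <- sapply(lambdas,
                         function(l) dims[which.min(negLogLik + l * dims)])
  lambdas[which.max(abs(diff(selected.dim))) + 1]  # location of the largest jump
}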
data(genotype2)
head(genotype2)
data(genotype2_ExploredModels)
head(genotype2_ExploredModels)

# Calibration of the penalty function
outDimJump = dimJump.R(genotype2_ExploredModels, N = 1000, h = 5, header = TRUE)
cte1 = outDimJump[[1]][1]

# Model selection with the calibrated constant
outSelection = model.selection.R(genotype2_ExploredModels, cte = cte1, header = TRUE)
outSelection

Loading required package: Rcpp
ClustMMDD = Clustering by Mixture Models for Discrete Data.
Version 1.0.4
ClustMMDD is the R version of the stand alone c++ package named 'MixMoGenD'
that is available on www.u-psud.fr/math/~toussile.
initializing ... Loaded
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "109" "103" "108" "105" "107" "107" "109" "110" "107" "101" "110" "105"
[2,] "107" "106" "105" "105" "103" "104" "108" "108" "104" "105" "104" "104"
[3,] "105" "103" "101" "108" "110" "108" "106" "103" "101" "106" "107" "103"
[4,] "101" "107" "107" "107" "108" "101" "102" "105" "107" "110" "110" "101"
[5,] "106" "107" "110" "105" "103" "102" "109" "101" "103" "103" "109" "101"
[6,] "106" "109" "108" "103" "102" "106" "105" "109" "104" "107" "103" "105"
[,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,] "101" "101" "102" "110" "109" "102" "105" "105"
[2,] "104" "105" "107" "105" "109" "107" "101" "108"
[3,] "105" "106" "108" "103" "103" "109" "105" "109"
[4,] "107" "109" "103" "110" "108" "105" "108" "105"
[5,] "110" "110" "101" "109" "102" "104" "103" "103"
[6,] "109" "101" "110" "107" "105" "104" "103" "110"
N P K S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 logLik dim entropy
1 1000 10 1 0 0 0 0 0 0 0 0 0 0 39277.97 90 0.00000
2 1000 10 2 1 1 1 1 1 1 1 1 1 1 38896.99 181 93.50989
3 1000 10 2 0 1 1 1 1 1 1 1 1 1 38993.02 172 123.88414
4 1000 10 2 1 0 1 1 1 1 1 1 1 1 38988.61 172 226.60695
5 1000 10 2 1 1 0 1 1 1 1 1 1 1 38951.55 172 119.54232
6 1000 10 2 1 1 1 0 1 1 1 1 1 1 38988.33 172 149.27192
N P K S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 logLik dim entropy criteria
BIC 1000 10 5 1 1 1 1 1 1 0 0 0 0 37995.22 310 205.3947 39065.93
AIC 1000 10 5 1 1 1 1 1 1 1 1 0 0 37843.71 382 179.5732 38225.71
ICL 1000 10 5 1 1 1 1 1 1 0 0 0 0 37995.22 310 205.3947 39271.32
CteDim 1000 10 5 1 1 1 1 1 1 1 1 0 0 37843.71 382 179.5732 38474.01