Description

ClustMMDD stands for "Clustering by Mixture Models for Discrete Data". This package addresses the two-fold problem of variable selection and model-based unsupervised classification in discrete settings. Variable selection and classification are solved simultaneously via a model selection procedure using penalized criteria: the Akaike Information Criterion (AIC), the Bayesian Information Criterion (BIC), the Integrated Completed Likelihood (ICL), or a more general criterion whose penalty function is calibrated in a data-driven way.
Details

Package:  ClustMMDD
Type:     Package
Version:  1.0.1
Date:     2015-05-18
License:  GPL (>= 2)
In this package, K and S denote respectively the number of clusters and the subset of variables that are relevant for clustering. We assume that a clustering variable has different probability distributions in at least two clusters, while a non-clustering variable has the same distribution in all clusters. We consider a general situation where the data are described by P random variables X^l, l = 1, \cdots, P, each variable X^l being an unordered set \left\{X^{l,1}, \cdots, X^{l,ploidy}\right\} of ploidy categorical variables. For each l, the random variables X^{l,1}, \cdots, X^{l,ploidy} take their values in the same set of levels. A typical example of such data comes from population genetics, where the genotype of a diploid individual at a given locus is an unordered pair (ploidy = 2) of alleles.
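For illustration, here is a minimal sketch of this data layout in base R (toy allele codes, not package data): with ploidy = 2 and P = 3 loci, each individual occupies 2 * P = 6 columns, two adjacent unordered columns per locus, exactly as in the genotype2 example data shown in the Examples section.

# Toy genotype matrix: 2 individuals, P = 3 loci, ploidy = 2.
# Columns (2l - 1, 2l) hold the unordered pair of alleles at locus l.
x <- matrix(c("101", "103", "105", "105", "102", "110",
              "103", "101", "108", "105", "110", "102"),
            nrow = 2, byrow = TRUE)
colnames(x) <- paste0("Loc", rep(1:3, each = 2), c("a", "b"))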
The two-fold problem of clustering and variable selection is treated as a model selection problem: a specific collection of competing models, associated with different values of (K, S), is defined, and these models are compared using penalized criteria of the form

crit(K, S) = \gamma_n(K, S) + pen(K, S),

where \gamma_n(K, S) is minus the maximum log-likelihood, pen(K, S) is the penalty function, and the selected model minimizes the criterion.
The penalty functions used in this package are the following, where dim(K, S) is the dimension (number of free parameters) of the model defined by (K, S). The computation of the resulting criteria is illustrated in the sketch after this list.

Akaike Information Criterion (AIC):

pen(K, S) = dim(K, S)

Bayesian Information Criterion (BIC):

pen(K, S) = 0.5 * \log(n) * dim(K, S)

Integrated Completed Likelihood (ICL):

pen(K, S) = 0.5 * \log(n) * dim(K, S) + entropy(K, S),

where

entropy(K, S) = - \sum_{i=1}^{n} \sum_{k=1}^{K} \tau_{i,k} \log(\tau_{i,k})

and \tau_{i,k} = P(i \in \mathcal{C}_k) is the posterior probability that individual i belongs to cluster \mathcal{C}_k.

More general penalty function:

pen(K, S) = \alpha * \lambda * dim(K, S),

where \lambda is a multiplicative parameter to be calibrated and \alpha is a user-supplied coefficient in [1.5, 2].
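For concreteness, here is a minimal base-R sketch computing the three classical criteria for a single model, using the values printed for the BIC-selected model in the Examples section below (n = 1000, logLik = -37995.22, dim = 310, entropy = 205.3947); the results match the printed criteria up to rounding.

n <- 1000
logLik <- -37995.22; dimKS <- 310; entropyKS <- 205.3947
aic <- -logLik + dimKS                 # AIC criterion: -logLik + dim
bic <- -logLik + 0.5 * log(n) * dimKS  # BIC criterion: 39065.93
icl <- bic + entropyKS                 # ICL criterion: 39271.32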
We propose a data-driven calibration procedure based on the dimension-jump version of the so-called "slope heuristics" (see Dominique Bontemps and Wilson Toussile (2013) and the references therein), sketched below.
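The dimension-jump idea can be sketched in a few lines of base R, assuming a data frame models with columns logLik and dim as in genotype2_ExploredModels (this illustrates the heuristic only, not the actual internals of dimJump.R):

# For each candidate lambda, record the dimension of the model minimizing
# -logLik + lambda * dim; lambda_hat is taken just after the largest drop
# ("jump") in this selected dimension.
dim_selected <- function(lambda, models)
  models$dim[which.min(-models$logLik + lambda * models$dim)]
lambdas <- seq(0.1, 20, by = 0.1)
dims <- sapply(lambdas, dim_selected, models = models)
lambda_hat <- lambdas[which.max(-diff(dims)) + 1]
# Final penalty: pen(K, S) = alpha * lambda_hat * dim(K, S), alpha in [1.5, 2].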
The maximum log-likelihood is estimated via the Expectation-Maximization (EM) algorithm, and the maximum a posteriori (MAP) classification is derived from the estimated parameters of the selected model.
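The MAP rule itself is a one-liner; a sketch, assuming tau is the n x K matrix of posterior membership probabilities \tau_{i,k} estimated by EM:

# Assign each individual to the cluster with the highest posterior probability.
map_class <- apply(tau, 1, which.max)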
Author(s)

Wilson Toussile
Maintainer: Wilson Toussile <wilson.toussile@gmail.com>
References

Dominique Bontemps and Wilson Toussile (2013). Clustering and variable selection for categorical multivariate data. Electronic Journal of Statistics, 7, 2344-2371.

Wilson Toussile and Elisabeth Gassiat (2009). Variable selection in model-based clustering using multilocus genotype data. Advances in Data Analysis and Classification, 3(2), 109-134.
See Also

The main functions:
em.cluster.R
Compute an approximation of the maximum likelihood estimates of the parameters via the Expectation-Maximization (EM) algorithm, for a given value of (K, S), then derive the maximum a posteriori classification.
backward.explorer
Gather the most competitive models using a backward-stepwise strategy.
dimJump.R
Perform the data-driven calibration of the penalty function via an estimation of \lambda. Two values of \lambda are proposed, and a plot is displayed to help the user choose between them.
selectK.R
Perform the selection of the number K of clusters, for a given subset S of clustering variables.
model.selection.R
Perform model selection within a collection of competing models.
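For instance, a hypothetical call to em.cluster.R on the example data (the argument names K, S, and ploidy below are assumptions for illustration; see ?em.cluster.R for the exact signature):

data(genotype2)
# Hypothetical call: fit K = 2 clusters with all 10 loci treated as
# clustering variables; argument names are assumed, not guaranteed.
fit <- em.cluster.R(genotype2, K = 2, S = rep(TRUE, 10), ploidy = 2)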
Examples

data(genotype2)
head(genotype2)

data(genotype2_ExploredModels)
head(genotype2_ExploredModels)

# Calibration of the penalty function
outDimJump = dimJump.R(genotype2_ExploredModels, N = 1000, h = 5, header = TRUE)
cte1 = outDimJump[[1]][1]
outSelection = model.selection.R(genotype2_ExploredModels, cte = cte1, header = TRUE)
outSelection
Loading required package: Rcpp
ClustMMDD = Clustering by Mixture Models for Discrete Data.
Version 1.0.4
ClustMMDD is the R version of the stand alone c++ package named 'MixMoGenD'
that is available on www.u-psud.fr/math/~toussile.
initializing ... Loaded
[,1] [,2] [,3] [,4] [,5] [,6] [,7] [,8] [,9] [,10] [,11] [,12]
[1,] "109" "103" "108" "105" "107" "107" "109" "110" "107" "101" "110" "105"
[2,] "107" "106" "105" "105" "103" "104" "108" "108" "104" "105" "104" "104"
[3,] "105" "103" "101" "108" "110" "108" "106" "103" "101" "106" "107" "103"
[4,] "101" "107" "107" "107" "108" "101" "102" "105" "107" "110" "110" "101"
[5,] "106" "107" "110" "105" "103" "102" "109" "101" "103" "103" "109" "101"
[6,] "106" "109" "108" "103" "102" "106" "105" "109" "104" "107" "103" "105"
[,13] [,14] [,15] [,16] [,17] [,18] [,19] [,20]
[1,] "101" "101" "102" "110" "109" "102" "105" "105"
[2,] "104" "105" "107" "105" "109" "107" "101" "108"
[3,] "105" "106" "108" "103" "103" "109" "105" "109"
[4,] "107" "109" "103" "110" "108" "105" "108" "105"
[5,] "110" "110" "101" "109" "102" "104" "103" "103"
[6,] "109" "101" "110" "107" "105" "104" "103" "110"
N P K S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 logLik dim entropy
1 1000 10 1 0 0 0 0 0 0 0 0 0 0 -39277.97 90 0.00000
2 1000 10 2 1 1 1 1 1 1 1 1 1 1 -38896.99 181 93.50989
3 1000 10 2 0 1 1 1 1 1 1 1 1 1 -38993.02 172 123.88414
4 1000 10 2 1 0 1 1 1 1 1 1 1 1 -38988.61 172 226.60695
5 1000 10 2 1 1 0 1 1 1 1 1 1 1 -38951.55 172 119.54232
6 1000 10 2 1 1 1 0 1 1 1 1 1 1 -38988.33 172 149.27192
N P K S1 S2 S3 S4 S5 S6 S7 S8 S9 S10 logLik dim entropy criteria
BIC 1000 10 5 1 1 1 1 1 1 0 0 0 0 -37995.22 310 205.3947 39065.93
AIC 1000 10 5 1 1 1 1 1 1 1 1 0 0 -37843.71 382 179.5732 38225.71
ICL 1000 10 5 1 1 1 1 1 1 0 0 0 0 -37995.22 310 205.3947 39271.32
CteDim 1000 10 5 1 1 1 1 1 1 1 1 0 0 -37843.71 382 179.5732 38474.01