Rmixmod-package: Rmixmod a MIXture MODelling package

Description

Rmixmod is a package based on the existing MIXMOD software. MIXMOD is a tool for fitting a mixture model of multivariate Gaussian or multinomial components to a given data set from a clustering, density estimation or discriminant analysis point of view.

Details

The general purpose of the package is to discover, or explain, group structures in multivariate data sets with unknown classes (cluster analysis or clustering) or known classes (discriminant analysis or classification). It is an exploratory data analysis tool for solving clustering and classification problems. It can also be regarded as a semi-parametric tool for estimating densities with Gaussian mixture distributions and multinomial distributions.

Mathematically, a mixture probability density function (pdf) f is a weighted sum of K component densities:

f(x_i | θ) = ∑_{k=1}^{K} p_k h(x_i | λ_k)

where h(. | λ_k) denotes a d-dimensional distribution parametrized by λ_k. The parameters are the mixing proportions p_k and the component parameters λ_k.

In the Gaussian case, h is the density of a Gaussian distribution with mean μ_k and variance matrix Σ_k, and thus λ_k = (μ_k,Σ_k).
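The mixture density above can be evaluated directly in the Gaussian case. The following is a minimal base R sketch, not code from Rmixmod: the function names `gaussian_density` and `mixture_density` and the parameter values are made up for illustration.

```r
# Density of a d-dimensional Gaussian with mean mu and variance matrix Sigma,
# evaluated at each row of x (base R only).
gaussian_density <- function(x, mu, Sigma) {
  x <- matrix(x, ncol = length(mu))
  centered <- sweep(x, 2, mu)                              # subtract mu from every row
  maha <- rowSums((centered %*% solve(Sigma)) * centered)  # squared Mahalanobis distances
  exp(-0.5 * maha) / sqrt((2 * pi)^length(mu) * det(Sigma))
}

# Mixture density f(x | theta) = sum_k p_k h(x | lambda_k)
mixture_density <- function(x, p, mu, Sigma) {
  dens <- 0
  for (k in seq_along(p)) {
    dens <- dens + p[k] * gaussian_density(x, mu[[k]], Sigma[[k]])
  }
  dens
}

# Illustrative (made-up) parameters for K = 2 components in d = 2 dimensions
p     <- c(0.3, 0.7)
mu    <- list(c(0, 0), c(3, 3))
Sigma <- list(diag(2), matrix(c(2, 0.5, 0.5, 1), 2, 2))

# Density at the two component means
mixture_density(rbind(c(0, 0), c(3, 3)), p, mu, Sigma)
```

In one dimension `gaussian_density` reduces to `dnorm`, which gives a quick way to check the formula.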

In the qualitative case, h is a multinomial distribution and λ_k=(a_k,ε_k) is the parameter of the distribution.

Estimation of the mixture parameters is performed either through maximum likelihood via the EM (Expectation Maximization, Dempster et al. 1977) or SEM (Stochastic EM, Celeux and Diebolt 1985) algorithms, or through classification maximum likelihood via the CEM algorithm (Classification EM, Celeux and Govaert 1992). These three algorithms can be chained to obtain original fitting strategies (e.g. CEM, then EM started from the CEM results) that combine the advantages of each in the estimation process. As mixture problems usually have multiple relative maxima, the program may produce different results depending on the initial estimates supplied by the user. If the user does not supply initial estimates, several initialization procedures are available (random centers, for instance).
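To make the EM iteration concrete, here is a minimal base R sketch for a univariate Gaussian mixture, initialized with randomly chosen observations as centers. It is an illustration under simplifying assumptions (one dimension, free proportions and variances, no safeguards against degenerate components), not the implementation used by MIXMOD; the function name `em_gaussian_mixture` is hypothetical.

```r
# Minimal EM for a univariate Gaussian mixture with K components (base R only).
em_gaussian_mixture <- function(x, K = 2, n_iter = 200, tol = 1e-8) {
  n <- length(x)
  # Initialization: K randomly chosen observations as centers
  mu    <- sample(x, K)
  sigma <- rep(sd(x), K)
  p     <- rep(1 / K, K)
  loglik_old <- -Inf
  for (iter in seq_len(n_iter)) {
    # E step: posterior probabilities t_ik[i, k] that x[i] comes from component k
    weighted <- sapply(seq_len(K), function(k) p[k] * dnorm(x, mu[k], sigma[k]))
    t_ik <- weighted / rowSums(weighted)
    # Observed-data log-likelihood at the parameters used in this E step
    loglik <- sum(log(rowSums(weighted)))
    # M step: update proportions, means and standard deviations
    n_k   <- colSums(t_ik)
    p     <- n_k / n
    mu    <- colSums(t_ik * x) / n_k
    sigma <- sqrt(colSums(t_ik * outer(x, mu, "-")^2) / n_k)
    # Stop when the log-likelihood no longer increases noticeably
    if (loglik - loglik_old < tol) break
    loglik_old <- loglik
  }
  list(p = p, mu = mu, sigma = sigma, loglik = loglik)
}

set.seed(1)
fit <- em_gaussian_mixture(faithful$waiting, K = 2)
fit$p
fit$mu
```

Running the function with different seeds shows the dependence on initial estimates described above: each start may converge to a different relative maximum.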

It is possible to constrain some input parameters. For example, dispersions can be constrained to be equal between classes.

In the Gaussian case, fourteen models are implemented. They are based on the eigenvalue decomposition of the variance matrix and are the most commonly used. They depend on constraints on the variance matrix, such as the same variance matrix between clusters or a spherical variance matrix, and they are suitable for data sets in any dimension.
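The sketch below illustrates the kind of eigenvalue decomposition these models build on, assuming the usual volume/orientation/shape parametrization Σ = λ D A D' with |A| = 1. The page does not spell out the formula, so treat this form, the function name `decompose_variance` and the example matrix as assumptions.

```r
# Decompose a variance matrix as Sigma = lambda * D %*% A %*% t(D), where
#   lambda = |Sigma|^(1/d)             is the volume,
#   D      (matrix of eigenvectors)    is the orientation,
#   A      (diagonal, determinant 1)   is the shape.
decompose_variance <- function(Sigma) {
  d   <- nrow(Sigma)
  eig <- eigen(Sigma, symmetric = TRUE)
  lambda <- prod(eig$values)^(1 / d)
  list(lambda = lambda,
       D = eig$vectors,
       A = diag(eig$values / lambda, nrow = d))
}

Sigma <- matrix(c(2, 0.5, 0.5, 1), 2, 2)   # illustrative variance matrix
dec   <- decompose_variance(Sigma)

# Rebuilding Sigma from its three parts recovers the original matrix
rebuilt <- dec$lambda * dec$D %*% dec$A %*% t(dec$D)
all.equal(rebuilt, Sigma)
```

Constraining λ, D or A to be equal across clusters, or forcing A to be the identity (spherical case), is what yields a family of more or less parsimonious models.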

In the qualitative case, five multinomial models are available. They are based on a reparametrization of the multinomial probabilities.

In both cases, the models and the number of clusters can be chosen by different criteria: BIC (Bayesian Information Criterion), ICL (Integrated Completed Likelihood, a classification version of BIC), NEC (Normalized Entropy Criterion), or Cross-Validation (CV).
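As an illustration of criterion-based model choice, the base R sketch below compares two univariate Gaussian models with known classes: one with free variances per class and one with a common variance. It assumes the "smaller is better" form BIC = -2 log L + ν log(n), where ν is the number of free parameters, and uses the likelihood of the data with known labels. The helper names are hypothetical and this is not necessarily how Rmixmod computes the criterion.

```r
# BIC in the "smaller is better" convention: -2 * logLik + nu * log(n)
bic <- function(loglik, nu, n) -2 * loglik + nu * log(n)

x <- iris$Petal.Length
z <- iris$Species
k <- as.integer(z)          # class index of each observation
n <- length(x)
K <- nlevels(z)

# Closed-form maximum likelihood estimates with known classes
p_hat     <- as.numeric(table(z)) / n
mu_hat    <- as.numeric(tapply(x, z, mean))
resid     <- x - mu_hat[k]
var_free  <- as.numeric(tapply(resid^2, z, mean))  # one variance per class
var_equal <- mean(resid^2)                         # same variance for all classes

# Log-likelihood of (x, z) for a given vector of per-observation variances
loglik <- function(sigma2) {
  sum(log(p_hat[k]) + dnorm(x, mu_hat[k], sqrt(sigma2), log = TRUE))
}

# Free parameters: (K - 1) proportions + K means + K variances (or 1 common variance)
bic_free  <- bic(loglik(var_free[k]),       nu = (K - 1) + K + K, n = n)
bic_equal <- bic(loglik(rep(var_equal, n)), nu = (K - 1) + K + 1, n = n)

c(free = bic_free, equal = bic_equal)  # the model with the smaller value is preferred
```

The penalty term ν log(n) is what keeps the more flexible model from winning automatically: it must improve the log-likelihood enough to pay for its extra parameters.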

Author(s)

Author: Florent Langrognet, Remi Lebret, Christian Poli and Serge Iovleff, with contributions from C. Biernacki, G. Celeux and G. Govaert [email protected]

References

Biernacki C., Celeux G., Govaert G., Langrognet F., 2006. "Model-Based Cluster and Discriminant Analysis with the MIXMOD Software". Computational Statistics and Data Analysis, vol. 51/2, pp. 587-600.

Examples

  ## Not run: 
  ## Clustering Analysis
  # load quantitative data set
  data(geyser)
  # Clustering in gaussian case
  xem1<-mixmodCluster(geyser,3)
  summary(xem1)
  plot(xem1)
  hist(xem1)

  # load qualitative data set
  data(birds)
  # Clustering in multinomial case
  xem2<-mixmodCluster(birds, 2)
  summary(xem2)
  barplot(xem2)

  # load heterogeneous data set
  data(finance)
  # Clustering in composite case
  xem3<-mixmodCluster(finance,2:6)
  summary(xem3)

  ## Discriminant Analysis
  # start by extracting 10 observations from the iris data set
  remaining.obs<-sample(1:nrow(iris),10)
  # then run a mixmodLearn() analysis without those 10 observations
  learn<-mixmodLearn(iris[-remaining.obs,1:4], iris$Species[-remaining.obs])
  # create a MixmodPredict to predict those 10 observations
  prediction <- mixmodPredict(data=iris[remaining.obs,1:4], classificationRule=learn["bestResult"])
  # show results
  prediction
  # compare prediction with real results
  paste("accuracy= ", mean(as.integer(iris$Species[remaining.obs]) == prediction["partition"]) * 100,
        "%", sep = "")
  
## End(Not run)

Rmixmod documentation built on March 11, 2019, 9:08 a.m.