# knitr setup: global chunk options plus a hook that sets the graphical
# parameters before each figure-producing chunk.
library(knitr)
opts_chunk$set(fig.align = "center",
               out.width = "90%",
               fig.width = 6,
               fig.height = 5.5,
               dev.args = list(pointsize = 10),
               par = TRUE,       # needed for setting hook
               collapse = TRUE,  # collapse input & output code in chunks
               warning = FALSE)
knit_hooks$set(par = function(before, options, envir) {
  # Apply a consistent base-graphics layout just before plotting chunks.
  if (before && options$fig.show != "none") {
    par(family = "sans", mar = c(4.1, 4.1, 1.1, 1.1),
        mgp = c(3, 1, 0), tcl = -0.5)
  }
})
set.seed(1)  # for exact reproducibility
mclust is a contributed R package for model-based clustering, classification, and density estimation based on finite normal mixture modelling. It provides functions for parameter estimation via the EM algorithm for normal mixture models with a variety of covariance structures, and functions for simulation from these models. Also included are functions that combine model-based hierarchical clustering, EM for mixture estimation and the Bayesian Information Criterion (BIC) in comprehensive strategies for clustering, density estimation and discriminant analysis. Additional functionalities are available for displaying and visualizing fitted models along with clustering, classification, and density estimation results.
This document gives a quick tour of mclust (version `r packageVersion("mclust")`) functionalities. It was written in R Markdown, using the knitr package for production.
See `help(package = "mclust")` for further details and the references provided by `citation("mclust")`.
# Attach mclust and print its startup message verbatim.
library(mclust)
cat(mclust:::mclustStartupMessage(), sep = "")
# Model-based clustering of the diabetes data.
data(diabetes)
class <- diabetes$class
table(class)
X <- diabetes[, -1]  # drop the class label column
head(X)
clPairs(X, class)

# Model selection: BIC over covariance structures and numbers of components.
BIC <- mclustBIC(X)
plot(BIC)
summary(BIC)

# Fit the model chosen by BIC and inspect it.
mod1 <- Mclust(X, x = BIC)
summary(mod1, parameters = TRUE)
plot(mod1, what = "classification")
table(class, mod1$classification)
plot(mod1, what = "uncertainty")

# Alternative criterion: ICL.
ICL <- mclustICL(X)
summary(ICL)
plot(ICL)

# Bootstrap likelihood-ratio test for the number of mixture components.
LRT <- mclustBootstrapLRT(X, modelName = "VVV")
LRT
The EM algorithm is used by mclust for maximum likelihood estimation. Initialisation of EM is performed using the partitions obtained from agglomerative hierarchical clustering. For details see `help(mclustBIC)` or `help(Mclust)`, and `help(hc)`.
# Compare EM initialisations from three hierarchical-clustering variants.
(hc1 <- hc(X, modelName = "VVV", use = "SVD"))
BIC1 <- mclustBIC(X, initialization = list(hcPairs = hc1))  # default
summary(BIC1)

(hc2 <- hc(X, modelName = "VVV", use = "VARS"))
BIC2 <- mclustBIC(X, initialization = list(hcPairs = hc2))
summary(BIC2)

(hc3 <- hc(X, modelName = "EEE", use = "SVD"))
BIC3 <- mclustBIC(X, initialization = list(hcPairs = hc3))
summary(BIC3)
Update BIC by merging the best results:
# Merge the three BIC tables, keeping the best value for each model/G pair.
BIC <- mclustBICupdate(BIC1, BIC2, BIC3)
summary(BIC)
plot(BIC)
Univariate fit using random starting points obtained by creating random agglomerations (see `help(randomPairs)`) and merging best results:
# Univariate fit with 20 random-agglomeration starts, merged via BIC.
set.seed(20181116)
data(galaxies, package = "MASS")
galaxies <- galaxies / 1000  # rescale velocities
BIC <- NULL
for (j in 1:20) {
  rBIC <- mclustBIC(galaxies, verbose = FALSE,
                    initialization = list(hcPairs = randomPairs(galaxies)))
  BIC <- mclustBICupdate(BIC, rBIC)  # keep the best BIC seen so far
}
summary(BIC)
plot(BIC)
mod <- Mclust(galaxies, x = BIC)
summary(mod)
# Discriminant analysis on iris using the EDDA (single-component) model type.
data(iris)
class <- iris$Species
table(class)
X <- iris[, 1:4]
head(X)
mod2 <- MclustDA(X, class, modelType = "EDDA")
summary(mod2)
plot(mod2, what = "scatterplot")
plot(mod2, what = "classification")
# Discriminant analysis on the banknote data with the default MclustDA model.
data(banknote)
class <- banknote$Status
table(class)
X <- banknote[, -1]  # drop the Status label column
head(X)
mod3 <- MclustDA(X, class)
summary(mod3)
plot(mod3, what = "scatterplot")
plot(mod3, what = "classification")
# 10-fold cross-validation of both classifiers.
cv <- cvMclustDA(mod2, nfold = 10)
str(cv)
unlist(cv[3:4])  # components 3-4: presumably CV error and its SE — see help(cvMclustDA)
cv <- cvMclustDA(mod3, nfold = 10)
str(cv)
unlist(cv[3:4])
# Univariate density estimation on the acidity data, with diagnostics.
data(acidity)
mod4 <- densityMclust(acidity)
summary(mod4)
plot(mod4, what = "BIC")
plot(mod4, what = "density", data = acidity, breaks = 15)
plot(mod4, what = "diagnostic", type = "cdf")
plot(mod4, what = "diagnostic", type = "qq")
# Bivariate density estimation on the faithful data.
data(faithful)
mod5 <- densityMclust(faithful)
summary(mod5)
plot(mod5, what = "BIC")
plot(mod5, what = "density")
plot(mod5, what = "density", type = "hdr")
plot(mod5, what = "density", type = "hdr", data = faithful, points.cex = 0.5)
plot(mod5, what = "density", type = "persp")
# Nonparametric bootstrap inference for the clustering model fitted to diabetes.
boot1 <- MclustBootstrap(mod1, nboot = 999, type = "bs")
summary(boot1, what = "se")
summary(boot1, what = "ci")
par(mfrow = c(4, 3))
plot(boot1, what = "pro")
plot(boot1, what = "mean")
par(mfrow = c(1, 1))  # restore the default layout
# Nonparametric bootstrap inference for the acidity density model.
boot4 <- MclustBootstrap(mod4, nboot = 999, type = "bs")
summary(boot4, what = "se")
summary(boot4, what = "ci")
par(mfrow = c(2, 2))
plot(boot4, what = "pro")
plot(boot4, what = "mean")
par(mfrow = c(1, 1))  # restore the default layout
# Dimension reduction for visualising the clustering model.
mod1dr <- MclustDR(mod1)
summary(mod1dr)
plot(mod1dr, what = "pairs")
plot(mod1dr, what = "boundaries", ngrid = 200)

# Constrained version (lambda = 1).
mod1dr <- MclustDR(mod1, lambda = 1)
summary(mod1dr)
plot(mod1dr, what = "scatterplot")
plot(mod1dr, what = "boundaries", ngrid = 200)
# Dimension reduction for the two classification models.
mod2dr <- MclustDR(mod2)
summary(mod2dr)
plot(mod2dr, what = "scatterplot")
plot(mod2dr, what = "boundaries", ngrid = 200)

mod3dr <- MclustDR(mod3)
summary(mod3dr)
plot(mod3dr, what = "scatterplot")
plot(mod3dr, what = "boundaries", ngrid = 200)
Most of the graphs produced by mclust use colors that by default are defined in the following options:
# Inspect the current default colour palettes used by mclust plots.
mclust.options("bicPlotColors")
mclust.options("classPlotColors")
The first option controls colors used for plotting BIC, ICL, etc. curves, whereas the second option is used to assign colors for indicating clusters or classes when plotting data.
Color-blind-friendly palettes can be defined and assigned to the above options as follows:
# Install a colour-blind-friendly palette into mclust's plotting options.
cbPalette <- c("#E69F00", "#56B4E9", "#009E73", "#999999",
               "#F0E442", "#0072B2", "#D55E00", "#CC79A7")
bicPlotColors <- mclust.options("bicPlotColors")
bicPlotColors[1:14] <- c(cbPalette, cbPalette[1:6])  # recycle to cover 14 models
mclust.options("bicPlotColors" = bicPlotColors)
mclust.options("classPlotColors" = cbPalette)

# Redraw the iris example with the new palettes.
clPairs(iris[, -5], iris$Species)
mod <- Mclust(iris[, -5])
plot(mod, what = "BIC")
plot(mod, what = "classification")
The above color definitions are adapted from http://www.cookbook-r.com/Graphs/Colors_(ggplot2)/, but users can easily define their own palettes if needed.
Scrucca L., Fop M., Murphy T. B. and Raftery A. E. (2016) mclust 5: clustering, classification and density estimation using Gaussian finite mixture models, The R Journal, 8/1, pp. 205-233. https://journal.r-project.org/archive/2016/RJ-2016-021/RJ-2016-021.pdf
Fraley C. and Raftery A. E. (2002) Model-based clustering, discriminant analysis and density estimation, Journal of the American Statistical Association, 97/458, pp. 611-631.
Fraley C., Raftery A. E., Murphy T. B. and Scrucca L. (2012) mclust Version 4 for R: Normal Mixture Modeling for Model-Based Clustering, Classification, and Density Estimation. Technical Report No. 597, Department of Statistics, University of Washington.
# Record the R version, platform, and attached packages used to build this document.
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.