Model-Based Clustering & Classification Using PGMMs

Description

Carries out model-based clustering or classification using parsimonious Gaussian mixture models. AECM algorithms are used for parameter estimation. The BIC or the ICL is used for model selection.

Usage

1
2
pgmmEM(x,rG=1:2,rq=1:2,class=NULL,icl=FALSE,zstart=2,cccStart=TRUE,loop=3,zlist=NULL,
		modelSubset=NULL,seed=123456,tol=0.1,relax=FALSE)

Arguments

x

A matrix or data frame such that rows correspond to observations and columns correspond to variables.

rG

The range of values for the number of components.

rq

The range of values for the number of factors.

class

If NULL then model-based clustering is performed. If a vector with length equal to the number of observations, then model-based classification is performed. In this latter case, the ith entry of class is either zero, indicating that the component membership of observation i is unknown, or it corresponds to the component membership of observation i. See Examples below.

icl

If TRUE then the ICL is used for model selection. Otherwise, the BIC is used for model selection.

zstart

A number that controls what starting values are used: (1) Random; (2) k-means; or (3) user-specified via zlist.

cccStart

If TRUE then random starting values are put through the CCC model and the resulting group memberships are used as starting values for the models specified in modelSubset. Only relevant for zstart=1. See Examples.

loop

A number specifying how many different random starts should be used. Only relevant for zstart=1.

zlist

A list comprising vectors of initial classifications such that zlist[[g]] gives the g-component starting values. Only relevant for zstart=3. See Examples.

modelSubset

A vector of strings giving the models to be used.

seed

A number giving the pseudo-random number seed to be used.

tol

A number specifying the epsilon value for the convergence criteria used in the AECM algorithms. For each algorithm, the criterion is based on the difference between the log-likelihood at an iteration and an asymptotic estimate of the log-likelihood at that iteration. This asymptotic estimate is based on the Aitken acceleration and details are given in the References. Values of tol greater than the default are not accepted.

relax

By default, the number of factors cannot exceed half the number of variables. Setting relax=TRUE relaxes this constraint and allows the number of factors to take any value up to three-quarters the number of variables.

Details

The data x are either clustered using the PGMM approach of McNicholas & Murphy (2005, 2008, 2010) or classified using the method described by McNicholas (2010). In either case, all 12 covariance structures given by McNicholas & Murphy (2010) are available. Parameter estimation is carried out using AECM algorithms, as described in McNicholas et al. (2010). Either the BIC or the ICL is used for model-selection. The number of AECM algorithms to be run depends on the range of values for the number of components rG, the range of values for the number of factors rq, and the number of models in modelSubset. Starting values are very important to the successful operation of these algorithms and so care must be taken in the interpretation of results.

Value

An object of class pgmm is a list with components:

map

A vector of integers, taking values in the range rG, indicating the maximum a posteriori classifications for the best model.

model

A string giving the name of the best model.

g

The number of components for the best model.

q

The number of factors for the best model.

zhat

A matrix giving the raw values upon which map is based.

load

The factor loadings matrix (Lambda) for the best model.

noisev

The Psi matrix for the best model.

plot_info

A list that stores information to enable plot.

summ_info

A list that stores information to enable summary.

In addition, the object will contain one of the following, depending on the value of icl.

bic

A number giving the BIC for each model.

icl

A number giving the ICL for each model.

Note

Dedicated print, plot, and summary functions are available for objects of class pgmm.

Author(s)

Paul D. McNicholas [aut, cre], Aisha ElSherbiny [aut], K. Raju Jampani [ctb], Aaron McDaid [aut], Brendan Murphy [aut], Larry Banks [ctb]

Maintainer: Paul D. McNicholas <mcnicholas@math.mcmaster.ca>

References

Paul D. McNicholas and T. Brendan Murphy (2010). Model-based clustering of microarray expression data via latent Gaussian mixture models. Bioinformatics 26(21), 2705-2712.

Paul D. McNicholas (2010). Model-based classification using latent Gaussian mixture models. Journal of Statistical Planning and Inference 140(5), 1175-1181.

Paul D. McNicholas, T. Brendan Murphy, Aaron F. McDaid and Dermot Frost (2010). Serial and parallel implementations of model-based clustering via parsimonious Gaussian mixture models. Computational Statistics and Data Analysis 54(3), 711-723.

Paul D. McNicholas and T. Brendan Murphy (2008). Parsimonious Gaussian mixture models. Statistics and Computing 18(3), 285-296.

Paul D. McNicholas and T. Brendan Murphy (2005). Parsimonious Gaussian mixture models. Technical Report 05/11, Department of Statistics, Trinity College Dublin.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
  ## Not run: 
# Wine clustering example with three random starts and the CUU model.
 data("wine")
 x<-wine[,-1]
 x<-scale(x)
 wine_clust<-pgmmEM(x,rG=1:4,rq=1:4,zstart=1,loop=3,modelSubset=c("CUU"))
 table(wine[,1],wine_clust$map)

# Wine clustering example with custom starts and the CUU model.
 data("wine")
 x<-wine[,-1]
 x<-scale(x)
 hcl<-hclust(dist(x)) 
 z<-list()
 for(g in 1:4){ 
	 z[[g]]<-cutree(hcl,k=g)
 } 
 wine_clust2<-pgmmEM(x,1:4,1:4,zstart=3,modelSubset=c("CUU"),zlist=z)
 table(wine[,1],wine_clust2$map)
 print(wine_clust2)
 summary(wine_clust2)

# Olive oil classification by region (there are three regions), with two-thirds of
# the observations taken as having known group memberships, using the CUC, CUU and 
# UCU models. 
 data("olive") 
 x<-olive[,-c(1,2)] 
 x<-scale(x) 
 cls<-olive[,1]
 for(i in 1:dim(olive)[1]){
	 if(i%%3==0){cls[i]<-0}
 }
 olive_class<-pgmmEM(x,rG=3:3,rq=4:6,cls,modelSubset=c("CUC","CUU", 
  "CUCU"),relax=TRUE)
 cls_ind<-(cls==0) 
 table(olive[cls_ind,1],olive_class$map[cls_ind])

# Another olive oil classification by region, but this time suppose we only know
# two-thirds of the labels for the first two areas but we suspect that there might 
# be a third or even a fourth area. 
 data("olive") 
 x<-olive[,-c(1,2)] 
 x<-scale(x) 
 cls2<-olive[,1]
 for(i in 1:dim(olive)[1]){
   if(i%%3==0||i>420){cls2[i]<-0}
 }
 olive_class2<-pgmmEM(x,2:4,4:6,cls2,modelSubset=c("CUU"),relax=TRUE)
 cls_ind2<-(cls2==0) 
 table(olive[cls_ind2,1],olive_class2$map[cls_ind2])
 
# Coffee clustering example using k-means starting values for all 12
# models with the ICL being used for model selection instead of the BIC.
 data("coffee")
 x<-coffee[,-c(1,2)]
 x<-scale(x)
 coffee_clust<-pgmmEM(x,rG=2:3,rq=1:3,zstart=2,icl=TRUE)
 table(coffee[,1],coffee_clust$map)
 plot(coffee_clust)
 plot(coffee_clust,onlyAll=TRUE)
  
## End(Not run)
  
# Coffee clustering example using k-means starting values for the UUU model, i.e., the
# mixture of factor analyzers model, for G=2 and q=1.
 data("coffee")
 x<-coffee[,-c(1,2)]
 x<-scale(x)
 coffee_clust_mfa<-pgmmEM(x,2:2,1:1,zstart=2,modelSubset=c("UUU"))
 table(coffee[,1],coffee_clust_mfa$map)
 

Want to suggest features or report bugs for rdrr.io? Use the GitHub issue tracker.