cluspca: Joint dimension reduction and clustering of continuous data.


Description

This function implements Factorial K-means (Vichi and Kiers, 2001) and Reduced K-means (De Soete and Carroll, 1994), as well as a compromise version of these two methods. The methods combine Principal Component Analysis for dimension reduction with K-means for clustering.
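As a point of comparison, the tandem approach (PCA followed by K-means, the alpha = 1 case listed under Arguments) can be sketched in base R. This is only an illustrative sketch, not the code cluspca runs: the built-in USArrests data and the names X, scores and km are assumptions made for this example.

```r
# Tandem baseline: PCA first, then K-means on the retained component scores.
# Base R only; USArrests is a built-in dataset used purely for illustration.
X <- scale(USArrests, center = TRUE, scale = TRUE)   # standardize the variables

ndim  <- 2   # number of retained dimensions
nclus <- 3   # number of clusters

pca    <- prcomp(X, center = FALSE, scale. = FALSE)  # X is already standardized
scores <- pca$x[, 1:ndim]                            # object scores on the first 2 components

set.seed(1234)                                       # reproducible random starts
km <- kmeans(scores, centers = nclus, nstart = 10)   # Hartigan-Wong is the default algorithm

table(km$cluster)   # cluster sizes
km$centers          # cluster centroids in the reduced space
```

The joint methods differ from this sketch in that the dimension reduction and the cluster allocation are estimated together rather than in two separate steps.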

Usage

cluspca(data, nclus, ndim, alpha = NULL, method = c("RKM","FKM"), 
center = TRUE, scale = TRUE, rotation = "none", nstart = 100, 
smartStart = NULL, seed = NULL)

## S3 method for class 'cluspca'
print(x, ...)

## S3 method for class 'cluspca'
summary(object, ...)

## S3 method for class 'cluspca'
fitted(object, mth = c("centers", "classes"), ...)

Arguments

data

Dataset with metric variables

nclus

Number of clusters (nclus = 1 returns the PCA solution)

ndim

Dimensionality of the solution

method

Specifies the method. Options are RKM for reduced K-means and FKM for factorial K-means (default = "RKM")

alpha

Adjusts for the relative importance of RKM and FKM in the objective function; alpha = 0.5 leads to reduced K-means, alpha = 0 to factorial K-means, and alpha = 1 reduces to the tandem approach (PCA followed by K-means)

center

A logical value indicating whether the variables should be shifted to be zero centered (default = TRUE)

scale

A logical value indicating whether the variables should be scaled to have unit variance before the analysis takes place (default = TRUE)

rotation

Specifies the method used to rotate the factors. Options are none for no rotation, varimax for varimax rotation with Kaiser normalization and promax for promax rotation (default = "none")

nstart

Number of starts (default = 100)

smartStart

If NULL then a random cluster membership vector is generated. Alternatively, a cluster membership vector can be provided as a starting solution

seed

An integer passed to set.seed() to seed the random number generator when smartStart = NULL (default = NULL)

x

For the print method, an object of class cluspca

object

For the summary method, an object of class cluspca

mth

For the fitted method, a character string that specifies the type of fitted value to return: "centers" for the cluster center vector of each observation, or "classes" for the cluster membership of each observation

...

Not used
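The role of alpha can be made concrete with a worked formula. What follows is a hedged sketch that assumes the compromise objective takes the weighted form below; the symbols are notation introduced here and do not appear on this page: X is the centered data matrix, A a loadings matrix with orthonormal columns, Z a binary cluster membership matrix, and G the matrix of cluster centroids in the reduced space.

```latex
\min_{A,\,Z,\,G}\; \phi(A, Z, G)
  = \alpha \,\lVert X - X A A^{\top} \rVert^{2}
  + (1 - \alpha)\, \lVert X A - Z G \rVert^{2},
  \qquad A^{\top} A = I .
```

Under this assumption, alpha = 0 leaves only the second term, which is the factorial K-means criterion. With alpha = 1 only the first (PCA) term remains, so the loadings are those of PCA and the clustering follows as a separate step, which is the tandem approach. For alpha = 0.5 the two terms add up to one half of the reduced K-means criterion, because for orthonormal A

```latex
\lVert X - Z G A^{\top} \rVert^{2}
  = \lVert X - X A A^{\top} \rVert^{2} + \lVert X A - Z G \rVert^{2} .
```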

Details

For the K-means part, the Hartigan-Wong algorithm is used by default.

The hidden print and summary methods print out some key components of an object of class cluspca.

The hidden fitted method returns cluster fitted values. If mth is "classes", this is a vector of cluster membership (the cluster component of the "cluspca" object). If mth is "centers", this is a matrix where each row is the cluster center for the observation. The rownames of the matrix are the cluster membership values.
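The two kinds of fitted values can be illustrated with base R's kmeans, which has a fitted method offering the same "centers" and "classes" options. This is an analogy only, not cluspca itself: kmeans names the argument method, whereas the cluspca method shown under Usage names it mth, and the USArrests data is an assumption made for the example.

```r
# Analogy only: stats::kmeans has a fitted() method with the same two options.
set.seed(1)
X  <- scale(USArrests)                       # built-in data, standardized
km <- kmeans(X, centers = 3, nstart = 10)

classes <- fitted(km, method = "classes")    # vector of cluster memberships
centers <- fitted(km, method = "centers")    # one row per observation: its cluster center

dim(centers)              # same number of rows and columns as X
head(rownames(centers))   # row names are the cluster membership values
```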

When nclus = 1 the function returns the PCA solution and plot(object) shows the corresponding biplot.
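For orientation, a plain PCA solution of the kind referred to here can be sketched in base R through the singular value decomposition. The names loadings and scores and the USArrests data are assumptions made for this example, and the scaling cluspca applies to its obscoord and attcoord output may differ from this sketch.

```r
# PCA via the SVD of the standardized data matrix (base R only).
X    <- scale(USArrests, center = TRUE, scale = TRUE)
ndim <- 2

sv       <- svd(X)
loadings <- sv$v[, 1:ndim]     # variable loadings (orthonormal columns)
scores   <- X %*% loadings     # object scores in the reduced space

# prcomp gives the same scores up to the sign of each component
pc <- prcomp(X)
max(abs(abs(scores) - abs(pc$x[, 1:ndim])))   # numerically close to zero
```

Calling biplot(pc) on the prcomp object draws the corresponding base-R biplot of objects and variables.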

Value

obscoord

Object scores

attcoord

Variable scores

centroid

Cluster centroids

cluster

Cluster membership

criterion

Optimal value of the objective function

size

The number of objects in each cluster

scale

A copy of scale in the return object

center

A copy of center in the return object

nstart

A copy of nstart in the return object

odata

A copy of data in the return object

References

De Soete, G., and Carroll, J. D. (1994). K-means clustering in a low-dimensional Euclidean space. In Diday E. et al. (Eds.), New Approaches in Classification and Data Analysis, Heidelberg: Springer, 212-219.

Vichi, M., and Kiers, H.A.L. (2001). Factorial K-means analysis for two-way data. Computational Statistics and Data Analysis, 37, 49-64.

See Also

clusmca, cluspcamix, tuneclus

Examples

library(clustrd)

#Reduced K-means with 3 clusters in 2 dimensions after 10 random starts
data(macro)
outRKM = cluspca(macro, 3, 2, method = "RKM", rotation = "varimax", 
scale = FALSE, nstart = 10)
summary(outRKM)
#Scatterplot (dimensions 1 and 2) and cluster description plot
plot(outRKM, cludesc = TRUE)

#Factorial K-means with 3 clusters in 2 dimensions 
#with a Reduced K-means starting solution
data(macro)
outFKM = cluspca(macro, 3, 2, method = "FKM", rotation = "varimax", 
scale = FALSE, smartStart = outRKM$cluster)
outFKM
#Scatterplot (dimensions 1 and 2) and cluster description plot
plot(outFKM, cludesc = TRUE)

#To get the Tandem approach (PCA(SVD) + K-means)
outTandem = cluspca(macro, 3, 2, alpha = 1, seed = 1234)
plot(outTandem)

#nclus = 1 just gives the PCA solution 
#outPCA = cluspca(macro, 1, 2)
#outPCA
#Scatterplot (dimensions 1 and 2) 
#plot(outPCA)

Example output

Loading required package: ggplot2
Loading required package: dummies
dummies-1.5.6 provided by Decision Patterns

Loading required package: grid
Solution with 3 clusters of sizes 12 (60%), 5 (25%), 3 (15%) in 2 dimensions. Variables were mean centered and unstandardized.

Cluster centroids:
            Dim.1   Dim.2
Cluster 1 -1.1627 -2.9713
Cluster 2 -3.5997  5.9900
Cluster 3 10.6502  1.9020

Variable scores:
      Dim.1   Dim.2
GDP  0.0638 -0.1169
LI  -0.1734 -0.0140
UR  -0.0610 -0.4849
IR   0.6662 -0.0344
TB  -0.7179  0.0678
NNS  0.0544  0.8633

Within cluster sum of squares by cluster:
[1] 113.4856  23.2023  45.8149
 (between_SS / total_SS =  79.72 %) 

Clustering vector:
  Australia      Canada     Finland      France       Spain      Sweden 
          1           1           1           1           1           1 
        USA Netherlands      Greece      Mexico    Portugal     Austria 
          1           2           3           3           3           1 
    Belgium     Denmark     Germany       Italy       Japan      Norway 
          2           1           1           1           2           2 
Switzerland          UK 
          2           1 

Objective criterion value: 431.7131 

Available output:

 [1] "obscoord"  "attcoord"  "centroid"  "cluster"   "criterion" "size"     
 [7] "odata"     "scale"     "center"    "nstart"   
$map

$parcoord

Solution with 3 clusters of sizes 12 (60%), 5 (25%), 3 (15%) in 2 dimensions. Variables were mean centered and unstandardized.

Cluster centroids:
            Dim.1   Dim.2
Cluster 1 -0.2945 -0.8344
Cluster 2 -3.9747  1.7404
Cluster 3  7.8024  0.4367

Variable scores:
      Dim.1   Dim.2
GDP  0.2272  0.9209
LI  -0.6554  0.1850
UR   0.0504 -0.1255
IR   0.6648 -0.1139
TB  -0.2666 -0.0412
NNS -0.0574  0.2956

Within cluster sum of squares by cluster:
[1] 26.6997 12.7474  1.0522
 (between_SS / total_SS =  87.62 %) 

Objective criterion value: 40.4992 

Available output:

 [1] "obscoord"  "attcoord"  "centroid"  "cluster"   "criterion" "size"     
 [7] "odata"     "scale"     "center"    "nstart"   
$map

$parcoord

Warning messages:
1: Removed 1 rows containing missing values (geom_segment). 
2: Removed 1 rows containing missing values (geom_text_repel). 

clustrd documentation built on May 8, 2019, 5:03 p.m.