clusmca: Joint dimension reduction and clustering of categorical data.


Description

This function implements MCA K-means (Hwang, Dillon and Takane, 2006), i-FCB (Iodice D'Enza and Palumbo, 2013) and Cluster Correspondence Analysis (van de Velden, Iodice D'Enza and Palumbo, 2017). These methods combine variants of Correspondence Analysis for dimension reduction with K-means for clustering.

Usage

clusmca(data, nclus, ndim, method = c("clusCA", "iFCB", "MCAk"),
        alphak = .5, nstart = 100, smartStart = NULL, gamma = TRUE,
        binary = FALSE, seed = NULL)

## S3 method for class 'clusmca'
print(x, ...)

## S3 method for class 'clusmca'
summary(object, ...)

## S3 method for class 'clusmca'
fitted(object, mth = c("centers", "classes"), ...)

Arguments

data

Dataset with categorical variables

nclus

Number of clusters (nclus = 1 returns the MCA solution; see Details)

ndim

Dimensionality of the solution

method

Specifies the method. Options are "MCAk" for MCA K-means, "iFCB" for Iterative Factorial Clustering of Binary variables, and "clusCA" for Cluster Correspondence Analysis (default = "clusCA")

alphak

Non-negative scalar that adjusts the relative importance of MCA (alphak = 1) and K-means (alphak = 0) in the solution (default = .5). Used only in combination with method = "MCAk" (see the sketch after this argument list)

nstart

Number of random starts (default = 100)

smartStart

If NULL then a random cluster membership vector is generated. Alternatively, a cluster membership vector can be provided as a starting solution

gamma

If TRUE, a scaling adjustment is applied so that the object and variable scores have a similar spread (default = TRUE)

seed

An integer passed to set.seed() to initialize the random number generator when smartStart = NULL (default = NULL).

binary

If TRUE, the input data are treated as 0-1 (dummy) indicator variables (default = FALSE).

x

For the print method, an object of class clusmca

object

For the summary and fitted methods, an object of class clusmca

mth

For the fitted method, a character string specifying the type of fitted value to return: "centers" for the matrix of cluster centers assigned to the observations, or "classes" for the vector of cluster memberships

...

Not used
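
The following sketch shows one way alphak and smartStart might be used together. It assumes cmc has been preprocessed as in the Examples section below, and the warm start is simply the membership vector of an earlier run; neither choice is prescribed by the package.

# MCA K-means weighting the MCA part more heavily (alphak closer to 1)
out1 = clusmca(cmc, 3, 2, method = "MCAk", alphak = 0.8, nstart = 10, seed = 1234)

# Reuse the memberships of a previous solution as a warm start
# (smartStart expects a cluster membership vector)
prev = clusmca(cmc, 3, 2, method = "clusCA", nstart = 5, seed = 1234)
out2 = clusmca(cmc, 3, 2, method = "MCAk", alphak = 0.5, smartStart = prev$cluster)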

Details

For the K-means part, the algorithm of Hartigan-Wong is used by default.
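
As a rough illustration only (not the package's internal alternating update), a Hartigan-Wong K-means run on the object scores of a fitted solution can be reproduced with stats::kmeans; here out is assumed to be a clusmca object fitted as in the Examples section.

km = kmeans(out$obscoord, centers = 3, nstart = 10, algorithm = "Hartigan-Wong")
table(km$cluster, out$cluster)   # compare with the memberships returned by clusmca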

The hidden print and summary methods print out some key components of an object of class clusmca.

The hidden fitted method returns cluster fitted values. If mth = "classes", this is the vector of cluster memberships (the cluster component of the "clusmca" object). If mth = "centers", this is a matrix in which each row is the centroid of the cluster the corresponding observation is assigned to; the row names of the matrix give the cluster memberships.
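
For example (a small sketch, reusing the outclusCA object fitted in the Examples section):

head(fitted(outclusCA, mth = "classes"))   # cluster membership per observation
head(fitted(outclusCA, mth = "centers"))   # assigned cluster center per observation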

When nclus = 1, the function returns the MCA solution, with objects in principal coordinates and variables in standard coordinates; plot(object) then shows the corresponding asymmetric biplot.

Value

obscoord

Object scores

attcoord

Variable scores

centroid

Cluster centroids

cluster

Cluster membership

criterion

Optimal value of the objective criterion

size

The number of objects in each cluster

nstart

A copy of the nstart argument

odata

A copy of the input data (see the access sketch below this list)
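
A short sketch of how these components might be inspected after a fit (out is assumed to be a clusmca object, e.g. fitted as in the Examples section):

out$centroid          # cluster centroids in the reduced space
table(out$cluster)    # cluster sizes (same information as out$size)
out$criterion         # optimal value of the objective criterion
head(out$obscoord)    # object scores used in the biplot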

References

Hwang, H., Dillon, W. R., and Takane, Y. (2006). An extension of multiple correspondence analysis for identifying heterogeneous subgroups of respondents. Psychometrika, 71, 161-171.

Iodice D'Enza, A., and Palumbo, F. (2013). Iterative factor clustering of binary data. Computational Statistics, 28(2), 789-807.

van de Velden, M., Iodice D'Enza, A., and Palumbo, F. (2017). Cluster correspondence analysis. Psychometrika, 82(1), 158-185.

See Also

cluspca, cluspcamix, tuneclus

Examples

data(cmc)
# Preprocessing: values of wife's age and number of children were categorized 
# into three groups based on quartiles
cmc$W_AGE = ordered(cut(cmc$W_AGE, c(16,26,39,49), include.lowest = TRUE))
levels(cmc$W_AGE) = c("16-26","27-39","40-49") 
cmc$NCHILD = ordered(cut(cmc$NCHILD, c(0,1,4,17), right = FALSE))
levels(cmc$NCHILD) = c("0","1-4","5 and above")

#Cluster Correspondence Analysis solution with 3 clusters in 2 dimensions 
#after 10 random starts
outclusCA = clusmca(cmc, 3, 2, method = "clusCA", nstart = 10, seed = 1234)
outclusCA
#Scatterplot (dimensions 1 and 2)
plot(outclusCA)

#MCA K-means solution with 3 clusters in 2 dimensions after 10 random starts
outMCAk = clusmca(cmc, 3, 2, method = "MCAk", nstart = 10, seed = 1234)
outMCAk
#Scatterplot (dimensions 1 and 2)
plot(outMCAk)

#nclus = 1 just gives the MCA solution
#outMCA = clusmca(cmc, 1, 2)
#outMCA
#Scatterplot (dimensions 1 and 2) 
#asymmetric biplot with scaling gamma = TRUE
#plot(outMCA)

Example output

Solution with 3 clusters of sizes 666 (45.2%), 614 (41.7%), 193 (13.1%) in 2 dimensions. 

Cluster centroids:
            Dim.1   Dim.2
Cluster 1 -0.3824  0.6085
Cluster 2  0.8763 -0.3057
Cluster 3 -1.4683 -1.1271

Within cluster sum of squares by cluster:
[1] 191.5892 147.1564 137.3443
 (between_SS / total_SS =  76.32 %) 

Objective criterion value: 487.676 

Available output:

[1] "obscoord"  "attcoord"  "centroid"  "cluster"   "criterion" "size"     
[7] "odata"     "nstart"   
Solution with 3 clusters of sizes 633 (43%), 611 (41.5%), 229 (15.5%) in 2 dimensions. 

Cluster centroids:
            Dim.1   Dim.2
Cluster 1  0.0291 -0.0116
Cluster 2 -0.0138  0.0292
Cluster 3 -0.0436 -0.0459

Within cluster sum of squares by cluster:
[1] 0.0065 0.0071 0.0054
 (between_SS / total_SS =  99.13 %) 

Objective criterion value: 8.2486 

Available output:

[1] "obscoord"  "attcoord"  "centroid"  "cluster"   "criterion" "size"     
[7] "odata"     "nstart"   
