part: Partitioning Algorithm based on Recursive Thresholding


Description

The PART method estimates the number of clusters in a data set. It is based on recursive application of the Gap statistic and can discover both top-level clusters and sub-clusters nested within the main clusters.

Usage

part(X,Kmax=10,minSize=8,minDist=NULL,cl.lab=NULL,...)

Arguments

X

a numeric data matrix whose rows are to be clustered using a specified clustering algorithm (default is hierarchical clustering with average linkage and Euclidean distance, see below for other options).

Kmax

the maximum number of clusters to be evaluated in the first (global) run.

minSize

the minimum number of objects required in each cluster.

minDist

optional stopping threshold giving the minimum distance required between the two tentative clusters of a tentative split. If unspecified, the minimum distance is determined by the value of q, see below.

cl.lab

optional list of length Kmax giving vectors of cluster labels for the rows in X when partitioned into 1, ..., Kmax clusters.

...

other optional parameters. These include the parameters B (default 100) and ref.gen (default "PC") to be passed on to gap, as well as:

q:

the fraction of dendrogram heights (from the top) used to determine the stopping threshold; only applied if minDist is unspecified. Default is 0.25. Set q=1 if no stopping threshold should be applied.

Kmax.rec:

the maximum number of clusters to consider in each recursive run. Default is 5.

cl.method:

the desired clustering method. Options currently include "hclust" (default) and "kmeans".

linkage:

the desired linkage to be applied if cl.method="hclust". Default is "average", see the parameter method in hclust for other options.

dist.method:

the desired distance measure to be applied if cl.method="hclust". Default is "euclidean". Other options include those supported by dist (under method), "sq.euclidean" (squared Euclidean distance) and "cor" (1 minus correlation distance).

cor.method:

the correlation measure to be used if dist.method="cor". Default is "pearson", see the parameter method in cor for other options.

nstart:

the number of initial center sets to be applied if cl.method="kmeans". Default is 10. See kmeans for details on this.
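
As an illustration of the cl.lab argument and the distance options described above, the following base-R sketch builds a list of label vectors for 1, ..., Kmax clusters from a single hierarchical clustering. The toy matrix, the object names (d_euc, d_sq, d_cor, hc) and the choice of Spearman correlation are assumptions made for the example only; that part accepts exactly this structure is inferred from the description of cl.lab, so the final call is left commented out.

```r
set.seed(42)
# Toy data (hypothetical): 30 rows to be clustered, 6 columns.
X <- matrix(rnorm(30 * 6), nrow = 30)
Kmax <- 5

# Distance measures corresponding to the dist.method options, in base R:
d_euc <- dist(X, method = "euclidean")                 # "euclidean" (default)
d_sq  <- d_euc^2                                       # "sq.euclidean"
d_cor <- as.dist(1 - cor(t(X), method = "spearman"))   # "cor" with cor.method = "spearman"

# Cluster the rows once, then cut the tree into 1, ..., Kmax groups.
hc <- hclust(d_cor, method = "average")
cl.lab <- lapply(seq_len(Kmax), function(k) cutree(hc, k = k))

# Each element is a label vector with one entry per row of X.
sapply(cl.lab, function(lab) length(unique(lab)))

# Assumed usage, not run here:
# res <- part(X, Kmax = Kmax, cl.lab = cl.lab, B = 10)
```

Note that cor works on columns, so the matrix is transposed to obtain correlations between rows.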

Details

PART applies the Gap statistic (Tibshirani et al., 2001) to obtain a global estimate of the number of clusters. If more than one cluster is found, the Gap statistic is re-optimized on each subset of cases corresponding to a cluster. If only one cluster is found, a tentative binary split is made and the objective function is re-optimized on each of the two tentative clusters. The procedure is repeated recursively until a stopping threshold is reached or the subset under evaluation has fewer than 2*minSize cases. Significant clusters (those discovered by Gap) are returned; a tentative cluster is returned only if significant sub-clusters were found solely in the other branch of the tentative split. See Nilsen et al. (2013, preprint) for more details.
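
To make the single Gap step concrete, here is a minimal base-R sketch of a Gap-statistic estimate of the number of clusters with average-linkage hierarchical clustering. It is not the implementation used by gap or part: the function names (within_dispersion, log_wk, gap_sketch) are hypothetical, the reference data are drawn uniformly over the range of each column rather than with the default ref.gen = "PC", and the recursive and tentative-split logic of PART is not reproduced.

```r
# Pooled within-cluster sum of squared distances to the cluster centroids.
within_dispersion <- function(X, labels) {
  sum(sapply(split(seq_len(nrow(X)), labels), function(idx) {
    Xi <- X[idx, , drop = FALSE]
    sum(sweep(Xi, 2, colMeans(Xi))^2)
  }))
}

# log(W_k) for k = 1, ..., Kmax using average linkage and Euclidean distance.
log_wk <- function(X, Kmax) {
  hc <- hclust(dist(X), method = "average")
  sapply(seq_len(Kmax), function(k) log(within_dispersion(X, cutree(hc, k = k))))
}

gap_sketch <- function(X, Kmax = 5, B = 10) {
  stopifnot(is.matrix(X), Kmax >= 2, B >= 2)
  obs <- log_wk(X, Kmax)
  # Reference data sets: uniform over each column's range (an assumption).
  ref <- replicate(B, {
    Xref <- apply(X, 2, function(col) runif(length(col), min(col), max(col)))
    log_wk(Xref, Kmax)
  })                                         # Kmax x B matrix
  gap <- rowMeans(ref) - obs
  s <- apply(ref, 1, sd) * sqrt(1 + 1 / B)   # uses the sample standard deviation
  # Smallest k with Gap(k) >= Gap(k+1) - s(k+1); fall back to Kmax otherwise.
  ok <- which(gap[-Kmax] >= gap[-1] - s[-1])
  hatK <- if (length(ok) > 0) ok[1] else Kmax
  list(hatK = hatK, gap = gap, s = s)
}

# Toy example: two well-separated groups of 20 points in two dimensions.
set.seed(1)
X_demo <- rbind(matrix(rnorm(40, mean = 0), ncol = 2),
                matrix(rnorm(40, mean = 6), ncol = 2))
res_demo <- gap_sketch(X_demo, Kmax = 4, B = 10)
res_demo$hatK
```

In PART, a step of this kind would be applied first to all rows and then again to each resulting subset, with Kmax.rec taking the role of Kmax in the recursive runs.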

Value

hatK

the best number of clusters according to PART.

lab.hatK

a vector of the same length as the number of rows in X, assigning a group label to each case (row) in X based on the best partition as evaluated by PART.

outliers

a vector indicating which objects are classified as outliers by PART. If no objects are classified as outliers, NULL is returned.

Author(s)

Gro Nilsen

References

Nilsen et al., "Identifying clusters in genomics data by recursive partitioning", 2013 (in review)

See Also

gap, plotHeatmap

Examples

## Example 1 ##
#Load a simulated data set with 5 clusters
data(exData1)
X = exData1$X
groups1 = exData1$groups

#Run PART (limit the number of reference data sets to decrease computing time):
res <- part(X, B=10)

#Compare predicted groups to true groups in the data set:
cbind(res$lab.hatK, groups1)

## Visualize results ##
#Transpose the data matrix such that samples are shown in columns:
tX <- t(X) 
#Cluster rows and columns using the same clustering method as applied in PART:
rowclust = hclust(dist(tX,method="euclidean"),method="average")
colclust = hclust(dist(t(tX), method="euclidean"),method="average")
#Order data matrix according to order in clustering and plot heatmap:
X2 = tX[rowclust$order, colclust$order]
par(mar=c(0,0,0,0))
plotHeatmap(X2)
#Add column-dendrogram with leaves colored according to the clusters found by PART: 
plotTreeCol(clust=colclust,groups=res$lab.hatK[colclust$order])
#Add color-bar to indicate the true clusters in the data set:
plotColorbarCol(groups=groups1[colclust$order]) 


## Example 2 ##
# Load a simulated data set with 4 clusters:
data(exData2)
Y = exData2$Y
groups2 = exData2$groups

# Run PART with default clustering method:
res2 = part(Y, B=10)

# Compare predicted groups to true groups in the data set:
cbind(res2$lab.hatK, groups2)

# Visualize results
# Cluster rows and columns using the same clustering method as applied in PART:
rowclust = hclust(dist(Y,method="euclidean"),method="average")
colclust = hclust(dist(t(Y), method="euclidean"),method="average")
# Order data matrix according to order in clustering and plot heatmap:
Y2 = Y[rowclust$order, colclust$order]
par(mar=c(0,0,0,0))
heat = plotHeatmap(Y2)
# Add row-dendrogram with leaves colored according to the clusters found by PART: 
plotTreeRow(clust=rowclust,groups=res2$lab.hatK[rowclust$order])
# Add column-dendrogram:
plotTreeCol(clust=colclust)
#Add color-bar to show the true group memberships:
plotColorbarRow(groups=groups2[rowclust$order]) 


## Some examples showing how to change clustering method and distance measure ##

#Run PART with complete linkage:
res3 <- part(Y, B=10, linkage="complete")

#Run PART with 1 minus Pearson correlation distance:
res4 <- part(Y, B=10, dist.method="cor")

#Run PART with 1 minus Spearman correlation distance:
res5 <- part(Y, B=10, dist.method="cor", cor.method="spearman")
 

clusterGenomics documentation built on May 2, 2019, 7:04 a.m.