The PART method estimates the number of clusters in a data set. It is based on recursive application of the Gap statistic and is able to discover both top-level clusters as well as subclusters nested within the main clusters.
part(X, Kmax, minSize, minDist, cl.lab, ...)
X: a numeric data matrix whose rows are to be clustered using a specified clustering algorithm (the default is hierarchical clustering with average linkage and Euclidean distance; see below for other options).
Kmax: the maximum number of clusters to be evaluated in the first (global) run.
minSize: the minimum number of objects required in each cluster.
minDist: optional stopping threshold giving the minimum distance required between two tentative clusters considered for a tentative split. If unspecified, the minimum distance is determined by the value of
cl.lab: optional list of length
...: other optional parameters. These include B (the number of reference data sets used when computing the Gap statistic) and the clustering options linkage, dist.method and cor.method, all of which are demonstrated in the Examples below.
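A minimal call spelling out the main arguments might look like the following (the argument values are chosen for illustration and are not the package defaults):

```r
# Illustrative call with explicit arguments (values are examples, not defaults):
res <- part(X, Kmax = 8, minSize = 5)
```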
PART applies the Gap statistic (Tibshirani et al., 2001) to obtain a global estimate of the number of clusters. If more than one cluster is found, the Gap statistic is re-optimized on each subset of cases corresponding to a cluster. If only one cluster is found, a tentative binary split is made and the objective function is re-optimized on the two tentative clusters. The procedure is repeated recursively until a stopping threshold is reached or the subset under evaluation has fewer than 2*minSize cases. Significant clusters (those discovered by the Gap statistic) are returned; a tentative cluster is only returned if significant sub-clusters were found solely in the other branch of the tentative split. See Nilsen et al. (2013, preprint) for more details.
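The recursion described above can be sketched as follows. This is a simplified illustration, not the package's implementation: `estimateGapK` and `partSketch` are hypothetical helpers written for this sketch, the Gap estimate is delegated to `cluster::clusGap`, and the tentative-split bookkeeping described above is omitted.

```r
library(cluster)  # for clusGap() and maxSE()

# Hypothetical helper: Gap-optimal number of clusters for a data subset,
# using hierarchical clustering with average linkage (as in PART's default).
estimateGapK <- function(Z, Kmax, B = 10) {
  hclusCut <- function(x, k) {
    list(cluster = cutree(hclust(dist(x), method = "average"), k = k))
  }
  gs <- clusGap(Z, FUNcluster = hclusCut, K.max = Kmax, B = B)
  maxSE(gs$Tab[, "gap"], gs$Tab[, "SE.sim"])
}

# Simplified sketch of the recursion (tentative binary splits omitted):
partSketch <- function(X, idx = seq_len(nrow(X)), Kmax = 10, minSize = 8) {
  if (length(idx) < 2 * minSize) return(list(idx))      # stop: subset too small
  K <- estimateGapK(X[idx, , drop = FALSE],
                    min(Kmax, length(idx) - 1))          # Gap estimate on subset
  if (K == 1) return(list(idx))                          # (real PART tries a split here)
  hc   <- hclust(dist(X[idx, , drop = FALSE]), method = "average")
  labs <- cutree(hc, k = K)
  # Recurse on each discovered cluster and collect the resulting leaves:
  unlist(lapply(split(idx, labs),
                function(sub) partSketch(X, sub, Kmax, minSize)),
         recursive = FALSE)
}
```

The returned value is a list of row-index vectors, one per leaf cluster; the real `part` function additionally evaluates tentative splits and reports labels and outliers as described under Value.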
hatK: the best number of clusters according to PART.
lab.hatK: a vector of the same length as the number of rows in X, giving the estimated cluster label of each row.
outliers: a vector indicating which objects are classified as outliers by PART. If no objects are classified as outliers, it returns the value
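After a call such as `res <- part(X)`, these components can be inspected directly (a usage sketch, assuming `X` is a suitable data matrix):

```r
res$hatK             # estimated number of clusters
table(res$lab.hatK)  # size of each estimated cluster
res$outliers         # objects flagged as outliers, if any
```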
Gro Nilsen
Nilsen et al., "Identifying clusters in genomics data by recursive partitioning", 2013 (in review)
## Example 1 ##
#Load a simulated data set with 5 clusters
data(exData1)
X = exData1$X
groups1 = exData1$groups
#Run PART (limit the number of reference data sets to decrease computing time):
res <- part(X, B=10)
#Compare predicted groups to true groups in the data set:
cbind(res$lab.hatK, groups1)
## Visualize results ##
#Transpose the data matrix such that samples are shown in columns:
tX <- t(X)
#Cluster rows and columns using the same clustering method as applied in PART:
rowclust = hclust(dist(tX,method="euclidean"),method="average")
colclust = hclust(dist(t(tX), method="euclidean"),method="average")
#Order the data matrix according to the clustering order and plot the heatmap:
X2 = tX[rowclust$order, colclust$order]
par(mar=c(0,0,0,0))
plotHeatmap(X2)
#Add column-dendrogram with leaves colored according to the clusters found by PART:
plotTreeCol(clust=colclust,groups=res$lab.hatK[colclust$order])
#Add color-bar to indicate the true clusters in the data set:
plotColorbarCol(groups=groups1[colclust$order])
## Example 2 ##
# Load a simulated data set with 4 clusters:
data(exData2)
Y = exData2$Y
groups2 = exData2$groups
# Run PART with default clustering method:
res2 = part(Y, B=10)
# Compare predicted groups to true groups in the data set:
cbind(res2$lab.hatK, groups2)
# Visualize results
# Cluster rows and columns using the same clustering method as applied in PART:
rowclust = hclust(dist(Y,method="euclidean"),method="average")
colclust = hclust(dist(t(Y), method="euclidean"),method="average")
# Order the data matrix according to the clustering order and plot the heatmap:
Y2 = Y[rowclust$order, colclust$order]
par(mar=c(0,0,0,0))
heat = plotHeatmap(Y2)
# Add row-dendrogram with leaves colored according to the clusters found by PART:
plotTreeRow(clust=rowclust,groups=res2$lab.hatK[rowclust$order])
# Add column-dendrogram:
plotTreeCol(clust=colclust)
#Add color-bar to show the true group memberships:
plotColorbarRow(groups=groups2[rowclust$order])
## Some examples showing how to change clustering method and distance measure ##
#Run PART with complete linkage:
res3 <- part(Y, B=10, linkage="complete")
#Run PART with 1 minus Pearson correlation distance:
res4 <- part(Y, B=10, dist.method="cor")
#Run PART with 1 minus Spearman correlation distance:
res5 <- part(Y, B=10, dist.method="cor", cor.method="spearman")