part: Partitioning Algorithm based on Recursive Thresholding
In clusterGenomics: Identifying clusters in genomics data by recursive partitioning

Description Usage Arguments Details Value Author(s) References See Also Examples

The PART method estimates the number of clusters in a data set. It is based on recursive application of the Gap statistic and is able to discover both top-level clusters as well as subclusters nested within the main clusters.

1	part(X,Kmax=10,minSize=8,minDist=NULL,cl.lab=NULL,...)

`X`	a numeric data matrix whose rows are to be clustered using a specified clustering algorithm (default is hierarchical clustering with average linkage and Euclidean distance, see below for other options).
`Kmax`	the maximum number of clusters to be evaluated in the first (global) run.
`minSize`	the minimum number of objects required in each cluster.
`minDist`	optional stopping threshold indicating the minimum distance required between two tentative clusters considered for a tentative split. If unspecified the minimum distance is determined by the value of `q`, see below.
`cl.lab`	optional list of length `Kmax` giving vectors of cluster labels for the rows in `X` when partitioned into `1,..,Kmax` clusters.
`...`	other optional parameters. These include the parameters `B` (default 100) and `ref.gen` (default "PC") to be passed on to `gap`, as well as: `q`: the fraction of dendrogram heights (from the top) used to determine the stopping threshold; only applied if `minDist` is unspecified. Default is 0.25, set `q=1` if no stopping threshold should be applied. `Kmax.rec`: the maximum number of clusters to consider in each recursive run. Default is 5. `cl.method`: the desired clustering method. Options currently include "hclust" (default) and "kmeans". `linkage`: the desired linkage to be applied if `cl.method="hclust"`. Default is "average", see the parameter `method` in `hclust` for other options. `dist.method`: the desired distance measure to be applied if `cl.method="hclust"`. Default is "euclidean". Other options include those supported by `dist` (under `method`), "sq.euclidean" (squared Euclidean distance) and "cor" (1 minus correlation distance). `cor.method`: the correlation measure to be used if `dist.method="cor"`. Default is "pearson", see the parameter `method` in `cor` for other options. `nstart`: the number of initial center sets to be applied if `cl.method="kmeans"`. Default is 10. See `kmeans` for details on this.

PART applies the Gap statistic (Tibshirani et al., 2001) to obtain a global estimate of the number of clusters. If more than one cluster is found, the Gap statistic is re-optimized on each subset of cases corresponding to a cluster. If only one cluster is found, a tentative binary split is made and the objective function is re-optimized on the two tentative clusters. The procedure is repeated recursively until a stopping threshold is reached or the subset under evaluation has less than 2*minSize cases. Significant clusters (those discovered by Gap) are returned; a tentative cluster is only returned if significant sub-clusters were found solely in the other branch of the tentative split. See Nilsen et al. (2013, preprint) for more details.

`hatK`	the best number of clusters according PART.
`lab.hatK`	a vector of same length as the number of rows in `X` assigning a group label to each case (row) in `X` based on the best partition as evaluated by PART.
`outliers`	a vector indicating which objects are classified as outliers by PART. If no objects are classified as outliers it returns the value `NULL`.

Gro Nilsen

Nilsen et al., "Identifying clusters in genomics data by recursive partitioning", 2013 (in review)

gap, plotHeatmap

## Example 1 ##
#Load a simulated data set with 5 clusters
data(exData1)
X = exData1$X
groups1 = exData1$groups

#Run PART (limit the number of reference data sets to decrease computing time):
res <- part(X, B=10)

#Compare predicted groups to true groups in the data set:
cbind(res$lab.hatK, groups1)

## Visualize results ##
#Transpose the data matrix such that samples are shown in columns:
tX <- t(X) 
#Cluster rows and columns using the same clustering method as applied in PART:
rowclust = hclust(dist(tX,method="euclidean"),method="average")
colclust = hclust(dist(t(tX), method="euclidean"),method="average")
#Order data matrix according to order in clustering og plot heatmap:
X2 = tX[rowclust$order, colclust$order]
par(mar=c(0,0,0,0))
plotHeatmap(X2)
#Add column-dendrogram with leaves colored according to the clusters found by PART: 
plotTreeCol(clust=colclust,groups=res$lab.hatK[colclust$order])
#Add color-bar to indicate the true clusters in the data set:
plotColorbarCol(groups=groups1[colclust$order]) 


## Example 2 ##
# Load a simulated data set with 4 clusters:
data(exData2)
Y = exData2$Y
groups2 = exData2$groups

# Run PART with default clustering method:
res2 = part(Y, B=10)

# Compare predicted groups to true groups in the data set:
cbind(res2$lab.hatK, groups2)

# Visualize results
# Cluster rows and columns using the same clustering method as applied in PART:
rowclust = hclust(dist(Y,method="euclidean"),method="average")
colclust = hclust(dist(t(Y), method="euclidean"),method="average")
# Order data matrix according to order in clustering og plot heatmap:
Y2 = Y[rowclust$order, colclust$order]
par(mar=c(0,0,0,0))
heat = plotHeatmap(Y2)
# Add row-dendrogram with leaves colored according to the clusters found by PART: 
plotTreeRow(clust=rowclust,groups=res2$lab.hatK[rowclust$order])
# Add column-dendrogram:
plotTreeCol(clust=colclust)
#Add color-bar to show the true group memberships:
plotColorbarRow(groups=groups2[rowclust$order]) 


## Some examples showing how to change clustering method and distance measure ##

#Run PART with complete linkage:
res3 <- part(Y, B=10, linkage="complete")

#Run PART with 1 - Pearson correlation distance
res4 <- part(Y, B=10, dist.method="cor")

#Run PART with 1 minus Spearman correlation distance:
res5 <- part(Y, B=10, dist.method="cor", cor.method="spearman")