Cluster: Perform clustering on a single data source
In IntClust: Integrated Data Analysis via Clustering

The function Cluster performs clustering on a single source of information, i.e., one data matrix. Optionally, the gap statistic can be computed to determine the optimal number of clusters.

Cluster(Data, type = c("data", "dist"), distmeasure = "tanimoto",
  normalize = FALSE, method = NULL, clust = "agnes", linkage = "ward",
  alpha = 0.625, gap = TRUE, maxK = 50, StopRange = FALSE)

`Data`	A matrix containing the data. The rows are assumed to correspond to the objects.
`type`	Indicates whether the matrix provided in "Data" is a data matrix or a distance matrix obtained from the data. Should be one of "data" or "dist". If type="dist", the calculation of the distance matrix is skipped.
`distmeasure`	Choice of metric for the dissimilarity matrix (character). Should be one of "tanimoto", "euclidean", "jaccard", "hamming".
`normalize`	Logical. Indicates whether to normalize the distance matrices. This is recommended if different distance types are used. See `Normalization` for details.
`method`	A method of normalization. Should be one of "Quantile", "Fisher-Yates", "standardize", "Range" or any of the first letters of these names.
`clust`	Choice of clustering function (character). Defaults to "agnes".
`linkage`	Choice of inter group dissimilarity (character). Defaults to "ward".
`alpha`	The parameter alpha used in the "flexible" linkage of the agnes function. Defaults to 0.625 and is only used if the linkage is set to "flexible".
`gap`	Logical. Indicates whether the gap statistic should be computed. Setting it to FALSE greatly reduces the computation time.
`maxK`	The maximum number of clusters to be considered when computing the gap statistic.
`StopRange`	Logical. Controls whether distance matrices with values outside the range zero to one are rescaled to that range. If FALSE, the range normalization is performed (see `Normalization`). If TRUE, the distance matrices are left unchanged. Range normalization is recommended if different types of data are used, so that the distance matrices are comparable.
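
The range normalization mentioned for `StopRange` can be pictured as a min-max rescaling of the distance matrix to the interval from zero to one. The following is a minimal sketch in base R under that assumption; the helper `range_normalize` is hypothetical and not part of IntClust, and the package's own `Normalization` function may work differently.

```r
# Hypothetical helper (not part of IntClust): min-max rescale a distance
# matrix so that all of its entries fall between 0 and 1.
range_normalize <- function(dist_mat) {
  rng <- range(dist_mat)
  if (rng[1] == rng[2]) {
    # All entries are equal: nothing to rescale, return zeros
    return(dist_mat * 0)
  }
  (dist_mat - rng[1]) / (rng[2] - rng[1])
}

# Toy data (illustrative only): 5 objects described by 4 variables
set.seed(1)
x <- matrix(rnorm(20), nrow = 5)

# Euclidean distances are not bounded by 1, so they are rescaled here
d <- as.matrix(dist(x, method = "euclidean"))
d_norm <- range_normalize(d)
range(d_norm)
```

After rescaling, distance matrices built from different data types share a common scale, which is the situation the argument description has in mind.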

The optimal number of clusters is derived from the gap statistic by the criteria described in the cluster package: firstSEmax, globalSEmax, firstmax, globalmax and Tibs2001SEmax. The number of iterations is set to a default of 500. The distances implemented for the dissimilarity matrix are jaccard, tanimoto and euclidean. The jaccard distances are computed with the dist.binary(..., method=1) function of the ade4 package and the euclidean ones with the daisy function of the cluster package. The Tanimoto distances were implemented manually.
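
To illustrate what a manually implemented Tanimoto distance can look like, here is a minimal sketch for binary (0/1) matrices with objects in the rows. It assumes the usual definition, one minus the number of shared features divided by the number of features present in either object. The function `tanimoto_dist` is a hypothetical name, not the IntClust implementation, which may differ in detail.

```r
# Hypothetical helper (not the IntClust code): Tanimoto distance between
# the rows of a binary matrix.
tanimoto_dist <- function(bin_mat) {
  bin_mat <- as.matrix(bin_mat)
  shared <- bin_mat %*% t(bin_mat)              # features present in both rows
  counts <- rowSums(bin_mat)                    # features present per row
  either <- outer(counts, counts, "+") - shared # features present in either row
  sim <- shared / either
  sim[either == 0] <- 1  # assumption: two all-zero rows count as identical
  1 - sim
}

# Toy fingerprints (illustrative only)
fp <- rbind(a = c(1, 1, 0, 0),
            b = c(1, 0, 1, 0),
            c = c(1, 1, 0, 0))
D <- tanimoto_dist(fp)
D
```

Rows `a` and `c` are identical and get distance 0, while `a` and `b` share one of three features and get distance 2/3.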

The returned value is a list with two elements:

`DistM`	The distance matrix of the data matrix
`Clust`	The resulting clustering

If the gap option is set to TRUE, three more elements are added to the list. Clust_gap contains the output of the function that computes the gap statistic, and gapdata is a subset of this output. Both can be used to make plots that visualize the gap statistic. The final component is k, a matrix containing the optimal number of clusters determined by each of the criteria mentioned earlier.
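
The three gap-related elements can be pictured with the cluster package directly. The sketch below is an approximation and not the code inside Cluster: the toy data, the wrapper `agnes_k` and the exact layout of `gapdata` and `k` are assumptions made for illustration, and `B` is lowered from the default of 500 to keep the run short.

```r
library(cluster)

set.seed(42)
# Toy data (illustrative only): two well-separated groups of 15 objects
x <- rbind(matrix(rnorm(30, mean = 0), ncol = 2),
           matrix(rnorm(30, mean = 6), ncol = 2))

# Hypothetical wrapper in the form clusGap expects: it returns a list
# whose 'cluster' component holds the membership for k clusters
agnes_k <- function(x, k) {
  list(cluster = cutree(as.hclust(agnes(x, method = "ward")), k = k))
}

# Gap statistic for 1 to 5 clusters with 50 reference data sets
Clust_gap <- clusGap(x, FUNcluster = agnes_k, K.max = 5, B = 50,
                     verbose = FALSE)

# The table of gap values and their standard errors
gapdata <- as.data.frame(Clust_gap$Tab)

# Optimal number of clusters according to each criterion
criteria <- c("firstSEmax", "globalSEmax", "firstmax", "globalmax",
              "Tibs2001SEmax")
k <- sapply(criteria, function(m) {
  maxSE(gapdata$gap, gapdata$SE.sim, method = m)
})
k <- matrix(k, nrow = 1, dimnames = list("k", criteria))
k
```

Calling `plot(Clust_gap)` draws the gap curve with its standard error bars, which helps when the criteria disagree.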

For now, the only option is agglomerative hierarchical clustering as implemented in the agnes function of the cluster package.
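
For readers unfamiliar with agnes, the following sketch shows agglomerative clustering on a precomputed dissimilarity with the "ward" and "flexible" linkages. The toy data are made up, and mapping the alpha argument of Cluster onto `par.method` is an assumption about how the value is passed on, not something confirmed by the source.

```r
library(cluster)

set.seed(7)
# Toy data (illustrative only): 10 objects described by 4 variables
x <- matrix(rnorm(40), nrow = 10,
            dimnames = list(paste0("obj", 1:10), NULL))
d <- dist(x)

# Ward linkage on a precomputed dissimilarity
hc_ward <- agnes(d, diss = TRUE, method = "ward")

# Flexible linkage: par.method carries the alpha parameter
# (assumed to correspond to the alpha argument of Cluster)
hc_flex <- agnes(d, diss = TRUE, method = "flexible", par.method = 0.625)

# Cut the Ward tree into 3 clusters
groups <- cutree(as.hclust(hc_ward), k = 3)
table(groups)
```

The dendrogram can be inspected with `plot(as.hclust(hc_ward))` before choosing where to cut.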

Marijke Van Moerbeke

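library(IntClust)

# Example data sets shipped with the package
data(fingerprintMat)
data(targetMat)

# Cluster each data source separately; gap = FALSE skips the gap statistic
MCF7_F <- Cluster(fingerprintMat, type = "data", distmeasure = "tanimoto",
  normalize = FALSE, method = NULL, clust = "agnes", linkage = "ward",
  alpha = 0.625, gap = FALSE, maxK = 55, StopRange = FALSE)
MCF7_T <- Cluster(targetMat, type = "data", distmeasure = "tanimoto",
  normalize = FALSE, method = NULL, clust = "agnes", linkage = "ward",
  alpha = 0.625, gap = FALSE, maxK = 55, StopRange = FALSE)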