Cluster: Single source clustering


Description

The function Cluster performs clustering on a single source of information, i.e., one data matrix. Optionally, the gap statistic can be computed to determine the optimal number of clusters.

Usage

Cluster(Data, type = c("data", "dist"), distmeasure = "tanimoto",
  normalize = FALSE, method = NULL, clust = "agnes",
  linkage = "flexible", alpha = 0.625, gap = TRUE, maxK = 15,
  StopRange = TRUE)

Arguments

Data

A matrix containing the data. The rows are assumed to correspond to the objects.

type

Indicates whether the matrix provided in "Data" is a data matrix or a distance matrix obtained from the data. If type="dist", the calculation of the distance matrix is skipped. Should be one of "data" or "dist".

distmeasure

Choice of metric for the dissimilarity matrix (character). Should be one of "tanimoto", "euclidean", "jaccard", "hamming". Default is "tanimoto".

normalize

Logical. Indicates whether to normalize the distance matrices or not. Default is FALSE. Normalization is recommended if different distance types are used. See Normalization for more details.

method

A method of normalization. Should be one of "Quantile", "Fisher-Yates", "standardize", "Range" or any of the first letters of these names. Default is NULL.

clust

Choice of clustering function (character). Defaults to "agnes". For now, the only option is agglomerative hierarchical clustering as implemented in the agnes function of the cluster package.

linkage

Choice of inter-group dissimilarity (character). Defaults to "flexible".

alpha

The parameter alpha to be used in the "flexible" linkage of the agnes function. Defaults to 0.625 and is only used if the linkage is set to "flexible".

gap

Logical. Whether the optimal number of clusters should be determined with the gap statistic. Default is TRUE.

maxK

The maximal number of clusters to investigate in the gap statistic. Default is 15.

StopRange

Logical. Indicates whether distance matrices with values not between zero and one should be left as they are rather than rescaled to that range. If FALSE, range normalization is performed; see Normalization. If TRUE, the distance matrices are not changed. This is recommended if different types of data are used such that these are comparable. Default is TRUE.
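The range normalization mentioned under StopRange can be pictured as a min-max rescaling of the distance matrix. The sketch below is only an illustration in base R: range_normalize is a hypothetical helper, not a function of IntClust, and the package's own Normalization routine may differ in its details.

```r
# Hypothetical helper (not part of IntClust): rescale a distance matrix to [0, 1].
range_normalize <- function(distM) {
  rng <- range(distM, na.rm = TRUE)
  if (rng[2] == rng[1]) {
    # All entries are equal: nothing to rescale, return zeros of the same shape
    return(distM * 0)
  }
  (distM - rng[1]) / (rng[2] - rng[1])
}

set.seed(1)
x <- matrix(rnorm(20), nrow = 5)   # toy data: 5 objects, 4 features
d <- as.matrix(dist(x))            # Euclidean distances, generally not within [0, 1]
d_scaled <- range_normalize(d)
range(d_scaled)                    # smallest and largest rescaled distance
```

After rescaling, distance matrices computed with different measures share the same range, which is what makes them comparable.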

Details

The optimal number of clusters under the gap statistic is determined by the criteria described in the cluster package: firstSEmax, globalSEmax, firstmax, globalmax and Tibs2001SEmax. The number of iterations is set to a default of 500. The implemented distances to be used for the dissimilarity matrix are jaccard, tanimoto and euclidean. The jaccard distances are computed with the dist.binary(..., method = 1) function in the ade4 package and the euclidean ones with the daisy function, also in the cluster package. The Tanimoto distances were implemented manually.
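As an illustration of the steps described here, the following sketch computes Tanimoto distances on a small binary matrix and passes them to agnes with the flexible linkage. It assumes the usual definition of the Tanimoto coefficient for binary data (shared features divided by features present in either object); tanimoto_dist and the toy matrix fp are hypothetical and are not taken from the IntClust source, whose manual implementation may differ.

```r
library(cluster)  # provides agnes()

# Hypothetical helper: Tanimoto distance between the rows of a binary (0/1) matrix.
tanimoto_dist <- function(m) {
  m <- as.matrix(m)
  common <- tcrossprod(m)                          # features shared by each pair of rows
  counts <- rowSums(m)                             # features present in each row
  union  <- outer(counts, counts, "+") - common    # features present in either row
  sim <- ifelse(union == 0, 1, common / union)     # two empty rows are treated as identical
  d <- 1 - sim
  dimnames(d) <- list(rownames(m), rownames(m))
  d
}

# Toy fingerprint matrix: 4 objects (rows) by 4 binary features (columns)
fp <- rbind(a = c(1, 1, 0, 0),
            b = c(1, 1, 1, 0),
            c = c(0, 0, 1, 1),
            d = c(0, 0, 0, 1))

distM <- tanimoto_dist(fp)

# Agglomerative clustering on the distance matrix with flexible linkage, alpha = 0.625
fit <- agnes(as.dist(distM), diss = TRUE, method = "flexible", par.method = 0.625)

# Cut the tree into two clusters
groups <- cutree(as.hclust(fit), k = 2)
groups
```

Passing diss = TRUE tells agnes that the input is already a dissimilarity, which mirrors the type="dist" case of Cluster, where the distance calculation is skipped.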

Value

The returned value is a list with two elements:

DistM

The distance matrix of the data matrix

Clust

The resulting clustering

If gap = TRUE, three more elements are added to the list. Clust_gap contains the output of the function that computes the gap statistic, and gapdata is a subset of this output; both can be used to plot the gap statistic. The final component, k, is a matrix containing the optimal number of clusters determined by each of the criteria listed in Details.
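To give an idea of how an optimal number of clusters per criterion can be obtained, the sketch below uses clusGap and maxSE from the cluster package directly on toy data. It is not the code of Cluster itself: the wrapper agnes_cut, the simulated data and the reduced number of iterations (B = 50 instead of the 500 mentioned in Details) are assumptions made for illustration, and the layout of the k element returned by Cluster may differ from the named vector produced here.

```r
library(cluster)  # provides agnes(), clusGap() and maxSE()

set.seed(42)
# Toy data: two well-separated groups of 20 points in two dimensions
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.3), ncol = 2))

# clusGap() expects a function(x, k) that returns a list with a `cluster` component
agnes_cut <- function(x, k) {
  fit <- agnes(x, method = "flexible", par.method = 0.625)
  list(cluster = cutree(as.hclust(fit), k = k))
}

# Gap statistic for 1 to 5 clusters, with B reference data sets
gap_fit <- clusGap(x, FUNcluster = agnes_cut, K.max = 5, B = 50)

# Optimal number of clusters according to each criterion of maxSE()
criteria <- c("firstSEmax", "globalSEmax", "firstmax", "globalmax", "Tibs2001SEmax")
k_by_criterion <- sapply(criteria, function(crit) {
  maxSE(gap_fit$Tab[, "gap"], gap_fit$Tab[, "SE.sim"], method = crit)
})
k_by_criterion
```

The Tab component of the clusGap result holds the gap values and their standard errors, which is the kind of information that can be plotted to visualize the gap statistic.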

Examples

library(IntClust)

# Load the example data sets
data(fingerprintMat)
data(targetMat)

# Cluster each data source on Tanimoto distances, without the gap statistic
MCF7_F <- Cluster(fingerprintMat, type = "data", distmeasure = "tanimoto",
  normalize = FALSE, method = NULL, clust = "agnes", linkage = "flexible",
  alpha = 0.625, gap = FALSE, maxK = 55, StopRange = FALSE)
MCF7_T <- Cluster(targetMat, type = "data", distmeasure = "tanimoto",
  normalize = FALSE, method = NULL, clust = "agnes", linkage = "flexible",
  alpha = 0.625, gap = FALSE, maxK = 55, StopRange = FALSE)

IntClust documentation built on May 2, 2019, 5:51 a.m.