SelectnrClusters: Determines an optimal number of clusters based on silhouette...


Description

The function SelectnrClusters determines an optimal number of clusters by calculating silhouette widths for a sequence of numbers of clusters. A more elaborate description follows.

If the objects provided in List are data or distance matrices, clustering around medoids is performed with the pam function of the cluster package. The average silhouette widths are then retrieved from the resulting pam objects. A silhouette width represents how well an object lies in its current cluster. Values around one indicate an appropriate clustering, while values around zero show that the object might as well lie in the neighbouring cluster. The average silhouette width is a measure of how tightly grouped the data are. This is computed for every number of clusters and for every object provided in List. The average is then taken over the provided objects for every number of clusters, which results in one average value per number of clusters. The number with the maximal average silhouette width is chosen as the optimal number of clusters.
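The procedure described above can be sketched with the cluster package alone. This is a minimal illustration, not the IntClust implementation: the toy matrices, the Euclidean distance and the names sources, ks, avg_sil, overall and best_k are all assumptions made for the example.

```r
library(cluster)

set.seed(1)

# Two hypothetical data sources describing the same 60 objects (rows)
grp <- rep(1:3, each = 20)
X1 <- matrix(rnorm(60 * 4, mean = grp * 4), nrow = 60)
X2 <- matrix(rnorm(60 * 3, mean = grp * 4), nrow = 60)
sources <- list(S1 = X1, S2 = X2)

# Candidate numbers of clusters (silhouettes need at least 2 clusters)
ks <- 2:6

# Average silhouette width per number of clusters (rows) and source (columns)
avg_sil <- sapply(sources, function(x) {
  d <- dist(x)  # Euclidean distance, chosen for illustration only
  sapply(ks, function(k) pam(d, k = k, diss = TRUE)$silinfo$avg.width)
})
rownames(avg_sil) <- ks

# Average over the sources, then pick the number with the maximal value
overall <- rowMeans(avg_sil)
best_k <- ks[which.max(overall)]
```

Here avg_sil plays the role of the per-object silhouette widths, overall holds the average per number of clusters, and best_k is the selected number.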

Usage

SelectnrClusters(List, type = c("data", "dist", "pam"),
  distmeasure = c("tanimoto", "tanimoto"), normalize = c(FALSE, FALSE),
  method = c(NULL, NULL), nrclusters = seq(5, 25, 1), names = NULL,
  StopRange = FALSE, plottype = "new", location = NULL)

Arguments

List

A list of data matrices. It is assumed that the rows correspond to the objects.

type

Indicates whether the matrices provided in "List" are data matrices, distance matrices or clustering results obtained from the data. If type="dist", the calculation of the distance matrices is skipped, and if type="pam", the single source clustering is skipped. Type should be one of "data", "dist" or "pam".

distmeasure

A vector of the distance measures to be used on each data matrix. Should be one of "tanimoto", "euclidean", "jaccard", "hamming". Defaults to c("tanimoto","tanimoto").
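For readers unfamiliar with the default measure, the sketch below computes a Tanimoto distance on binary data using the common definition: one minus the number of shared bits divided by the number of bits set in either object. The helper tanimoto_dist and the toy matrix fp are hypothetical, and whether IntClust computes the measure in exactly this way is an assumption.

```r
# Tanimoto distance between the rows of a binary (0/1) matrix.
# Hypothetical helper for illustration; not taken from IntClust.
tanimoto_dist <- function(m) {
  m <- as.matrix(m)
  common <- m %*% t(m)                         # shared 1-bits for each pair of rows
  bits <- rowSums(m)                           # number of 1-bits per row
  n_union <- outer(bits, bits, "+") - common   # bits set in either row
  sim <- ifelse(n_union == 0, 1, common / n_union)  # two empty rows count as identical
  1 - sim
}

# Toy fingerprints: rows are objects, columns are binary features
fp <- rbind(a = c(1, 1, 0, 0),
            b = c(1, 0, 1, 0),
            c = c(1, 1, 0, 0))
d <- tanimoto_dist(fp)
```

Objects a and c have identical fingerprints, so their distance is zero; a and b share one of three set bits, giving a distance of 2/3.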

normalize

Logical. Indicates whether to normalize the distance matrices or not, defaults to c(FALSE, FALSE) for two data sets. This is recommended if different distance types are used. More details on normalization in Normalization.

method

A method of normalization. Should be one of "Quantile","Fisher-Yates", "standardize","Range" or any of the first letters of these names. Default is c(NULL,NULL) for two data sets.

nrclusters

A sequence of numbers of clusters to cut the dendrogram in. Default is a sequence of 5 to 25.

names

The labels to give to the elements in List. Default is NULL.

StopRange

Logical. Indicates whether distance matrices with values outside the range zero to one should be left unchanged instead of being rescaled to that range. If FALSE, the range normalization is performed. See Normalization. If TRUE, the distance matrices are not changed. Rescaling is recommended if different types of data are used, so that these are comparable. Default is FALSE.
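As an illustration of what rescaling a distance matrix to the range zero to one can look like, the sketch below applies a plain min-max rescaling. The helper range_normalize is hypothetical, and the assumption that the range normalization in Normalization amounts to min-max rescaling has not been checked against the package source.

```r
# Min-max rescaling of a distance matrix to [0, 1].
# Hypothetical helper; an assumption about what "range normalization" means.
range_normalize <- function(d) {
  d <- as.matrix(d)
  rng <- range(d)
  if (rng[1] == rng[2]) return(d * 0)  # constant matrix: nothing to rescale
  (d - rng[1]) / (rng[2] - rng[1])
}

# Toy distances that exceed one
d <- as.matrix(dist(c(0, 2, 10)))
dn <- range_normalize(d)
```

After rescaling, the smallest distance in dn is zero and the largest is one, so matrices built from different measures end up on a comparable scale.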

plottype

Should be one of "pdf","new" or "sweave". If "pdf", a location should be provided in "location" and the figure is saved there. If "new" a new graphic device is opened and if "sweave", the figure is made compatible to appear in a sweave or knitr document, i.e. no new device is opened and the plot appears in the current device or document. Default is "new".

location

If plottype is "pdf", a location should be provided in "location" and the figure is saved there. Default is NULL.

Value

A plot is made showing the average silhouette widths of the provided objects for each number of clusters. Further, a list with two elements is returned:

Silhouette_Widths

A data frame with the silhouette widths for each object and the average silhouette widths per number of clusters.

Optimal_Nr_of_CLusters

The determined optimal number of clusters.

Examples

## Not run: 
data(fingerprintMat)
data(targetMat)

L=list(fingerprintMat,targetMat)

NrClusters=SelectnrClusters(List=L,type="data",distmeasure=c("tanimoto",
"tanimoto"),nrclusters=seq(5,10),normalize=c(FALSE,FALSE),method=c(NULL,NULL),
names=c("FP","TP"),StopRange=FALSE,plottype="new",location=NULL)

NrClusters

## End(Not run)

IntClust documentation built on May 2, 2019, 5:51 a.m.