ClusteringAggregation: Clustering aggregation

Description

ClusteringAggregation implements the ensemble clustering methods Balls, Agglomerative (Aggl.) and Furthest, which are graph-based consensus methods.

Usage

ClusteringAggregation(List, type = c("data", "dist", "clust"),
  distmeasure = c("tanimoto", "tanimoto"), normalize = c(FALSE, FALSE),
  method = c(NULL, NULL), clust = "agnes", linkage = c("flexible",
  "flexible"), alpha = 0.625, nrclusters = c(7, 7), gap = FALSE,
  maxK = 15, agglMethod = c("Balls", "Aggl", "Furthest", "LocalSearch"),
  improve = TRUE, distThresh_B = 0.5, distThresh_A = 0.8)

Arguments

List

A list of data matrices. The rows are assumed to correspond to the objects.

type

Indicates whether the matrices provided in "List" are data matrices, distance matrices or clustering results obtained from the data. If type="dist", the calculation of the distance matrices is skipped; if type="clusters", the single-source clustering is skipped. Should be one of "data", "dist" or "clusters".

distmeasure

A vector of the distance measures to be used, one for each data matrix. Each element should be one of "tanimoto", "euclidean", "jaccard" or "hamming". Defaults to c("tanimoto","tanimoto").

normalize

Logical. Indicates whether to normalize the distance matrices. Defaults to c(FALSE, FALSE) for two data sets. Normalization is recommended if different distance types are used. See Normalization for more details.

method

The normalization method. Should be one of "Quantile", "Fisher-Yates", "standardize", "Range" or any of the first letters of these names. Default is c(NULL,NULL) for two data sets.

clust

Choice of clustering function (character). Defaults to "agnes".

linkage

Choice of inter-group dissimilarity (character) for each data set. Defaults to c("flexible", "flexible") for two data sets.

alpha

The parameter alpha to be used in the "flexible" linkage of the agnes function. Defaults to 0.625 and is only used if the linkage is set to "flexible".

nrclusters

The number of clusters into which each individual dendrogram is cut. Default is c(7,7) for two data sets.

gap

Logical. Whether the optimal number of clusters should be determined with the gap statistic. Default is FALSE.

maxK

The maximal number of clusters to investigate in the gap statistic. Default is 15.

agglMethod

The method to be performed: "Balls","Aggl","Furthest" or "LocalSearch".

improve

Logical. If TRUE, a local search is performed to improve the obtained results. Default is TRUE.

distThresh_B

A distance threshold for the Balls algorithm. Default is 0.5.

distThresh_A

A distance threshold for the Aggl. algorithm. Default is 0.8.
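
The single-source clustering step governed by clust, linkage, alpha and nrclusters can be pictured with the short sketch below. It is only an illustration under stated assumptions, not the package's internal code: the toy matrix (toyMat) is made up, and Euclidean distances stand in for whichever measure is selected through distmeasure. The calls to agnes() from the cluster package and to cutree() are standard R.

```r
# Sketch only: mimics the per-source clustering described by the arguments
# above; it is not the internal code of ClusteringAggregation.
library(cluster)  # provides agnes()

set.seed(1)
# Hypothetical binary data: 20 objects (rows) by 30 features (columns).
toyMat <- matrix(rbinom(20 * 30, 1, 0.3), nrow = 20)

# Euclidean distances stand in for the measure chosen via 'distmeasure'.
toyDist <- dist(toyMat, method = "euclidean")

# clust = "agnes", linkage = "flexible", alpha = 0.625
toyTree <- agnes(toyDist, diss = TRUE, method = "flexible", par.method = 0.625)

# nrclusters = 7: cut the dendrogram into seven clusters.
toyLabels <- cutree(as.hclust(toyTree), k = 7)
table(toyLabels)
```

One such label vector per data source is the kind of input that the consensus step then combines.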

Details

Gionis et al. (2007) propose heuristic algorithms to find a solution to the consensus problem. In a first step, a weighted graph is built on the objects, where the weight between two vertices is the fraction of clusterings that place the two vertices in different clusters. In a second step, an algorithm searches for the partition that minimizes the total number of disagreements with the given partitions.

The Balls algorithm is an iterative process that finds one cluster of the consensus partition in each iteration. For each object $i$, all objects at a distance of at most 0.5 are collected and the average distance of this set to object $i$ is calculated. If the average distance is less than or equal to a parameter $\alpha$, the collected objects form a cluster; otherwise object $i$ forms a singleton. This $\alpha$ belongs to the Balls algorithm and is distinct from the alpha argument, which is only used by the "flexible" linkage.

The Agglomerative (Aggl.) algorithm starts by considering every object as a singleton cluster. Next, the two closest clusters are merged if the average distance between them is less than 0.5. If no two clusters have an average distance smaller than 0.5, the algorithm stops and returns the created clusters as the solution.

The Furthest algorithm starts by placing all objects into a single cluster. In each iteration, the pair of objects that are furthest apart is taken as new cluster centers. The remaining objects are assigned to the center that increases the cost of the partition the least, and the new cost is computed. The cost is the sum of all distances between the obtained partition and the partitions in the ensemble. The iteration continues until the cost of the new partition is higher than that of the previous partition.
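
The graph construction and the agglomerative procedure described above can be sketched in a few lines of base R. This is a minimal illustration written from the description only, not the code used by ClusteringAggregation. The helper names (disagreementDist, agglConsensus, disagreementCost) and the toy clusterings are hypothetical, and the threshold of 0.5 is the value quoted in the text; how it relates to the distThresh_A argument is not confirmed here.

```r
# Sketch only, following the description above; not the package's code.

# Distance between two objects: fraction of clusterings that separate them.
disagreementDist <- function(labelings) {
  n <- length(labelings[[1]])
  D <- matrix(0, n, n)
  for (lab in labelings) {
    D <- D + outer(lab, lab, FUN = "!=")
  }
  D / length(labelings)
}

# Agglomerative consensus: start from singletons and merge the two closest
# clusters while their average between-cluster distance is below 'thresh'.
agglConsensus <- function(D, thresh = 0.5) {
  clusters <- as.list(seq_len(nrow(D)))
  while (length(clusters) > 1) {
    best <- c(NA, NA)
    bestDist <- Inf
    for (i in seq_len(length(clusters) - 1)) {
      for (j in (i + 1):length(clusters)) {
        d <- mean(D[clusters[[i]], clusters[[j]]])
        if (d < bestDist) {
          bestDist <- d
          best <- c(i, j)
        }
      }
    }
    if (bestDist >= thresh) break  # no pair is close enough: stop
    clusters[[best[1]]] <- c(clusters[[best[1]]], clusters[[best[2]]])
    clusters[[best[2]]] <- NULL
  }
  labels <- integer(nrow(D))
  for (k in seq_along(clusters)) labels[clusters[[k]]] <- k
  labels
}

# Total number of pairwise disagreements between a consensus partition
# and the partitions in the ensemble.
disagreementCost <- function(consensus, labelings) {
  same <- outer(consensus, consensus, FUN = "==")
  total <- 0
  for (lab in labelings) {
    sameLab <- outer(lab, lab, FUN = "==")
    total <- total + sum((same != sameLab)[upper.tri(same)])
  }
  total
}

# Hypothetical ensemble: three clusterings of six objects.
ensemble <- list(c(1, 1, 1, 2, 2, 2),
                 c(1, 1, 2, 2, 3, 3),
                 c(1, 1, 1, 2, 2, 2))
D <- disagreementDist(ensemble)
consensus <- agglConsensus(D)
consensus                              # 1 1 1 2 2 2
disagreementCost(consensus, ensemble)  # 5
```

In this toy ensemble, objects 3 and 4 are separated by two of the three clusterings (distance 2/3), so they are never merged and the consensus follows the majority.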

Value

The returned value is a list of two elements:

DistM

A NULL object

Clust

The resulting clustering

The value has class 'Ensemble'.

References

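Gionis A, Mannila H, Tsaparas P (2007). Clustering aggregation. ACM Transactions on Knowledge Discovery from Data, 1(1), Article 4.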

Examples

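library(IntClust)

# Load the example fingerprint and target matrices
data(fingerprintMat)
data(targetMat)
L <- list(fingerprintMat, targetMat)

# Consensus clustering of both data sources with the agglomerative method
MCF7_Aggl <- ClusteringAggregation(List = L, type = "data",
  distmeasure = c("tanimoto", "tanimoto"), normalize = c(FALSE, FALSE),
  method = c(NULL, NULL), clust = "agnes", linkage = c("flexible", "flexible"),
  alpha = 0.625, nrclusters = c(7, 7), gap = FALSE, maxK = 15,
  agglMethod = "Aggl", improve = TRUE, distThresh_B = 0.5, distThresh_A = 0.8)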

IntClust documentation built on May 2, 2019, 5:51 a.m.