EnsembleClustering: Ensemble clustering


Description

EnsembleClustering provides the ensemble clustering methods CSPA, HGPA and MCLA, which are graph-based consensus methods.

Usage

EnsembleClustering(List, type = c("data", "dist", "clust"),
  distmeasure = c("tanimoto", "tanimoto"), normalize = c(FALSE, FALSE),
  method = c(NULL, NULL), clust = "agnes", linkage = c("flexible",
  "flexible"), alpha = 0.625, nrclusters = 7, gap = FALSE, maxK = 15,
  ensembleMethod = c("CSPA", "HGPA", "MCLA", "Best"), waitingtime = 300,
  file_number = 0, executable = FALSE)

Arguments

List

A list of data matrices. The rows are assumed to correspond to the objects.

type

Indicates whether the matrices provided in "List" are data matrices, distance matrices or clustering results obtained from the data. If type="dist", the calculation of the distance matrices is skipped; if type="clust", the single-source clustering is skipped. Should be one of "data", "dist" or "clust".

distmeasure

A vector of the distance measures to be used, one per data matrix. Each entry should be one of "tanimoto", "euclidean", "jaccard" or "hamming". Defaults to c("tanimoto","tanimoto").

normalize

Logical. Indicates whether to normalize the distance matrices. Defaults to c(FALSE, FALSE) for two data sets. Normalization is recommended if different distance types are used. See Normalization for more details.

method

A method of normalization. Should be one of "Quantile", "Fisher-Yates", "standardize", "Range" or any of the first letters of these names. Default is c(NULL,NULL) for two data sets.

clust

Choice of clustering function (character). Defaults to "agnes".

linkage

Choice of inter-group dissimilarity (character) for each data set. Defaults to c("flexible", "flexible") for two data sets.

alpha

The parameter alpha to be used in the "flexible" linkage of the agnes function. Defaults to 0.625 and is only used if the linkage is set to "flexible".

nrclusters

The number of clusters into which each individual dendrogram is cut. The usage shows a default of 7; a vector such as c(7,7) supplies one value per data set.

gap

Logical. Whether the optimal number of clusters should be determined with the gap statistic. Default is FALSE.

maxK

The maximal number of clusters to investigate in the gap statistic. Default is 15.

ensembleMethod

The method to be performed: "CSPA", "HGPA", "MCLA" or "Best".

waitingtime

The time in seconds to wait until the MATLAB results are generated. Defaults to 300.

file_number

The specific file number to be placed as a tag in the file generated by MATLAB. Defaults to 0.

executable

Logical. Whether the MATLAB functions are run via an executable on the command line (TRUE, only possible on Linux systems) or by calling MATLAB directly (FALSE). Defaults to FALSE. The files EnsembleClusteringC.m (CSPA), EnsembleClusteringH.m (HGPA), EnsembleClusteringM.m (MCLA) and MetisAlgorithm.m are present in the inst folder so that they can be compiled into executables.
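The arguments distmeasure, clust, linkage, alpha and nrclusters together describe a single-source clustering step that is applied to each data matrix. The following is a minimal sketch of such a step in plain R, not the package's own code. It assumes binary data, that the Tanimoto distance is one minus the ratio of shared to total features, and that alpha corresponds to the par.method argument of agnes from the cluster package. The helper tanimoto_dist and the toy matrix are hypothetical.

```r
library(cluster)  # agnes() comes from the recommended 'cluster' package

# Hypothetical helper: Tanimoto distance between the rows of a 0/1 matrix.
# Assumption: distance = 1 - (features shared) / (features in at least one row).
tanimoto_dist <- function(m) {
  m <- as.matrix(m)
  shared <- m %*% t(m)                            # features present in both rows
  counts <- rowSums(m)                            # features present in each row
  either <- outer(counts, counts, "+") - shared   # features present in either row
  sim <- ifelse(either == 0, 1, shared / either)  # two all-zero rows count as identical
  d <- 1 - sim
  dimnames(d) <- list(rownames(m), rownames(m))
  as.dist(d)
}

# Toy binary data: 20 objects (rows) described by 12 features (columns)
set.seed(1)
toy <- matrix(rbinom(20 * 12, 1, 0.4), nrow = 20,
              dimnames = list(paste0("obj", 1:20), NULL))

d <- tanimoto_dist(toy)

# Agglomerative clustering with flexible linkage; par.method plays the role of alpha
fit <- agnes(d, diss = TRUE, method = "flexible", par.method = 0.625)

# Cut the dendrogram into a fixed number of clusters (cf. nrclusters)
labels <- cutree(as.hclust(fit), k = 7)
table(labels)
```

With type="dist" the distance step would be skipped, and with type="clust" the whole step would be skipped, so only the consensus stage remains.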

Details

Strehl and Ghosh (2002) introduce three heuristic algorithms to solve the cluster ensemble problem. Each method starts by transforming the clustering solutions into a single hypergraph in which a hyperedge represents a single cluster.

The Cluster-based Similarity Partitioning Algorithm (CSPA) transforms the hypergraph into an overall similarity matrix whose entries represent the fraction of clusterings in which two objects are in the same cluster. The similarity matrix is treated as a graph and the objects are reclustered with the graph partitioning algorithm METIS (Karypis and Kumar, 1998).

The Hyper-Graph Partitioning Algorithm (HGPA) partitions the hypergraph directly by cutting a minimal number of hyperedges. It aims to obtain connected components of approximately the same size. The partitioning algorithm is HMetis (Karypis et al., 1997).

The Meta-CLustering Algorithm (MCLA) computes a similarity between the hyperedges (clusters) based on the Jaccard index. The resulting similarity matrix is used to build a meta-graph, which is partitioned by the METIS algorithm (Karypis and Kumar, 1998) into meta-clusters. The final partition of the objects is obtained by assigning each object to the meta-cluster in which it occurs most often.

The R code calls the MATLAB code provided by Strehl and Ghosh (2002). The MATLAB functions are included in the inst folder and should be located in the working directory. A shell script for the executable can be found in the inst folder as well.
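To make the hypergraph representation concrete, the sketch below builds it for three toy clusterings in base R and derives the CSPA-style similarity matrix and the MCLA-style Jaccard similarities between hyperedges. It is an illustration only: the toy labels and all variable names are made up, and the graph partitioning step (METIS or HMetis), which the package delegates to MATLAB, is not reproduced.

```r
# Three toy clusterings of six objects (label values are arbitrary per clustering)
clusterings <- list(
  c(1, 1, 1, 2, 2, 2),
  c(1, 1, 2, 2, 3, 3),
  c(2, 2, 2, 1, 1, 1)
)

# Hypergraph as a binary indicator matrix:
# rows are objects, each column (hyperedge) is one cluster of one clustering
H <- do.call(cbind, lapply(clusterings, function(cl) {
  sapply(sort(unique(cl)), function(k) as.integer(cl == k))
}))

# CSPA-style similarity: fraction of clusterings in which two objects share a cluster
S <- (H %*% t(H)) / length(clusterings)

# MCLA-style similarity: Jaccard index between every pair of hyperedges
inter <- t(H) %*% H                  # objects shared by two hyperedges
sizes <- colSums(H)                  # number of objects in each hyperedge
J <- inter / (outer(sizes, sizes, "+") - inter)

round(S, 2)
round(J, 2)
```

In this toy case objects 1 and 2 are grouped together in all three clusterings, so their CSPA similarity is 1, while objects 1 and 4 never share a cluster and get a similarity of 0.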

Value

The returned value is a list of two elements:

DistM

A list with the distance matrix for each data structure

Clust

The resulting clustering

The value has class 'Ensemble'.

References

Strehl, A. and Ghosh, J. (2002) [Strehl2002IntClust]

Karypis, G. et al. (1997) [Karypis1997IntClust]

Karypis, G. and Kumar, V. (1998) [Karypis1998IntClust]

Examples

## Not run: 
data(fingerprintMat)
data(targetMat)
L=list(fingerprintMat,targetMat)

MCF7_CSPA=EnsembleClustering(List=L,type="data",distmeasure=c("tanimoto",
"tanimoto"),normalize=c(FALSE,FALSE),method=c(NULL,NULL),
clust="agnes",linkage=c("flexible","flexible"),nrclusters=c(7,7),gap=FALSE,
maxK=15,ensembleMethod="CSPA",executable=FALSE)


## End(Not run)
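The gap and maxK arguments refer to choosing the number of clusters with the gap statistic. How the package computes it is not shown on this page. As a rough sketch under that caveat, the clusGap and maxSE functions of the cluster package can be combined with a flexible-linkage agnes fit. The toy data, the wrapper hier_cut and the choice of B = 20 reference sets are assumptions made for illustration.

```r
library(cluster)

set.seed(42)
# Toy data: three well-separated Gaussian groups in two dimensions
x <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)

# clusGap() expects a function(x, k) that returns a list with a 'cluster' component
hier_cut <- function(x, k) {
  fit <- agnes(x, method = "flexible", par.method = 0.625)
  list(cluster = cutree(as.hclust(fit), k = k))
}

# K.max plays the role of maxK; B is the number of reference data sets
gap <- clusGap(x, FUNcluster = hier_cut, K.max = 6, B = 20, verbose = FALSE)

# Select the number of clusters with the selection rule named "Tibs2001SEmax"
best_k <- maxSE(gap$Tab[, "gap"], gap$Tab[, "SE.sim"], method = "Tibs2001SEmax")
best_k
```

The wrapper refits the full hierarchy for every k, which is wasteful but keeps the sketch short.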

IntClust documentation built on May 2, 2019, 5:51 a.m.