Cluster distance matrix from subsampling

Description

Given an n x n matrix of distances, these functions try to find clusters using the given clustering function. cluster01 and clusterK are internal functions, and clusterD is a wrapper around them that provides a simpler user interface. Users are not expected to call cluster01 and clusterK directly, except to ease debugging of user-defined clustering functions.

Usage

clusterD(x = NULL, diss = NULL, clusterFunction = c("hierarchical01",
  "tight", "pam", "hierarchicalK"), typeAlg = c("01", "K"),
  distFunction = NA, minSize = 1, orderBy = c("size", "best"),
  format = c("vector", "list"), clusterArgs = NULL, checkArgs = TRUE,
  returnD = FALSE, ...)

cluster01(diss, clusterFunction = c("hierarchical01", "tight"), alpha = 0.1,
  clusterArgs = NULL, checkArgs)

clusterK(diss, clusterFunction = c("pam", "hierarchicalK"),
  findBestK = FALSE, k, kRange, removeSil = FALSE, silCutoff = 0,
  clusterArgs = NULL, checkArgs)

Arguments

x

p x n data matrix on which to run the clustering (samples in columns).

diss

n x n data matrix of dissimilarities between the samples on which to run the clustering

clusterFunction

a function that clusters an n x n matrix of dissimilarities/distances. Character values can also be given to select the internal wrapper functions for the default methods. See Details for the arguments the function must take and the format it must return.

typeAlg

character value of either '01' or 'K' determining whether the function given in clusterFunction should be called by clusterK or cluster01. Only used if clusterFunction is a user-defined function. Otherwise, for methods provided by the package (i.e. by user setting clusterFunction to a character value) clusterD will determine the appropriate input for 'typeAlg' and will ignore user input.

distFunction

a distance function to be applied to x. Only relevant if the input is a matrix of data (x) rather than a matrix of distances (diss). See Details.

minSize

the minimum number of samples in a cluster. Clusters found below this size will be discarded and samples in the cluster will be given a cluster assignment of "-1" to indicate that they were not clustered.

orderBy

how to order the clusters (either by size or by maximum alpha value).

format

whether to return a list of indices in a cluster or a vector of clustering assignments. List is mainly for compatibility with sequential part.

clusterArgs

arguments to be passed directly to the clusterFunction, beyond the required input.

checkArgs

logical; whether to give a warning when arguments are given that do not match the chosen clustering method. If FALSE, inapplicable arguments are ignored without warning.

returnD

logical as to whether to return the D matrix in output.

...

arguments given to clusterD to be passed to cluster01 or clusterK (depending on the value of typeAlg). Examples include 'k' for clusterK or 'alpha' for cluster01. These should not be the arguments needed by clusterFunction (which should be passed via the argument 'clusterArgs') but the actual arguments of cluster01 or clusterK.

alpha

a cutoff value for how much similarity is needed for drawing blocks (lower values are more strict).

findBestK

logical; whether to find the best K based on average silhouette width (only used if clusterFunction is of type "K").

k

single value giving the number of clusters to find if findBestK=FALSE (only used if clusterFunction is of type "K").

kRange

vector of integers. If findBestK=TRUE, this gives the range of k's to look over. Default is k-2 to k+20, subject to those values being greater than 2. Note that default values depend on the input k, so running for different choices of k and findBestK=TRUE can give different answers unless kRange is set to be the same.

removeSil

logical; whether to remove samples whose silhouette width is below silCutoff (only used if clusterFunction is of type "K").

silCutoff

minimum silhouette width required for a sample to be included in a cluster (only used if removeSil=TRUE).

Details

To provide a distance matrix via the argument distFunction, the function must be defined to take the distance between the rows of a matrix (internally, the function will call distFunction(t(x))). This is to be compatible with the input for the dist function. as.matrix will be applied to the output of distFunction, so it is fine if the returned object has an as.matrix method that converts it into a symmetric matrix of distances (for example, objects of class dist returned by dist have such a method). If distFunction=NA, a default distance is calculated based on the type of clustering algorithm of clusterFunction. For type "K", the default is to use dist as the distance function. For type "01", the default is (1-cor(x))/2.
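As a rough illustration of these conventions, the sketch below uses only base R on a small simulated matrix. The object names (x, manhattanDist, dUser, dK, d01) are made up for this example, and the code only mimics the described handling of distFunction and the two defaults; it does not call clusterD itself.

```r
set.seed(1)
# hypothetical p x n data matrix: 20 features (rows), 6 samples (columns)
x <- matrix(rnorm(20 * 6), nrow = 20, ncol = 6)

# a user-supplied distance function works on the rows of its input,
# so it is applied to t(x) to obtain distances between samples
manhattanDist <- function(m) dist(m, method = "manhattan")
dUser <- as.matrix(manhattanDist(t(x)))

# default described for type "K": dist() between samples
dK <- as.matrix(dist(t(x)))

# default described for type "01": (1 - cor(x)) / 2, which lies in [0, 1]
d01 <- (1 - cor(x)) / 2

dim(dUser)        # 6 x 6, one row and one column per sample
isSymmetric(d01)
range(d01)
```

Because cor(x) works on columns, the "01" default needs no transpose, whereas dist-style functions need t(x).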

Types of algorithms: cluster01 is for clustering functions that expect an input D taking values between 0 and 1 (e.g. from subclustering). clusterK is for clustering functions that require an input k, the number of clusters, but accept an arbitrary distance/dissimilarity matrix. The two are provided as separate functions so that users can supply clustering functions expecting different types of input, and so that each type can have its own shared post-processing of results (for example, removing samples with low silhouette values is appropriate for clusterK clustering functions but not for cluster01 functions). cluster01 algorithms are generally expected to use the 0-1 nature of the input to set criteria for where to find clusters, and therefore do not need a predetermined 'k'. clusterK functions, in contrast, are assumed to need a predetermined 'k' and to assign every sample to a cluster, which is why clusterK offers options to exclude poorly clustered samples based on silhouette widths.

cluster01 required format for input and output for clusterFunction: clusterFunction should be a function that takes (as a minimum) the arguments "D" and "alpha". 0-1 clustering algorithms are expected to use the fact that the D input lies in the 0-1 range to find the clusters, rather than a user-defined number of clusters; "alpha" is the parameter that tunes the finding of such clusters. For example, a candidate block of samples might be considered a cluster if all values of D are greater than or equal to 1-alpha. The output is a list in which each element corresponds to a cluster and contains the indices of the samples in that cluster. The list is expected to be in order of 'best clusters' (as defined by the clusterFunction), with the first being the best and the last being the worst.
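A minimal sketch of a function with this interface is shown below. The name blockCluster01 is hypothetical, and the function is not the package's implementation. It assumes that larger values of D mean samples belong together, following the example criterion above, and it takes "best" to mean "largest".

```r
# Hypothetical user-defined 0-1 clustering function: takes D and alpha and
# returns a list of index vectors, best cluster first.
# Assumption: D holds values in [0, 1] where larger means "more together",
# matching the example criterion "all values of D >= 1 - alpha".
blockCluster01 <- function(D, alpha = 0.1) {
  # with complete linkage, the merge height of a cluster is its largest
  # pairwise value of 1 - D, so cutting the tree at alpha keeps only
  # blocks whose entries of D are all >= 1 - alpha
  tree <- hclust(as.dist(1 - D), method = "complete")
  membership <- cutree(tree, h = alpha)
  clusters <- unname(split(seq_len(nrow(D)), membership))
  # "best" is taken here to mean largest; a real function may rank differently
  clusters[order(lengths(clusters), decreasing = TRUE)]
}

# toy 5 x 5 matrix with two clear blocks: samples 1-3 and samples 4-5
D <- matrix(0.1, 5, 5)
D[1:3, 1:3] <- 0.95
D[4:5, 4:5] <- 0.92
diag(D) <- 1

res <- blockCluster01(D, alpha = 0.1)
res
```

With a stricter alpha (for example 0.06), the 4-5 block no longer meets the criterion and is returned as two singleton clusters, which a minSize filter would then discard.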

cluster01 methods: The "tight" method refers to the method of finding clusters from a subsampling matrix used internally in the tight clustering algorithm code of Tseng and Wong. The argument for the tight method is 'minSize.core' (default=2), which sets the minimum number of samples that form a core cluster. "hierarchical01" refers to running the hclust algorithm on D and traversing down the tree until reaching a block of samples whose summary of the values is greater than or equal to 1-alpha. Arguments that can be passed to "hierarchical01" are 'evalClusterMethod', which determines how to summarize the samples' values of D[samples,samples] for comparison to 1-alpha: "maximum" (default) takes the minimum of D[samples,samples] and requires it to be less than or equal to 1-alpha; "average" requires that each row mean of D[samples,samples] be less than or equal to 1-alpha. Arguments of hclust can also be passed via clusterArgs to control the hierarchical clustering of D.

clusterK required format for input and output for clusterFunction: clusterFunction should be a function that takes as a minimum an argument 'D' and 'k'. The output must be a clustering, specified by integer values. The function silhouette will be used on the clustering to calculate silhouette scores for each observation.
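The sketch below shows what a function with this interface might look like, together with the silhouette step. The name hierK and the toy data are made up for this example. The final filtering only mimics the idea behind removeSil and silCutoff and is not the package's code.

```r
library(cluster)  # provides silhouette()

# Hypothetical user-defined "K" clustering function: takes D and k and
# returns an integer vector of cluster assignments, one per sample
hierK <- function(D, k) {
  cutree(hclust(as.dist(D)), k = k)
}

set.seed(2)
# toy data: 10 samples around 0 and 10 samples around 10, in two dimensions
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2))
D <- as.matrix(dist(pts))

cl <- hierK(D, k = 2)

# silhouette width per sample, computed from the same dissimilarities
sil <- silhouette(cl, dmatrix = D)
silWidth <- sil[, "sil_width"]

# mimic removeSil = TRUE with silCutoff = 0: samples below the cutoff
# are marked as unclustered (-1)
silCutoff <- 0
clFiltered <- ifelse(silWidth < silCutoff, -1, cl)
table(clFiltered)
```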

clusterK methods: "pam" performs pam clustering on the input D matrix using pam in the cluster package. Arguments to pam can be passed via 'clusterArgs', except for the arguments 'x' and 'k', which are given by D and k directly. "hierarchicalK" performs hierarchical clustering on the input via hclust and then applies cutree with the specified k to obtain clusters. Arguments to hclust can be passed via clusterArgs.
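To make the idea behind findBestK and kRange concrete, the following sketch runs pam from the cluster package over a range of k and picks the k with the largest average silhouette width. It is an illustration under simulated data with made-up object names, not the selection code used inside clusterK.

```r
library(cluster)  # provides pam(), which also reports silhouette information

set.seed(3)
# toy data: three well-separated groups of 10 samples in two dimensions
pts <- rbind(matrix(rnorm(20, mean = 0), ncol = 2),
             matrix(rnorm(20, mean = 10), ncol = 2),
             matrix(rnorm(20, mean = 20), ncol = 2))
D <- as.matrix(dist(pts))

kRange <- 2:6  # candidate numbers of clusters

# run pam on the dissimilarities for each k and record the
# average silhouette width of the resulting clustering
avgSil <- sapply(kRange, function(k) {
  fit <- pam(as.dist(D), k = k)
  fit$silinfo$avg.width
})

# choose the k with the largest average silhouette width
bestK <- kRange[which.max(avgSil)]
bestK
```

Because the candidate range determines which k can win, fixing kRange explicitly keeps results comparable across runs that start from different values of k.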

Value

clusterD returns a vector of cluster assignments (if format="vector") or a list of indices for each cluster (if format="list"). Clusters smaller than minSize are removed. If orderBy="size", the clusters are reordered by size rather than kept in the internal ordering of the clusterFunction.

cluster01 and clusterK return a list of the clusters found, with each element of the list corresponding to a cluster and containing a vector of the indices of the samples assigned to that cluster. Indices not included in any element are assumed not to have been clustered. The list is assumed to be ordered from the 'best' cluster onward (as defined by the clusterFunction for cluster01, or by average silhouette width for clusterK), for example in terms of greatest internal similarity of the elements.
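The relationship between the list and vector formats can be sketched as follows. The helper listToVector is hypothetical and not part of the package. It assumes clusters are numbered from largest to smallest and that unclustered samples receive -1, as described for minSize.

```r
# Hypothetical helper (not part of the package): turn a list of index
# vectors into a vector of cluster assignments for n samples
listToVector <- function(clusterList, n, minSize = 1) {
  # discard clusters with fewer than minSize samples
  keep <- clusterList[lengths(clusterList) >= minSize]
  # order the remaining clusters from largest to smallest
  keep <- keep[order(lengths(keep), decreasing = TRUE)]
  # samples not in any retained cluster get -1
  assignment <- rep(-1, n)
  for (i in seq_along(keep)) {
    assignment[keep[[i]]] <- i
  }
  assignment
}

# toy result for 8 samples: sample 8 was never clustered, and the
# cluster containing only sample 7 falls below minSize = 2
clusterList <- list(c(4, 5), c(1, 2, 3, 6), 7)
assignment <- listToVector(clusterList, n = 8, minSize = 2)
assignment
```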

Examples

data(simData)
cl1 <- clusterD(simData, clusterFunction="pam", k=3)
cl2 <- clusterD(simData, clusterFunction="hierarchical01")
cl3 <- clusterD(simData, clusterFunction="tight")

# change distance to manhattan distance
cl4 <- clusterD(simData, clusterFunction="pam", k=3,
    distFunction=function(x){dist(x, method="manhattan")})

# run hierarchical method for finding blocks, with method of evaluating
# coherence of block set to evalClusterMethod="average", and the hierarchical
# clustering using single linkage:
clustSubHier <- clusterD(simData, clusterFunction="hierarchical01", alpha=0.1,
    minSize=5, clusterArgs=list(evalClusterMethod="average", method="single"))

# do tight
clustSubTight <- clusterD(simData, clusterFunction="tight", alpha=0.1,
    minSize=5)

# two twists to pam
clustSubPamK <- clusterD(simData, clusterFunction="pam", silCutoff=0,
    minSize=5, removeSil=TRUE, k=3)
clustSubPamBestK <- clusterD(simData, clusterFunction="pam", silCutoff=0,
    minSize=5, removeSil=TRUE, findBestK=TRUE, kRange=2:10)

# note that passing the wrong arguments for an algorithm results in warnings
# (which can be turned off with checkArgs=FALSE)
clustSubTight_test <- clusterD(simData, clusterFunction="tight", alpha=0.1,
    minSize=5, removeSil=TRUE)
clustSubTight_test2 <- clusterD(simData, clusterFunction="tight", alpha=0.1,
    clusterArgs=list(evalClusterMethod="average"))