knitr::opts_chunk$set(cache=FALSE)
processBoxes<-function(type,nm){
  # Draw one node of the flowchart; relies on the globals elpos (node
  # coordinates) and i (the current node index) set in the plotting loop
  usrCoord<-par("usr")
  # label dimensions in user coordinates (handles multi-line labels)
  labWidth<-max(sapply(strsplit(nm,"\n"),strwidth,cex=1.5))
  labWidthPerct<-labWidth/diff(usrCoord[1:2])
  labHeight<-strheight(nm,cex=1.5)+0.005*diff(usrCoord[3:4])
  if(!is.na(type)){
    if(type=="function"){
      textellipse(elpos[i,], rady=labHeight, radx=labWidth, lab=nm,
                  box.col="green", shadow.col=NA, shadow.size=0, cex=1.5)
    }
    if(type=="arg"){
      textdiamond(elpos[i,], rady=labHeight*2, radx=labWidth, lab=nm,
                  box.col="grey", shadow.col=NA, shadow.size=0, cex=1.5)
    }
    if(type=="description"){
      textrect(elpos[i,], rady=labHeight*0.75, radx=labWidth*0.75, lab=nm,
               box.col="white", shadow.col=NA, shadow.size=0, cex=1.5)
    }
    if(type=="break"){
      # radx and rady are swapped because the triangle is rotated by angle=270
      textmulti(mid=elpos[i,], nr=3, radx=labHeight*3, rady=labWidth, lab=nm,
                box.col="yellow", cex=1.5, angle=270, shadow.col=NA, shadow.size=0)
    }
    if(type=="stop"){
      textmulti(mid=elpos[i,], nr=8, radx=labWidth,
                rady=labWidthPerct*diff(usrCoord[3:4])*0.75, lab=nm,
                box.col="red", cex=1.5, shadow.col=NA, shadow.size=0)
    }
  }
}
processArrows<-function(coord1,coord2,nm){
  # Draw a straight arrow between two nodes and, if a label is given,
  # place it on the left or right of the arrow depending on its direction
  pos<-straightarrow(to=coord2, from=coord1, lwd=2, arr.pos=0.6, arr.length=0.5)
  if(!is.na(nm)){
    text(pos[1], pos[2], nm, pos=if(coord1[1]<=coord2[1]) 4 else 2, offset=1)
  }
}

The goal of this package is to allow the user to try many different clustering algorithms within one package structure. In particular, the package encodes many common practices that are shared across clustering algorithms, like subsampling the data, not clustering samples with negative silhouette scores, sequentially removing clusters and reclustering, and so forth, and allows the user to simply make different choices of these parameters as well as of the underlying clustering algorithm. The package also allows the user to define their own clustering algorithm.

There are two main user functions, clusterSingle and clusterMany. clusterSingle is the wrapper function that calls the underlying functions, and clusterMany runs clusterSingle across combinations of parameter choices. Another useful function is subsampleClustering, which subsamples the data and returns an $n\times n$ matrix $D$ of the proportion of co-occurrences of samples in the same cluster.

There are three main choices that the user needs to make for clusterSingle: (1) whether to subsample the data and cluster the resulting co-occurrence matrix (the subsample argument); (2) how to cluster the dissimilarity matrix $D$ (the clusterFunction argument and related arguments); and (3) whether to use the sequential removal of clusters (the sequential argument).

clusterSingle calls underlying functions for each of these tasks, each of which can take many arguments. For simplicity of syntax, the user passes arguments to each of these underlying functions as lists of arguments.

For this reason, the next sections will go through the functions that form the components of clusterSingle so as to better explain the possible options that can be passed.

Basic Clustering Sequence

First we will discuss the case sequential=FALSE, i.e. when the user does not choose the sequential removal of clusters. In this case the algorithm performs what we call the "basic clustering sequence" (this basic sequence is also called iteratively by the sequential method).

There are two parts to the basic clustering sequence: (1) optionally subsampling the data and clustering each subsample to produce a co-occurrence matrix $D$ (if subsample=TRUE), done by subsampleClustering; and (2) clustering the $n\times n$ matrix $D$ (which is the standard distance matrix if subsample=FALSE), done by clusterD.

We will load some simulated data, included with the package, that has four underlying clusters. First we will do a basic clustering of the data into 4 clusters, using pam on the standard distance matrix.

library(clusterExperiment)
data(simData)
simpleCluster<-clusterSingle(simData, subsample=FALSE, 
                          sequential=FALSE, clusterFunction="pam",mainClusterArgs=list('k'=4))
#compare to direct call of pam
library(cluster)
pamCluster<-pam(dist(simData),4)
table(pamCluster$clustering,simpleCluster$clustering)

Of course, this is not the best use case, since the object returned by pam itself has much more information. Let's instead choose to do subsampling, and then cluster the co-occurrence matrix.

subsampleCluster<-clusterSingle(simData, subsample=TRUE, 
                             sequential=FALSE, clusterFunction="pam",mainClusterArgs=list('k'=4))
table(pamCluster$clustering,subsampleCluster$clustering)

Notice that we got a warning. This is because we could have chosen a different number of clusters for clustering the subsampled data than for clustering the co-occurrence matrix. Similarly, the clusterFunction argument is passed to clusterD -- i.e. to the clustering that is done on the co-occurrence matrix -- and we could use something different for clustering the subsampled data. For example, the following clusters the subsampled data with k=2 using kmeans, but then clusters the co-occurrence matrix with k=4 and pam.

subsampleCluster2<-clusterSingle(simData, subsample=TRUE, sequential=FALSE,
                              clusterFunction="pam",mainClusterArgs=list('k'=4),
                              subsampleArgs=list("k"=2, clusterFunction="kmeans"))
table(subsampleCluster2$clustering,subsampleCluster$clustering)

If we want to see the co-occurrence matrix, rather than just cluster it, we need to call subsampleClustering directly, in which case we can plot it and visualize the results.

subD<-subsampleClustering(simData, clusterFunction="pam",k=2)
require(NMF)
aheatmap(subD)

clusterSingle Diagram

The following diagram shows how these clustering functions are put together by clusterSingle to form the basic clustering sequence; functions are indicated by ovals in the flowchart.

yadj<-0.05
xadj<-yadj
library(diagram)
openplotmat()
#####
#make node matrix
#####
elpos  <- coordinates (c(1,2,2,4,4))
rnames<-c("1:subsample","2:clusterD","3:subsampleClustering","4:typeAlg","5:clusterD","6:cluster01","7:clusterK","8:typeAlg","8:typeAlg","remove","remove","9:cluster01","10:clusterK")
row.names(elpos)<-rnames
#the duplicated "8:typeAlg" node exists only to get nice spacing from
#coordinates(); replace the duplicates with one node at their average position
whDup<-"8:typeAlg"
newCoordX<-mean(elpos[which(rnames==whDup),1])
newCoordY<-unique(elpos[which(rnames==whDup),2])
whRm<-which(rnames%in%c(whDup,"remove"))
elpos<-elpos[-whRm,]
elpos<-rbind(elpos,"8:typeAlg"=c(newCoordX,newCoordY))
typeBox<-c("1:subsample"="arg","2:clusterD"="function","3:subsampleClustering"="function","4:typeAlg"="arg","5:clusterD"="function","6:cluster01"="function","7:clusterK"="function","8:typeAlg"="arg","9:cluster01"="function","10:clusterK"="function")
typeBox<-typeBox[match(row.names(elpos),names(typeBox))]
#####
#make edge matrix
#####
fromtoChar<-rbind(c("1:subsample","2:clusterD"),c("1:subsample","3:subsampleClustering"),c("2:clusterD","4:typeAlg"),c("4:typeAlg","6:cluster01"),c("4:typeAlg","7:clusterK"),c("3:subsampleClustering","5:clusterD"),c("5:clusterD","8:typeAlg"),c("8:typeAlg","9:cluster01"),c("8:typeAlg","10:clusterK")
    )
fromto <- cbind(match(fromtoChar[,1],row.names(elpos)),match(fromtoChar[,2],row.names(elpos)))
row.names(fromto)<-c("FALSE","TRUE",NA,'"01"','"K"',NA,NA,'"01"','"K"')
nr     <- nrow(fromto)
############
###Draw Arrows:
############
arrpos <- matrix(ncol = 2, nrow = nr)
for (i in 1:nrow(fromto)){
  processArrows(elpos[fromto[i, 1], ],elpos[fromto[i, 2], ],row.names(fromto)[i])
}

###########
## Draw boxes
###########
nicenames<-sapply(strsplit(row.names(elpos),":"),.subset2,2)
for(i in 1:nrow(elpos)){
  processBoxes(typeBox[i],nicenames[i])
}

We will go through each of these functions in what follows.

Subsampling and Clustering subsampled data

The subsampling of the data and the clustering of the resampled data is done by the function subsampleClustering. It takes as input an $n\times p$ data matrix, with samples on the rows. It subsamples the samples (i.e. rows) so that a proportion samp.p of the samples is in each subsample. Each subsample is then clustered (based on the clusterFunction argument). clusterFunction can be a user-specified function, but can also take on character values to indicate that the function should use the built-in clustering functions (currently 'kmeans' and 'pam').

This subsampling and clustering is done resamp.num times. The end result is an $n\times n$ co-occurrence matrix giving the proportion of times that each pair of samples was in the same cluster (out of those subsamples where both samples were chosen).
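
To make the construction concrete, here is an illustrative sketch (not the package's implementation) of how such a co-occurrence matrix can be built by hand, with resamp.num=100 and samp.p=0.7:

#illustrative only: count, for each pair, how often both were subsampled
#and how often they landed in the same cluster
set.seed(1)
n <- nrow(simData)
coCount <- together <- matrix(0, n, n)
for (b in seq_len(100)) {
  idx <- sample(n, size = round(0.7 * n))
  cl <- cluster::pam(dist(simData[idx, ]), k = 4, cluster.only = TRUE)
  together[idx, idx] <- together[idx, idx] + 1
  coCount[idx, idx] <- coCount[idx, idx] + outer(cl, cl, "==")
}
coOccur <- ifelse(together > 0, coCount / together, 0)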

The function also allows the co-occurrence to be based not just on the samples that are subsampled together, but on all samples, by assigning the non-subsampled samples to a cluster; this is done by setting the option classifyMethod="All". In this case, the co-occurrence calculation is based not directly on the clustering of the subsampled data, but on the assignment of all samples to clusters, which is done via the classifyFunction function (again either user-defined, or set automatically when the user chooses 'kmeans' or 'pam' for the clusterFunction option). Similarly, the user can choose classifyMethod="OutOfSample" to indicate that co-occurrence should be based only on pairs of samples that are both not included in a subsample (in which case the samples are assigned to a cluster via classifyFunction, as described above).
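
For example, a minimal sketch (assuming classifyMethod is passed directly to subsampleClustering, just as k is in the call above):

#base the co-occurrence on all samples by classifying the non-subsampled
#samples (classifyFunction is set automatically for 'pam')
subDAll <- subsampleClustering(simData, clusterFunction = "pam", k = 2,
                               classifyMethod = "All")
aheatmap(subDAll)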

Clustering

clusterD is the basic function that clusters an input $n\times n$ dissimilarity matrix $D$ based on the clusterFunction option given by the user (which can be either a character string choosing a predetermined option or a user-defined function). There are (currently) two flavors of clusterD supported, depending on the nature of the dissimilarity matrix $D$ -- meaning that clusterD calls one of two different functions (cluster01 or clusterK) to do the clustering. If the user chooses one of the default clustering methods by setting clusterFunction to be a character value (one of "pam", "tight", or "hierarchical01"), then clusterD automatically recognizes what kind of clustering algorithm ('01' or 'K') is being used and calls it.

However, if the user provides their own clustering algorithm as the input to clusterFunction, then the user must also give the argument typeAlg to indicate what type of clustering function they have provided. This is because the output of '01' and 'K' clusterings is treated differently by the function.

Each type of clustering algorithm has minimal requirements for the arguments to clusterFunction (described below). Additional arguments can be passed via the clusterArgs argument, as in subsampleClustering.

cluster01

cluster01 is intended for clustering algorithms that assume the elements of the input matrix $D$ are in $[0,1]$ and use the values of the elements of $D$ to determine the clusters (i.e. the clustering algorithm doesn't require an input $K$ to define the number of clusters). The intended use case is when $D$ is based on subsampling and clustering the samples, so that $D$ is a co-occurrence matrix.

If the clusterFunction is of type '01', then the following additional argument can be passed to clusterD: alpha. alpha is a required parameter for '01' algorithms and is by default set to 0.01.

Built-in clustering functions for '01' type algorithms include 'tight' and 'hierarchical01', both of which search for sets of indices $\mathcal{I}$ of samples for which the submatrix $D[\mathcal{I},\mathcal{I}]$ has values close to $1$ in order to determine a cluster. In this way, instead of requiring the number of clusters, there is a tuning parameter $\alpha\in(0,1)$ which determines whether the values of $D[\mathcal{I},\mathcal{I}]$ are sufficiently close to $1$.

clusterFunction should be a function that takes (at a minimum) arguments D and alpha. Additional arguments can be present and passed to clusterFunction via the clusterArgs option of clusterD. The output of clusterFunction must be a list with each element of the list corresponding to a cluster and containing the indices of the samples that are in that cluster. The list is expected to be in order of 'best clusters' (as defined by the clusterFunction), with the first being the best and the last the worst.
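
To make these requirements concrete, here is a minimal sketch of a user-defined '01' function; the final commented call to clusterD, and its exact argument names, are an assumption based on the description above:

#hypothetical '01' clusterFunction: greedily group samples whose
#co-occurrence with a seed sample is at least 1 - alpha
myCluster01 <- function(D, alpha, ...) {
  unassigned <- seq_len(nrow(D))
  clusters <- list()
  while (length(unassigned) > 0) {
    seed <- unassigned[1]
    members <- unassigned[D[seed, unassigned] >= 1 - alpha]
    clusters[[length(clusters) + 1]] <- members
    unassigned <- setdiff(unassigned, members)
  }
  #order clusters 'best' to worst; here, simply by size
  clusters[order(sapply(clusters, length), decreasing = TRUE)]
}
#assumed call: clusterD(D=subD, clusterFunction=myCluster01, typeAlg="01", alpha=0.1)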

Built-in 01 Methods: Setting clusterFunction="tight" refers to the method of finding clusters from a subsampling matrix given in the tight algorithm of Tseng and Wong. The code is copied from the tight.clust package (note that the description in the paper is vague; see the code for how it works). The function starts by finding a core cluster of samples which have similarity 1, and then adds samples to that core cluster that all have similarity greater than or equal to 1-alpha. An argument for the tight method is minSize.core (default=2), which sets the minimum number of samples that form a core cluster -- i.e. how many samples must have similarity exactly 1 to start building a cluster.

Setting clusterFunction="hierarchical01" refers to running the hclust algorithm on D and transversing down the hierarchical tree until getting a block of samples $\mathcal{I}$ such that $f(D[\mathcal{I},\mathcal{I}])$ is greater than or equal to 1-alpha. Arguments that can be passed to 'hierarchical' are evalClusterMethod which determines the function $f$ -- i.e. how to summarize the values of $D[\mathcal{I},\mathcal{I}]$ in order to determine if the block of samples is sufficiently similar. If evalClusterMethod="minimum" (default) takes as $f$ simply the minimum of $D[\mathcal{I},\mathcal{I}]$ and requires it to be greater than or equal to 1-alpha to be a cluster. evalClusterMethod="average" takes $f$ to be the minimum of the row means of $D[\mathcal{I},\mathcal{I}]$. Arguments to hclust can also be passed via clusterArgs to control the hierarchical clustering of D.

clusterK

clusterK is intended for algorithms that expect the user to give a value $K$ for the number of clusters. This function then includes options appropriate for such algorithms, such as searching over a range of values of $K$ and returning the 'best' clustering, or removing samples with low silhouette values.

If the clusterFunction is of type 'K', then the following additional arguments can be passed to clusterD: k, findBestK, removeSil, kRange, and silCutoff (see below for descriptions).

For clusterK functions, clusterFunction should be a function that takes at a minimum arguments D and k. The output must be a list similar to the partition.object of the cluster package. Specifically, it must have an element clustering which gives the vector of cluster assignments, and an element silinfo like that of the partition.object -- that is, a list with silhouette values.

Whether these silhouette values are actually silhouette values is up to the clusterFunction, but they will be used in the following way. If findBestK=TRUE, silinfo$avg.width will be used to pick the best $K$, and the set of values of $K$ that will be searched over is given by kRange. If removeSil=TRUE, then samples whose value of silinfo$widths[,"sil_width"] is less than silCutoff will be excluded. In addition, clusters will be ordered by the average of the values of silinfo$widths[,"sil_width"] in each cluster (after removing poorly clustered samples, if removeSil=TRUE).
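
A minimal sketch of a user-defined 'K' function satisfying these requirements, here simply wrapping pam (the function name is illustrative):

#hypothetical 'K' clusterFunction: must accept D and k and return a list
#with 'clustering' and 'silinfo', as in cluster's partition.object
myClusterK <- function(D, k, ...) {
  out <- cluster::pam(x = D, k = k, diss = TRUE, ...)
  list(clustering = out$clustering, silinfo = out$silinfo)
}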

Built-in clusterK methods: clusterFunction="pam" performs pam clustering on the input D using pam in the cluster package. Arguments to pam can be passed via clusterArgs, except for the arguments x and k which are given by D and k directly in the call to clusterD.
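
For example, extra pam arguments can go through clusterArgs (a sketch; the exact clusterD interface, with D and k passed directly, is assumed from the description above):

#assumed interface: D and k are given directly; do.swap is forwarded to pam
pamOut <- clusterD(D = as.matrix(dist(simData)), clusterFunction = "pam",
                   k = 4, clusterArgs = list(do.swap = FALSE))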

Sequential option

The clusterSingle function also provides the ability to do sequential clustering, namely clustering the data, removing the cluster that is found, and reclustering the remaining data. Since clustering algorithms can be sensitive to which set of samples is being used, this can have an impact on the clustering results. Furthermore, it can sidestep the question of setting the number of clusters k: the number of clusters is instead determined by when the sequential strategy no longer finds any more clusters (or has too few samples left) to cluster. Of course, the stopping rules themselves require tuning parameters, so this is only a gain if the stopping rules are more intuitive or defensible than picking the number of clusters.

The sequential strategy that is currently implemented follows the 'tight' algorithm of Tseng and Wong, though we have modified its implementation to be more appropriate to the clustering of samples, rather than the gene-clustering setting of the original tight algorithm, as well as allowing for many more choices.

In short, the sequential strategy iterates the basic clustering sequence described above, continually increasing the k parameter of subsampleClustering until a stable cluster is found. A stable cluster is one of the clusters found using $k$ whose membership is similar enough to that of one of the clusters found using $k-1$, where 'similar enough' is determined by the parameter beta, $$\text{Put equation here for similarity}$$

Once such a cluster is found, the samples in that cluster are removed, and the process of finding another stable cluster is restarted.
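
The following is a hedged pseudocode sketch of one pass of this search (the actual implementation lives in seqCluster; basicClusteringSequence and membershipSimilarity are hypothetical helpers standing in for the steps described above):

findStableCluster <- function(dat, k.start, k.max, beta) {
  prev <- basicClusteringSequence(dat, k = k.start)  #hypothetical helper
  for (k in (k.start + 1):k.max) {
    cur <- basicClusteringSequence(dat, k = k)
    #compare every cluster at k to every cluster at k-1; 'stable' means
    #membership similarity of at least beta for some pair
    sims <- outer(seq_along(cur), seq_along(prev),
                  Vectorize(function(i, j) membershipSimilarity(cur[[i]], prev[[j]])))
    if (any(sims >= beta)) {
      best <- which(sims == max(sims), arr.ind = TRUE)[1, ]
      return(cur[[best["row"]]])  #the stable cluster: its samples get removed
    }
    prev <- cur
  }
  NULL  #no stable cluster found before k.max: the strategy stops
}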

Sequential Clustering Diagram

The following image represents how the sequential strategy works:

yadj<-0.05
xadj<-yadj
require(diagram)
par(mar=c(.1,.1,.1,1))
openplotmat()
#####
#make node matrix
#####
elpos  <- coordinates (c(1,1,rep(2,4),4,4,4,8))
rnames<-c("1:clusterSingle","2:sequential","3:Basic Clustering\n Routine","4:seqCluster","remove","5:Set k.start=k0\nn=No.samples","remove","6:Iterate Basic\nClustering Routine\nk=k.start,kstart+1,...","remove","7:Break if",
          #"remove","remove","8:k=k.max &\n no similarity < beta", "9:similarity of\n(k,k-1)<beta",
          "remove","remove","10:Stop","11:Cluster Found\nRemove Samples","remove","remove","remove","12:Reset n","remove","remove","remove","13:n<remain.n",rep("remove",6),"14:Stop","15:Restart with\nk.start=k.start-1\n(but no less than k.min)")
row.names(elpos)<-rnames
elpos<-elpos[-which(row.names(elpos)=="remove"),]
elpos[,1]<-elpos[,1]-min(elpos[,1])*.4
elpos["14:Stop",1]<-elpos["14:Stop",1]-.1

typeBox<-c("1:clusterSingle"="function","2:sequential"="arg","3:Basic Clustering\n Routine"="description","4:seqCluster"="function","5:Set k.start=k0\nn=No.samples"="description","6:Iterate Basic\nClustering Routine\nk=k.start,kstart+1,..."="description","7:Break if"="break",
   #"8:k=k.max &\n no similarity < beta"="description",    "9:similarity of\n(k,k-1)<beta"="description" ,
   "10:Stop"="stop","11:Cluster Found\nRemove Samples"="description",   "12:Reset n"="description","13:n<remain.n"="arg","14:Stop"="stop","15:Restart with\nk.start=k.start-1\n(but no less than k.min)"="description")
typeBox<-typeBox[match(row.names(elpos),names(typeBox))]
#####
#make edge matrix
#####

fromtoChar<-rbind(c("1:clusterSingle","2:sequential"),c("2:sequential","3:Basic Clustering\n Routine"),c("2:sequential","4:seqCluster"),
                  c("4:seqCluster","5:Set k.start=k0\nn=No.samples"),c("5:Set k.start=k0\nn=No.samples","6:Iterate Basic\nClustering Routine\nk=k.start,kstart+1,..."),c("6:Iterate Basic\nClustering Routine\nk=k.start,kstart+1,...","7:Break if"),
                  c("7:Break if","10:Stop"), c("7:Break if","11:Cluster Found\nRemove Samples"),c("11:Cluster Found\nRemove Samples","12:Reset n"),c("12:Reset n","13:n<remain.n"),c("13:n<remain.n","14:Stop"),c("13:n<remain.n","15:Restart with\nk.start=k.start-1\n(but no less than k.min)")
)
                           #c("7:Break if","8:k=k.max &\n no similarity < beta"),c("7:Break if","9:similarity of\n(k,k-1)<beta"),c("8:k=k.max &\n no similarity < beta","10:Stop"),c("9:similarity of\n(k,k-1)<beta","11:Cluster Found\nRemove Samples"),                                   
fromto <- cbind(match(fromtoChar[,1],row.names(elpos)),match(fromtoChar[,2],row.names(elpos)))
row.names(fromto)<-c(NA,"FALSE","TRUE",rep(NA,3),"k=k.max &\n no similarity < beta","similarity of\n(k,k-1)<beta",rep(NA,2),"TRUE","FALSE")
############
###Draw Arrows:
############
for (i in 1:nrow(fromto)){
  processArrows(elpos[fromto[i, 1], ],elpos[fromto[i, 2], ],row.names(fromto)[i])
}
curvedarrow(from = elpos["15:Restart with\nk.start=k.start-1\n(but no less than k.min)", ], to = elpos["5:Set k.start=k0\nn=No.samples", ]+c(.05,0),  lty = 1, lwd=5, lcol = 2,arr.pos=0.8,curve=0.5,arr.lwd=8,arr.length=0.6)
selfarrow(pos=elpos["6:Iterate Basic\nClustering Routine\nk=k.start,kstart+1,...", ]+c(.1,0), path="R", lty = 1, lwd=2, lcol = 2,arr.pos=.2,curve=c(0.1,0.05))
###########
## Draw boxes
###########
nicenames<-sapply(strsplit(row.names(elpos),":"),.subset2,2)
for(i in 1:nrow(elpos)){
  processBoxes(typeBox[i],nicenames[i])
}

Compare many parameters (clusterMany)

clusterMany is a function for running many different combinations of parameters or different datasets. Its main goal is to make it easy to run many different parameter choices and to reduce the coding overhead for doing so. clusterMany calls clusterSingle using lapply (or mclapply if parallelizing the computation). Example uses of clusterMany would be to run the same clustering over datasets representing different numbers of dimensions in a PCA dimensionality reduction, to run many different choices of alpha in the clusterD clustering algorithm, or to compare clusterings based on different algorithms -- or all of the above!

clusterMany basically takes all combinations of all the user-defined parameters and runs clusterSingle across these combinations (essentially an expand.grid). Along the way, the function removes some obviously irrelevant combinations (e.g. combinations that vary only in alpha and findBestK are redundant, since those are options for entirely different types of clustering algorithms and will never be used together).

The user can choose run=FALSE in the call to clusterMany to skip the actual evaluation of the clusterings and instead return just a matrix giving the set of parameter combinations implied by the user's call to clusterMany. This allows the user, for example, to see how many different combinations their call implies, or how many of them involve sequential=TRUE or subsample=TRUE. It also lets the user see the colnames of the output they would get if they ran clusterMany, which can aid further coding while the computationally expensive command is being run. Setting run=FALSE also allows the user to prune some combinations out (see below). Setting run=FALSE is a recommended first step before launching computationally expensive clustering runs over many parameters.
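
A hedged sketch of such a call (the argument names follow the descriptions in this vignette; treat it as illustrative rather than the definitive interface):

#inspect the implied parameter combinations without running any clustering
checkParams <- clusterMany(simData, clusterFunction = c("pam", "hierarchical01"),
                           ks = 2:4, sequential = c(FALSE, TRUE),
                           subsample = c(FALSE, TRUE), run = FALSE)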

User-defined choices of parameters

clusterMany does allow for some mild tinkering with the parameters via the argument paramMatrix, where the user can provide a matrix of parameter choices. There are very few checks of this input, and the function is not robust to invalid choices! The intended use case for the paramMatrix argument is a user who does not want to run all combinations of their choices -- only a subset of them -- yet still wants to make use of the clusterMany function to avoid additional coding. The user can run clusterMany with run=FALSE to get the internally created paramMatrix, delete some of the combinations simply by deleting rows of the full paramMatrix, and feed the pruned matrix back into clusterMany.
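
Continuing the sketch above (where the parameter matrix lives in the run=FALSE output is assumed here; 'paramMatrix' as an element name is hypothetical):

paramMat <- checkParams$paramMatrix   #hypothetical element name
paramMat <- paramMat[-c(3, 7), ]      #e.g. drop two unwanted combinations
prunedRun <- clusterMany(simData, paramMatrix = paramMat)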

Users can also change the values and construct their own combinations of parameters -- but they should do so at their own risk, because there are very few checks.

Limitations of clusterMany

However, there can be sets of parameter choices that are not realizable using clusterMany, even by defining your own paramMatrix. clusterMany makes some choices and interpretations of the parameters that the user cannot override. A current example is varying the $K$ used for clustering the subsampled data when subsample=TRUE. Individual calls to clusterSingle can set this $K$ via subsampleArgs=list("k"=...) (assuming that sequential=FALSE). However, clusterMany has, for simplicity, a single argument ks that means different things in different contexts. When sequential=TRUE, ks defines the k0 argument of seqCluster, which therefore also sets the $K$ for clustering the subsampled data when subsample=TRUE. When findBestK=TRUE (for the clusterD algorithm), ks also defines kRange, if kRange is not already defined by the user in mainClusterArgs. For cases where findBestK=TRUE, sequential=FALSE, and subsample=TRUE, the $K$ for clustering the subsampled data MUST be passed via the argument subsampleArgs. And if findBestK=FALSE, then the ks argument defines both the $K$ for clustering the subsampled data and the $K$ used for clustering the resulting co-occurrence matrix $D$ (overriding any user specification of either of those parameters via mainClusterArgs or subsampleArgs).
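
For instance, in the findBestK=TRUE, sequential=FALSE, subsample=TRUE case just described, the subsampling $K$ has to be supplied explicitly (a hypothetical call; the argument names follow the text above):

cmBestK <- clusterMany(simData, clusterFunction = "pam", findBestK = TRUE,
                       sequential = FALSE, subsample = TRUE,
                       subsampleArgs = list(k = 4))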

As the above example makes clear, clusterMany is a convenience wrapper that chooses simplicity of the input parameters over fine specification by the user, and in doing so makes subtle choices for the user that are deemed reasonable. It is intended to let the user explore parameters painlessly; for finer control, the user needs to write their own wrapper around clusterSingle.


