knitr::opts_chunk$set(cache=FALSE) processBoxes<-function(type,nm){ # labWidth<-max(strwidth(nm),xadj) # labHeight<-max(strheight(nm),yadj) usrCoord<-par("usr") labWidth<-max(sapply(strsplit(nm,"\n"),strwidth,cex=1.5))#+0.005*diff(usrCoord[1:2]) labWidthPerct<-labWidth/diff(usrCoord[1:2]) #labWidth<-strwidth(nm)*1.2 labHeight<-strheight(nm,cex=1.5)+0.005*diff(usrCoord[3:4]) if(is.na(type)){ } else{ if(type=="function"){ textellipse(elpos[i,], rady=labHeight,radx=labWidth,lab =nm , box.col = "green", shadow.col = NA, shadow.size = 0, cex = 1.5) } if(type=="arg"){ textdiamond(elpos[i,], rady=labHeight*2,radx=labWidth,lab =nm , box.col = "grey", shadow.col = NA, shadow.size = 0, cex = 1.5) } if(type=="description"){ textrect(elpos[i,], rady=labHeight*.75,radx=labWidth*.75,lab =nm , box.col = "white", shadow.col = NA, shadow.size = 0, cex = 1.5) } if(type=="break"){ textmulti(mid = elpos[i,], nr = 3, radx = labHeight*3, rady = labWidth, lab = nm, box.col="yellow",cex=1.5,angle=270, shadow.col = NA, shadow.size = 0) #inverted because angle is 270 } if(type=="stop"){ textmulti(mid = elpos[i,], nr = 8, radx = labWidth, rady = labWidthPerct*diff(usrCoord[3:4])*.75, lab = nm, box.col="red",cex=1.5, shadow.col = NA, shadow.size = 0) } } } processArrows<-function(coord1,coord2,nm){ pos<-straightarrow (to = coord2, from = coord1, lwd = 2, arr.pos = 0.6, arr.length = 0.5) if(!is.na(nm)){ text(pos[ 1], pos[ 2], nm,pos=if(coord1[1]<=coord2[1]) 4 else 2,offset=1) } }
The goal of this package is to allow the user to try many different clustering algorithms in one package structure. In particular, the package encodes many common practices that are shared across clustering algorithms, like subsampling the data, not clustering samples with negative silhouete scores, sequentially removing clusters and reclustering, and so forth, and allows the user to simply make different choices the parameters as well as the underlying clustering algorithm. The package also allows the user to define their own clustering algorithm.
There are two main user-functions, clusterSingle
and clusterMany
. clusterSingle
is the wrapper function that calls the underlying functions and clusterMany
runs clusterSingle
across combinations of parameter choices. Another useful function is subsampleClustering
which does subsampling of the data and returns a $nxn$ matrix $D$ of the proportion of co-occurances of samples in the same cluster.
There are three main choices that the user needs to make for clusterSingle
:
clusterFunction
). subsample=TRUE
). Otherwise dist(x)
is clustered (subsample=FALSE
). sequential=TRUE
) which finds the best cluster, removes those samples, and then starts the clustering again. The clusterSingle
calls underlying functions for each of these tasks, each of which can take many arguments. For simplicity in syntax, the user passes arguments to each of these underlying functions as lists of arguments.
clusterD
and arguments to clusterD
are passed as a list via the argument mainClusterArgs
. See Section XXX below on clusterD
function for more details about possible arguments.subsample=TRUE
: if subsampling, subsampleClustering
is called and arguments to subsampleClustering
are passed as a list via the argument subsampleArgs
. See Section XXX below on subsampleClustering
. sequential=TRUE
: if sequential method, seqCluster
is called, and arguments to seqCluster
are passed as a list via the argument seqArgs
. See Section XXX below on seqCluster
.For this reason, the next sections will go through the functions that form the components of clusterSingle
so as to better explain the possible options that can be passed.
First we will discuss if sequential=FALSE
, i.e. the user does not choose to use the sequential removal of clusters. Then the algorithm performs what we call the "basic clustering sequence" (this basic sequence is also called iteratively by the sequential method).
There are two parts to the basic clustering sequence:
(Optional, if subsample=TRUE
) Subsample data and cluster it repeatedly via the function subsampleClustering
. The arguments passed to subsampleClustering
are via subsampleArgs
in clusterSingle
. The most important arguments that a user may need to pass to subsampleCluster
are:
x
the $n\times p$ data matrix (which is handled by the argument x
to clusterSingle
so the user does not need to set it)
k
the number of clusters to fit in the subsampled data. This must be passed by the user via subsampleArgs
(unless sequential=TRUE
, see below, where it is set internally)clusterFunction
the function to use to cluster the subsampled data. The default is pam
from the cluster
package but can also be user-defined. The argument clusterFunction
of clusterSingle
is not passed to this argument -- clusterSingle
by default uses the pam
function -- so the user must pass this via subsampleArgs
if they want a different function.
Clustering an input $D$ matrix clusterD
, which can be either the co-occurance matrix from subsampleClustering
or otherwise is calculated as dist(x)
. The arguments passed to clusterD
are via the argument mainClusterArgs
in clusterSingle
. The most important arguments are:
D
the $n\times n$ matrix of dissimilaritiesclusterFunction
the type of clustering to do on the dissimilarity matrix.clusterSingle
(i.e. do not need to be passed with a list of arguments for). Therefore clusterFunction
argument of clusterSingle
is passed to clusterD
(and not to subsampleClustering
)clusterFunction
some other arguments may be necessary to pass via a list of arguments to mainClusterArgs
(e.g. if "pam", must also pass an argument k
or findBestK=TRUE
)We will load some simulated data that comes with the package with four underlying clusters. First we will do just basic clustering of the data for 4 clusters using pam
on the standard distance matrix.
library(clusterExperiment) data(simData) simpleCluster<-clusterSingle(simData, subsample=FALSE, sequential=FALSE, clusterFunction="pam",mainClusterArgs=list('k'=4)) #compare to direct call of pam library(cluster) pamCluster<-pam(dist(simData),4) table(pamCluster$clustering,simpleCluster$clustering)
Of course, this is not the best use case, since the pam object from pam
has much more information. Let's instead choose to do subsampling, and then cluster the co-occurance matrix.
#library(clusterPackage) subsampleCluster<-clusterSingle(simData, subsample=TRUE, sequential=FALSE, clusterFunction="pam",mainClusterArgs=list('k'=4)) table(pamCluster$clustering,subsampleCluster$clustering)
Notice that we got a warning. This is because we could have chosen to cluster with a different number of clusters on the subsampled data than we clustered the co-occurance data. Similarly, the 'clusterFunction' argument is passed to clusterD
-- i.e. the clustering that is done on the co-occurance matrix. We could use something different for clustering of the subsampled data. For example, the following clusters the subsampled data with k=2 and use kmeans, but then clusters the co-occurance matrix with k=4 and pam.
subsampleCluster2<-clusterSingle(simData, subsample=TRUE, sequential=FALSE, clusterFunction="pam",mainClusterArgs=list('k'=4), subsampleArgs=list("k"=2, clusterFunction="kmeans")) table(subsampleCluster2$clustering,subsampleCluster$clustering)
If we want to see the co-occurance matrix, rather than just cluster it, we need to call subsampleClustering
directly, in which case we can plot it and visualize the results.
subD<-subsampleClustering(simData, clusterFunction="pam",k=2) require(NMF) aheatmap(subD)
clusterSingle
DiagramThe following image shows how these clustering functions are put together by clusterSingle
to form the basic clustering sequence, where the functions are indicated by ovals in the flowchart.
#results = 'asis' #cat('\n<img src="flowchartBasicClustering.png" width="300" height="300" />\n') yadj<-0.05 xadj<-yadj library(diagram) openplotmat() ##### #make node matrix ##### elpos <- coordinates (c(1,2,2,4,4)) rnames<-c("1:subsample","2:clusterD","3:subsampleClustering","4:typeAlg","5:clusterD","6:cluster01","7:clusterK","8:typeAlg","8:typeAlg","remove","remove","9:cluster01","10:clusterK") row.names(elpos)<-rnames #made 'fake' so nice. need to take average whDup<-"8:typeAlg" newCoordX<-mean(elpos[which(rnames==whDup),1]) newCoordY<-unique(elpos[which(rnames==whDup),2]) whRm<-which(rnames%in%c(whDup,"remove")) elpos<-elpos[-whRm,] elpos<-rbind(elpos,"8:typeAlg"=c(newCoordX,newCoordY)) typeBox<-c("1:subsample"="arg","2:clusterD"="function","3:subsampleClustering"="function","4:typeAlg"="arg","5:clusterD"="function","6:cluster01"="function","7:clusterK"="function","8:typeAlg"="arg","9:cluster01"="function","10:clusterK"="function") typeBox<-typeBox[match(row.names(elpos),names(typeBox))] ##### #make edge matrix ##### fromtoChar<-rbind(c("1:subsample","2:clusterD"),c("1:subsample","3:subsampleClustering"),c("2:clusterD","4:typeAlg"),c("4:typeAlg","6:cluster01"),c("4:typeAlg","7:clusterK"),c("3:subsampleClustering","5:clusterD"),c("5:clusterD","8:typeAlg"),c("8:typeAlg","9:cluster01"),c("8:typeAlg","10:clusterK") ) fromto <- cbind(match(fromtoChar[,1],row.names(elpos)),match(fromtoChar[,2],row.names(elpos))) row.names(fromto)<-c("FALSE","TRUE",NA,'"01"','"K"',NA,NA,'"01"','"K"') nr <- nrow(fromto) ############ ###Draw Arrows: ############ arrpos <- matrix(ncol = 2, nrow = nr) for (i in 1:nrow(fromto)){ processArrows(elpos[fromto[i, 1], ],elpos[fromto[i, 2], ],row.names(fromto)[i]) } ########### ## Draw boxes ########### nicenames<-sapply(strsplit(row.names(elpos),":"),.subset2,2) for(i in 1:nrow(elpos)){ processBoxes(typeBox[i],nicenames[i]) }
We will go through each of these functions in what follows.
The subsampling of the data and clustering of resampled data is done by the function subsampleClustering
. It takes as input a $n\times p$ data matrix, with samples on the rows. It subsamples the samples (i.e. rows) so that samp.p
proportion of the samples are in each subsample. The subsample is then clustered (based on the clusterFunction
argument). clusterFunction
can be a user-specified function, but can also take on character values to indicate that the function should use the built-in clustering functions (currently 'kmeans' and 'pam').
This subsampling and clustering is done resamp.num
times. The end results is a $n\times n$ co-occurance matrix of the proportion of times that each sample was in the same cluster (out of those subsamples where both samples were chosen).
The function also allows for co-occurance to be based not on just those samples that subsample together, but on all samples by assigning the non-subsampled samples to a cluster by setting the option classifyMethod="All"
. In this case, the co-occurance calculation is based not directly on the clustering of the subsampled data, but the assignment of all samples to clusters, which is done via the classifyFunction
function (again either user-defined, or set automatically when the user choosing 'kmeans' or 'pam' for the clusterFunction
option). Similarly the user can choose classifyMethod="OutOfSample"
to indicate that co-occurance should be based only on when two samples are both not included in a subsample (in which case the samples are assigned to a cluster via classifyFunction
, as described above).
clusterD
is the basic function that clusters an input $n x n$ dissimilarity matrix $D$ based on the clusterFunction
option given by the user (which can be either a character string of pre-detemined options or a user-defined functions). There are (currently) two flavors of clusterD
supported, depending on the nature of the dissimilarity matrix $D$ -- meaning that clusterD
calls one of two different functions (cluster01
and clusterK
) to do the clustering. If the user chooses one of the default clustering methods by setting clusterFunction
to be a character value (one of "pam","tight",or "hierarchical01") then clusterD
automatically recognizes what kind of clustering algorithm ('01' or 'K') is being used and calls it.
However, if the user defines a clustering algorithm in their input to clusterFunction
, then the user must also give the argument typeAlg
to indicate what type of clustering function they are provided. This is because the output of '01' and 'K' clusterings are treated differently by the function.
Each type of clustering algorithm has minimal requirements for the arguments to clusterFunction
(described below). Additional arguments can be passed via the clusterArgs
function, as in subsampleClustering
.
cluster01
is intended for clustering algorithms that assume the elements of the input matrix $D$ are in $[0,1]$ and use the value of the elements of $D$ to determine the clusters (i.e. the clustering algorithm doesn't require an input $K$ to define the number of clusters). The intended use-case here is when $D$ is based on subsampling the samples and clustering them and so that $D$ is a co-occurence matrix.
If the clusterFunction
is of type 01
, then the following additional arguments can be passed to clusterD
: alpha
. alpha
is a required parameter for 01
arguments and is by default set to 0.01.
Built-in clustering functions for '01' type algorithms for include 'tight' and 'hierarchical', both of which search for indices of samples $\mathcal{I}$ of samples for which the submatrix $D[\mathcal{I},\mathcal{I}]$ has values close to $1$ in order to determine a cluster. In this way, instead of requiring the number of clusters, there is a tuning parameter $\alpha\in(0,1)$ which determines if the $D[\mathcal{I},\mathcal{I}]$ values are sufficiently close to $1$.
clusterFunction
should be a function that takes (as a minimum) an argument D
and alpha
. Additional arguments can be present and passed to clusterFunction
via the clusterArgs
option of clusterD
. The output of clusterFunction
must be a list with each element of the list corresponding to a cluster, and the elements of the list containing the indices of the samples that are in the cluster. The list is expected to be in order of 'best clusters' (as defined by the clusterFunction
), with first being the best and last being worst.
Built-in 01 Methods: Setting clusterFunction="tight"
refers to the method of finding clusters from a subsampling matrix given in the tight algorithm of Tsang and Wong. The code is copied from the tight.clust
package (note that the description in the paper is vague, see code to see how it works). The function starts by finding a core cluster of samples which have similarity 1, and then adding samples to that core cluster that all have similarity greater than or equal to 1-alpha
. Arguments for the tight method are minSize.core
(default=2), which sets the minimimum number of samples that form a core
cluster -- i.e. how many samples must have similarity exactly 1 to start building a cluster.
Setting clusterFunction="hierarchical01"
refers to running the hclust
algorithm on D
and transversing down the hierarchical tree until getting a block of samples $\mathcal{I}$ such that $f(D[\mathcal{I},\mathcal{I}])$ is greater than or equal to 1-alpha
. Arguments that can be passed to 'hierarchical' are evalClusterMethod
which determines the function $f$ -- i.e. how to summarize the values of $D[\mathcal{I},\mathcal{I}]$ in order to determine if the block of samples is sufficiently similar. If evalClusterMethod="minimum"
(default) takes as $f$ simply the minimum of $D[\mathcal{I},\mathcal{I}]$ and requires it to be greater than or equal to 1-alpha to be a cluster. evalClusterMethod="average"
takes $f$ to be the minimum of the row means of $D[\mathcal{I},\mathcal{I}]$. Arguments to hclust
can also be passed via clusterArgs
to control the hierarchical clustering of D
.
clusterK
is intended for algorithms that expect the user to give a value $K$ giving the number of clusters. This function then includes options appropriate for such algorithms, such as searching over a range of values $K$ and returning the 'best' cluster, or removing samples with low 'silhouette' values.
If the clusterFunction
is of type K
, then additional arguments can be passed to clusterD
: k
,findBestK
, removeSil
,kRange
,silCutoff
(see below for description)
For clusterK
functions, clusterFunction
should be a function that takes as a minimum an argument 'D' and 'k'. The output must be a list similar to that of partition.object
of cluster
package. Specifically, an element clustering
which gives the vector of clusters; and an argument silinfo
like that of the partition.object
-- that is a list with silhouette values.
Whether these silhouette values are actually silhouette values is up to the clusterFunction
, but they will be used in the following way. If findBestK=TRUE
, silinfo$avg.width will be used to pick the best $K$ and the set of values of $K$ that will be searched over is given by kRange
. If removeSil=TRUE
, then silinfo$widths[,"sil_width"]
will be used to exclude samples whose value of silinfo$widths[,"sil_width"]
is less than silCutoff
. In addition, clusters will be ordered by the average of the values silinfo$widths[,"sil_width"]
in each cluster (after removing poorly clustered samples, if removeSil=TRUE
).
Built-in clusterK methods: clusterFunction="pam"
performs pam
clustering on the input D
using pam
in the cluster
package. Arguments to pam
can be passed via clusterArgs
, except for the arguments x
and k
which are given by D
and k
directly in the call to clusterD
.
The clusterSingle
function also provides the ability to do sequential clustering, namely clustering the data, removing the cluster, and reclustering the remaining data. Since clustering algorithms can be sensitive to what set of samples are being used, this can have an impact on the clustering results. Furthermore, it can avoid the question of setting the number of clusters k, and instead find the number of clusters as when the sequential strategy doesn't find any more clusters (or enough samples) to cluster. Of course the stopping rules themselves requires tuning parameters, so is only a gain if the stopping rules are more intuitive or defensible than picking the number of clusters.
The sequential strategy that is currently implemented follows the 'tight' algorithm of XXX[add citation], though we have modified its implementation to be more appropriate to clustering of samples, rather than the gene clustering setting of the original tight algorithm, as well as allowing for many more choices.
In short, the sequential strategy iterates the basic clustering strategy described above, continually increasing the k
parameter of subsampleClustering until there exists a stable cluster when $k$ is increased to $k+1$. A stable cluster is defined as finding one cluster from the clusters using $k$ whose membership is similar enough to that of one of the clusters found for $k-1$, where 'similar' enough is defined by the parameter beta
,
$$\text{Put equation here for similarity}$$
Once such a cluster is found, the samples in the cluster are removed, and the process of finding another stable is restarted.
The following image represents how the sequential strategy works:
#results = 'asis'} #cat('\n<img src="flowchartTop.png" width="300" height="300" />\n') yadj<-0.05 xadj<-yadj require(diagram) par(mar=c(.1,.1,.1,1)) openplotmat() ##### #make node matrix ##### elpos <- coordinates (c(1,1,rep(2,4),4,4,4,8)) rnames<-c("1:clusterSingle","2:sequential","3:Basic Clustering\n Routine","4:seqCluster","remove","5:Set k.start=k0\nn=No.samples","remove","6:Iterate Basic\nClustering Routine\nk=k.start,kstart+1,...","remove","7:Break if", #"remove","remove","8:k=k.max &\n no similarity < beta", "9:similarity of\n(k,k-1)<beta", "remove","remove","10:Stop","11:Cluster Found\nRemove Samples","remove","remove","remove","12:Reset n","remove","remove","remove","13:n<remain.n",rep("remove",6),"14:Stop","15:Restart with\nk.start=k.start-1\n(but no less than k.min)") row.names(elpos)<-rnames elpos<-elpos[-which(row.names(elpos)=="remove"),] elpos[,1]<-elpos[,1]-min(elpos[,1])*.4 elpos["14:Stop",1]<-elpos["14:Stop",1]-.1 typeBox<-c("1:clusterSingle"="function","2:sequential"="arg","3:Basic Clustering\n Routine"="description","4:seqCluster"="function","5:Set k.start=k0\nn=No.samples"="description","6:Iterate Basic\nClustering Routine\nk=k.start,kstart+1,..."="description","7:Break if"="break", #"8:k=k.max &\n no similarity < beta"="description", "9:similarity of\n(k,k-1)<beta"="description" , "10:Stop"="stop","11:Cluster Found\nRemove Samples"="description", "12:Reset n"="description","13:n<remain.n"="arg","14:Stop"="stop","15:Restart with\nk.start=k.start-1\n(but no less than k.min)"="description") typeBox<-typeBox[match(row.names(elpos),names(typeBox))] ##### #make edge matrix ##### fromtoChar<-rbind(c("1:clusterSingle","2:sequential"),c("2:sequential","3:Basic Clustering\n Routine"),c("2:sequential","4:seqCluster"), c("4:seqCluster","5:Set k.start=k0\nn=No.samples"),c("5:Set k.start=k0\nn=No.samples","6:Iterate Basic\nClustering Routine\nk=k.start,kstart+1,..."),c("6:Iterate Basic\nClustering Routine\nk=k.start,kstart+1,...","7:Break if"), c("7:Break if","10:Stop"), c("7:Break if","11:Cluster Found\nRemove Samples"),c("11:Cluster Found\nRemove Samples","12:Reset n"),c("12:Reset n","13:n<remain.n"),c("13:n<remain.n","14:Stop"),c("13:n<remain.n","15:Restart with\nk.start=k.start-1\n(but no less than k.min)") ) #c("7:Break if","8:k=k.max &\n no similarity < beta"),c("7:Break if","9:similarity of\n(k,k-1)<beta"),c("8:k=k.max &\n no similarity < beta","10:Stop"),c("9:similarity of\n(k,k-1)<beta","11:Cluster Found\nRemove Samples"), fromto <- cbind(match(fromtoChar[,1],row.names(elpos)),match(fromtoChar[,2],row.names(elpos))) row.names(fromto)<-c(NA,"FALSE","TRUE",rep(NA,3),"k=k.max &\n no similarity < beta","similarity of\n(k,k-1)<beta",rep(NA,2),"TRUE","FALSE") ############ ###Draw Arrows: ############ for (i in 1:nrow(fromto)){ processArrows(elpos[fromto[i, 1], ],elpos[fromto[i, 2], ],row.names(fromto)[i]) } curvedarrow(from = elpos["15:Restart with\nk.start=k.start-1\n(but no less than k.min)", ], to = elpos["5:Set k.start=k0\nn=No.samples", ]+c(.05,0), lty = 1, lwd=5, lcol = 2,arr.pos=0.8,curve=0.5,arr.lwd=8,arr.length=0.6) selfarrow(pos=elpos["6:Iterate Basic\nClustering Routine\nk=k.start,kstart+1,...", ]+c(.1,0), path="R", lty = 1, lwd=2, lcol = 2,arr.pos=.2,curve=c(0.1,0.05)) ########### ## Draw boxes ########### nicenames<-sapply(strsplit(row.names(elpos),":"),.subset2,2) for(i in 1:nrow(elpos)){ processBoxes(typeBox[i],nicenames[i]) }
clusterMany
)clusterMany
is a function for running many different combinations of parameters or different datasets. It's main goal is to make it easy to running many different parameter choices and reduce the coding overhead for doing so. clusterMany
calls clusterSingle
using lapply
(or mclapply
if parallelizing the computation). Example uses of clusterMany
would be to run the same clustering over data sets representing different number of dimensions in a PCA dimensionality reduction, or to run many different choices of alpha
in clusterD clustering algorithm, or to compare clustering based on different algorithms -- or all of the above!
clusterMany
basically takes all combinations of all the user defined parameters and runs clusterSingle
across these combinations (basically expand.grid
). Along the way, the function removes some obviously irrelevant combinations (e.g. setting combinations that vary only in alpha
and findBestK
are irrelevant since those are options for entirely different clustering algorithms and various combinations of them will never be used).
The user can choose run=FALSE
in the call to clusterMany
to skip actual evaluation of the clusterings, and instead return just a matrix providing the set of parameter combinations that are implied by the user's call to clusterMany
. This allows the user, for example, to see how many different combinations their call implies! Or how many of them involve sequential=TRUE
or subsample=TRUE
. It also allows the user to see what the colnames of the output they would get if they ran clusterMany
, which can aid further coding while the computationally expensive command is being run. Setting run=FALSE
also allows the user to prune some calls out (see below). Setting run=FALSE
is recommended first step before launching computationally expensive clustering runs over many parameters.
clusterMany
does allow for some mild tinkering with the parameters via the argument paramMatrix
where the user can provide a matrix of parameter choices. There are very few checks of this input and the function is not robust to invalid choices! The intended use-case for the paramMatrix
argument is that a user does not want to run all combinations of their choices -- only a subset of them -- yet still wants to make use of the clusterMany
function to avoid additional coding. The user can run clusterMany
with run=FALSE
to get the internally-created paramMatrix
. Then the user can delete some of the combinations simply by deleting some rows of the full paramMatrix
and feed the pruned matrix back into clusterMany
.
Users can also change the parameters and try to construct their own combinations of parameters -- but should do so at their own risk, because there are very few checks.
clusterMany
However, there can be sets of parameter choices that are not realizable using clusterMany
, even with defining your own paramMatrix
. clusterMany
makes some choices and interpretations of the parameters that the user cannot override. A current example is varying the $K$ used for clustering the subsampled data when subsample=TRUE
. Individual calls to clusterSingle
can set this $K$ via the subsampleArgs=list("k"=...)
(assuming that sequential=FALSE
). However, clusterMany
has for simplicity a single argument ks
that means different parameters in different contexts. In cases where sequential=TRUE
, ks
defines k0
argument of seqCluster
, which therefore also sets the $K$ for clustering subsampled data when subsample=TRUE
. In cases where findBestK=TRUE
(for the clusterD algorithm) then ks
also defines kRanges
, if kRanges
is not already defined by the user in mainClusterArgs
. For cases where findBestK=TRUE
and sequential=FALSE
and subsample=TRUE
, then $K$ for clustering of subsampled data MUST be passed via the argument subsampleArgs
. And if findBestK=FALSE
, then the ks
argument defines both the $K$ for clustering of subsampled data and the $K$ used for clustering the resulting co-ocurrance matrix $D$ (overriding any user specification of either of those parameters via mainClusterArgs
or subsampleArgs
).
As the above example makes clear, clusterMany
is a convenience wrapper that chooses simplicity in the input parameters over fine specification by the users, and in doing so makes subtle choices for the user that are deemed reasonable. It is intended to let the user explore parameters painlessly, but for finer control the user needs to write their own wrapper around clusterSingle
.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.