Create a matrix of clustering across values of parameters

Description

Given a range of parameters, this function returns a matrix with the clustering of the samples across that range, which can be passed to plotClusters for visualization.

Usage

## S4 method for signature 'matrix'
clusterMany(x, dimReduce = "none", nVarDims = NA,
  nPCADims = NA, transFun = NULL, isCount = FALSE, ...)

## S4 method for signature 'list'
clusterMany(x, ks = NA, clusterFunction, alphas = 0.1,
  findBestK = FALSE, sequential = FALSE, removeSil = FALSE,
  subsample = FALSE, silCutoff = 0, distFunction = NA, betas = 0.9,
  minSizes = 1, verbose = FALSE, clusterDArgs = NULL,
  subsampleArgs = NULL, seqArgs = NULL, ncores = 1, random.seed = NULL,
  run = TRUE, ...)

## S4 method for signature 'ClusterExperiment'
clusterMany(x, dimReduce = "none",
  nVarDims = NA, nPCADims = NA, eraseOld = FALSE, ...)

## S4 method for signature 'SummarizedExperiment'
clusterMany(x, dimReduce = "none",
  nVarDims = NA, nPCADims = NA, transFun = NULL, isCount = FALSE, ...)

Arguments

x

the data on which to run the clustering. Can be: a matrix (with genes in rows), a list of datasets over which the clusterings should be run, a SummarizedExperiment object, or a ClusterExperiment object.

dimReduce

character. A character string identifying what type of dimensionality reduction to perform before clustering. Options are "none", "PCA", "var", "cv", and "mad". See transform for more details.

nVarDims

vector of the number of the most variable features to keep (when "var", "cv", or "mad" is identified in dimReduce). If NA is included, then the full dataset will also be included.

nPCADims

vector of the number of PCs to use (when 'PCA' is identified in dimReduce). If NA is included, then the full dataset will also be included.

transFun

function. A function used to transform the input data matrix before clustering.

isCount

logical. Whether the data are in counts, in which case the default transFun argument is set as log2(x+1). This is simply a convenience to the user, and can be overridden by giving an explicit function to transFun.

...

For signature list, arguments to be passed on to mclapply (if ncores>1). For all the other signatures, arguments to be passed to the method for signature list.

ks

the range of k values (see details for meaning for different choices).

clusterFunction

function used for the clustering. Note that unlike in clusterSingle, this must be a character vector of pre-defined clustering techniques provided by clusterSingle, and cannot be a user-defined function. Current functions are "tight", "hierarchical01", "hierarchicalK", and "pam".

alphas

values of alpha to be tried. Only used for clusterFunctions of type '01' (either 'tight' or 'hierarchical01'). Determines tightness required in creating clusters from the dissimilarity matrix. Takes on values in [0,1]. See clusterD.

findBestK

logical, whether to find the best K based on average silhouette width (only used if clusterFunction is of type "K").

sequential

logical, whether to use the sequential strategy (see details of seqCluster).

removeSil

logical, whether to remove samples whose silhouette width is below silCutoff (only used if clusterFunction is of type "K").

subsample

logical as to whether to subsample via subsampleClustering to get the distance matrix at each iteration; otherwise the distance function will be determined by argument distFunction passed in clusterDArgs.

silCutoff

Requirement on minimum silhouette width to be included in cluster (only if removeSil=TRUE).

distFunction

a vector of character strings that are the names of distance functions found in the global environment. See the help pages of clusterD for details about the required format of distance functions. Currently, this distance function must be applicable for all clusterFunction types tried. Therefore, it is not possible to intermix type "K" and type "01" algorithms if you also give distances to evaluate via distFunction unless all distances give 0-1 values for the distance (and hence are possible for both type "01" and "K" algorithms).

betas

values of beta to be tried in sequential steps. Only used for sequential=TRUE. Determines the similarity between two clusters required in order to deem the cluster stable. Takes on values in [0,1]. See seqCluster.

minSizes

the minimum size required for a cluster (in clusterD). Clusters smaller than this are not kept and their samples are left unassigned.

verbose

logical. If TRUE it will print informative messages.

clusterDArgs

list of additional arguments to be passed to clusterD.

subsampleArgs

list of arguments to be passed to subsampleClustering.

seqArgs

list of additional arguments to be passed to seqCluster.

ncores

the number of threads to use. When ncores>1, the clusterings are run in parallel via mclapply.

random.seed

a value to set seed before each run of clusterSingle (so that all of the runs are run on the same subsample of the data). Note, if 'random.seed' is set, argument 'ncores' should NOT be passed via subsampleArgs; instead set the argument 'ncores' of clusterMany directly (which is preferred for improving speed anyway).

run

logical. If FALSE, the clustering is not run; instead, the function returns the matrix of parameters that would be run, so that the user can inspect it (its rownames are the column names that the clMat object would have if run=TRUE). Even if run=FALSE, however, the function will create the dimensionality reductions of the data indicated by the user input.

eraseOld

logical. Only relevant if input x is of class ClusterExperiment. If TRUE, will erase existing workflow results (clusterMany as well as mergeClusters and combineMany). If FALSE, existing workflow results will have "_i" added to the clusterTypes value, where i is one more than the largest such existing workflow clusterTypes.
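As a rough illustration of the count transformation that the isCount argument above refers to, the following base R sketch applies log2(x+1) to a small made-up count matrix. The names `counts` and `logTransform` are hypothetical and are not part of the package; the data are invented for the example.

```r
# Hypothetical toy count matrix: 3 genes (rows) by 4 samples (columns)
counts <- matrix(c( 0, 1, 3, 7,
                   15, 0, 1, 3,
                    7, 7, 0, 1),
                 nrow = 3, byrow = TRUE)

# The transformation described for isCount=TRUE: log2(x + 1)
logTransform <- function(x) log2(x + 1)

transformed <- logTransform(counts)
print(transformed)
```

A function of this form is the kind of object the transFun argument expects; supplying one explicitly would take the place of the isCount convenience default.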

Details

While the function allows for multiple values of clusterFunction, the code does not reuse the same subsampling matrix and try different clusterFunctions on it. If sequential=TRUE, different clusterFunctions will create different sets of data to subsample, so reuse is not possible; if sequential=FALSE, we have not implemented functionality for this reuse. Setting the random.seed value, however, should mean that the subsampled matrix is the same for each, but there is no gain in computational cost (i.e. each subsampled co-occurrence matrix is recalculated for each set of parameters).
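The effect of fixing the seed before each run can be sketched in base R. This is only an illustration of the general mechanism (resetting the seed reproduces the same random subsample), not the package's internal code; the names `n`, `firstDraw`, and `secondDraw` and the subsample size are assumptions made for the example.

```r
# Hypothetical setup: 20 samples, each run subsamples 14 of them
n <- 20

# Run 1: set the seed, then draw a subsample of sample indices
set.seed(48120)
firstDraw <- sample(n, size = 14)

# Run 2: resetting the same seed reproduces the same subsample,
# even though the draw itself is recomputed from scratch
set.seed(48120)
secondDraw <- sample(n, size = 14)

identical(firstDraw, secondDraw)
```

The draw is repeated in full for the second run, which mirrors the point above: the subsample is the same, but nothing is saved computationally.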

The argument 'ks' is interpreted differently for different choices of the other parameters. When sequential=TRUE, ks defines the argument k0 of seqCluster. Otherwise, the 'ks' values are set in both subsampleArgs[["k"]] and clusterDArgs[["k"]], which are passed to subsampleClustering and clusterD, respectively. Passing the value via subsampleArgs[["k"]] only has an effect if 'subsample=TRUE'. Similarly, passing it via clusterDArgs[["k"]] only has an effect when the clusterFunction argument includes a clustering algorithm of type "K". When "findBestK=TRUE", ks also defines the kRange argument of clusterD unless kRange is specified by the user via clusterDArgs; note this means that the default option of setting a kRange that depends on the input k (see clusterD) is not available in clusterMany.

If the input is a ClusterExperiment object, any existing orderSamples, coClustering, or dendrogram slots will be retained.

Value

If run=TRUE and the input is a matrix, a SummarizedExperiment object, or a ClusterExperiment object, the function returns a ClusterExperiment object in which the results are stored as clusterings with clusterTypes "clusterMany". Depending on the eraseOld argument above, this will either delete existing clusterings of this type or change their clusterTypes. The first clustering is arbitrarily set as the primaryClusteringIndex.

If run=TRUE and the input is a list of data sets, a list with the following objects:

  • clMat a matrix with each column corresponding to a clustering and each row to a sample.

  • clusterInfo a list with information regarding clustering results (entries are only relevant for those clusterings with sequential=TRUE)

  • paramMatrix a matrix giving the parameters of each clustering, where each column is a possible parameter set by the user and passed to clusterSingle and each row of paramMatrix corresponds to a clustering in clMat

  • clusterDArgs a list of (possibly modified) arguments to clusterDArgs

  • seqArgs a list of (possibly modified) arguments to seqArgs

  • subsampleArgs a list of (possibly modified) arguments to subsampleArgs

If run=FALSE, a list similar to that described above is returned, but without the clustering results.
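To get a feel for how many parameter combinations a call implies before running anything, the following base R sketch builds a full cross product of the parameter values used in the Examples section with expand.grid. This rests on the assumption that every combination is kept, so it should be read as an upper bound: clusterMany itself may drop or collapse combinations, and the paramMatrix returned with run=FALSE remains the authoritative listing. The name `paramGrid` is hypothetical.

```r
# Full cross product of the parameter values from the Examples section.
# Assumption: every combination is kept; clusterMany may keep fewer.
paramGrid <- expand.grid(
  nPCADims  = c(5, 10, 50),
  k         = 2:4,
  findBestK = c(TRUE, FALSE),
  removeSil = c(TRUE, FALSE)
)

# Number of candidate parameter sets (one row per set)
nrow(paramGrid)

# First few candidate parameter sets
head(paramGrid)
```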

Examples

data(simData)

# Example: clustering using pam with different numbers of PCA dimensions,
# different values of k, and with or without removing samples that have
# negative silhouette values.
# First check how many and which runs these choices imply:
checkParams <- clusterMany(simData, nPCADims=c(5,10,50), dimReduce="PCA",
  clusterFunction="pam", ks=2:4, findBestK=c(TRUE,FALSE),
  removeSil=c(TRUE,FALSE), run=FALSE)
print(head(checkParams$paramMatrix))

# Now actually run it
cl <- clusterMany(simData, nPCADims=c(5,10,50), dimReduce="PCA",
  clusterFunction="pam", ks=2:4, findBestK=c(TRUE,FALSE),
  removeSil=c(TRUE,FALSE))
print(cl)
head(colnames(clusterMatrix(cl)))

#make names shorter for plotting
clMat <- clusterMatrix(cl)
colnames(clMat) <- gsub("TRUE", "T", colnames(clMat))
colnames(clMat) <- gsub("FALSE", "F", colnames(clMat))
colnames(clMat) <- gsub("k=NA,", "", colnames(clMat))

par(mar=c(2, 10, 1, 1))
plotClusters(clMat, axisLine=-2)


## Not run: 
#following code takes around 1+ minutes to run because of the subsampling
#that is redone each time:
system.time(clusterTrack <- clusterMany(simData, ks=2:15,
alphas=c(0.1,0.2,0.3), findBestK=c(TRUE,FALSE), sequential=c(FALSE),
subsample=c(FALSE), removeSil=c(TRUE), clusterFunction="pam",
clusterDArgs=list(minSize=5, kRange=2:15), ncores=1, random.seed=48120))

## End(Not run)