clusterMany: Create a matrix of clustering across values of parameters

Description Usage Arguments Details Value Examples

Description

Given a range of parameters, this function will return a matrix with the clustering of the samples across the range, which can be passed to plotClusters for visualization.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
## S4 method for signature 'matrix'
clusterMany(x, dimReduce = "none", nVarDims = NA,
  nPCADims = NA, transFun = NULL, isCount = FALSE, ...)

## S4 method for signature 'list'
clusterMany(x, ks = NA, clusterFunction, alphas = 0.1,
  findBestK = FALSE, sequential = FALSE, removeSil = FALSE,
  subsample = FALSE, silCutoff = 0, distFunction = NA, betas = 0.9,
  minSizes = 1, verbose = FALSE, mainClusterArgs = NULL,
  subsampleArgs = NULL, seqArgs = NULL, ncores = 1, random.seed = NULL,
  run = TRUE, ...)

## S4 method for signature 'ClusterExperiment'
clusterMany(x, dimReduce = "none",
  nVarDims = NA, nPCADims = NA, eraseOld = FALSE, ...)

## S4 method for signature 'SummarizedExperiment'
clusterMany(x, dimReduce = "none",
  nVarDims = NA, nPCADims = NA, transFun = NULL, isCount = FALSE, ...)

## S4 method for signature 'data.frame'
clusterMany(x, ...)

Arguments

x

the data matrix on which to run the clustering. Can be: matrix (with genes in rows), a list of datasets overwhich the clusterings should be run, a SummarizedExperiment object, or a ClusterExperiment object.

dimReduce

character A character identifying what type of dimensionality reduction to perform before clustering. Options are "none","PCA", "var","cv", and "mad". See transform for more details.

nVarDims

vector of the number of the most variable features to keep (when "var", "cv", or "mad" is identified in dimReduce). If NA is included, then the full dataset will also be included.

nPCADims

vector of the number of PCs to use (when 'PCA' is identified in dimReduce). If NA is included, then the full dataset will also be included.

transFun

function A function to use to transform the input data matrix before clustering.

isCount

logical. Whether the data are in counts, in which case the default transFun argument is set as log2(x+1). This is simply a convenience to the user, and can be overridden by giving an explicit function to transFun.

...

For signature list, arguments to be passed on to mclapply (if ncores>1). For all the other signatures, arguments to be passed to the method for signature list.

ks

the range of k values (see details for the meaning of k for different choices of other parameters).

clusterFunction

function used for the clustering. Note that unlike in clusterSingle, this must be a character vector of pre-defined clustering techniques, and can not be a user-defined function. Current functions can be found by typing listBuiltInFunctions() into the command-line.

alphas

values of alpha to be tried. Only used for clusterFunctions of type '01'. Determines tightness required in creating clusters from the dissimilarity matrix. Takes on values in [0,1]. See documentation of ClusterFunction.

findBestK

logical, whether should find best K based on average silhouette width (only used when clusterFunction of type "K").

sequential

logical whether to use the sequential strategy (see details of seqCluster). Can be used in combination with subsample=TRUE or FALSE.

removeSil

logical as to whether remove when silhouette < silCutoff (only used if clusterFunction of type "K")

subsample

logical as to whether to subsample via subsampleClustering. If TRUE, clustering in mainClustering step is done on the co-occurance between clusterings in the subsampled clustering results. If FALSE, the mainClustering step will be run directly on x/diss

silCutoff

Requirement on minimum silhouette width to be included in cluster (only for combinations where removeSil=TRUE).

distFunction

a vector of character strings that are the names of distance functions found in the global environment. See the help pages of clusterSingle for details about the required format of distance functions. Currently, this distance function must be applicable for all clusterFunction types tried. Therefore, it is not possible in clusterMany to intermix type "K" and type "01" algorithms if you also give distances to evaluate via distFunction unless all distances give 0-1 values for the distance (and hence are possible for both type "01" and "K" algorithms).

betas

values of beta to be tried in sequential steps. Only used for sequential=TRUE. Determines the similarity between two clusters required in order to deem the cluster stable. Takes on values in [0,1]. See documentation of seqCluster.

minSizes

the minimimum size required for a cluster (in the mainClustering step). Clusters smaller than this are not kept and samples are left unassigned.

verbose

logical. If TRUE it will print informative messages.

mainClusterArgs

list of arguments to be passed for the mainClustering step, see help pages of mainClustering.

subsampleArgs

list of arguments to be passed to the subsampling step (if subsample=TRUE), see help pages of subsampleClustering.

seqArgs

list of arguments to be passed to seqCluster.

ncores

the number of threads

random.seed

a value to set seed before each run of clusterSingle (so that all of the runs are run on the same subsample of the data). Note, if 'random.seed' is set, argument 'ncores' should NOT be passed via subsampleArgs; instead set the argument 'ncores' of clusterMany directly (which is preferred for improving speed anyway).

run

logical. If FALSE, doesn't run clustering, but just returns matrix of parameters that will be run, for the purpose of inspection by user (with rownames equal to the names of the resulting column names of clMat object that would be returned if run=TRUE). Even if run=FALSE, however, the function will create the dimensionality reductions of the data indicated by the user input.

eraseOld

logical. Only relevant if input x is of class ClusterExperiment. If TRUE, will erase existing workflow results (clusterMany as well as mergeClusters and combineMany). If FALSE, existing workflow results will have "_i" added to the clusterTypes value, where i is one more than the largest such existing workflow clusterTypes.

Details

Some combinations of these parameters are not feasible. See the documentation of clusterSingle for important information on how these parameter choices interact.

While the function allows for multiple values of clusterFunction, the code does not reuse the same subsampling matrix and try different clusterFunctions on it. This is because if sequential=TRUE, different subsample clusterFunctions will create different sets of data to subsample so it is not possible; if sequential=FALSE, we have not implemented functionality for this reuse. Setting the random.seed value, however, should mean that the subsampled matrix is the same for each, but there is no gain in computational complexity (i.e. each subsampled co-occurence matrix is recalculated for each set of parameters).

The argument ks is interpreted differently for different choices of the other parameters. When/if sequential=TRUE, ks defines the argument k0 of seqCluster. Otherwise, ks values are the k values for both the mainClustering and subsampling step (i.e. assigned to the subsampleArgs and mainClusterArgs that are passed to mainClustering and subsampleClustering unless k is set appropriately in subsampleArgs. The passing of these arguments via subsampleArgs will only have an effect if 'subsample=TRUE'. Similarly, the passing of mainClusterArgs[["k"]] will only have an effect when the clusterFunction argument includes a clustering algorithm of type "K". When/if "findBestK=TRUE", ks also defines the kRange argument of mainClustering unless kRange is specified by the user via the mainClusterArgs; note this means that the default option of setting kRange that depends on the input k (see mainClustering) is not available in clusterMany, only in clusterSingle.

If the input is a ClusterExperiment object, current implementation is that existing orderSamples,coClustering or the many dendrogram slots will be retained.

Value

If run=TRUE and the input is either a matrix, a SummarizedExperiment object, or a ClusterExperiment object, will return a ClusterExperiment object, where the results are stored as clusterings with clusterTypes clusterMany. Depending on eraseOld argument above, this will either delete existing such objects, or change the clusterTypes of existing objects. See argument eraseOld above. Arbitrarily the first clustering is set as the primaryClusteringIndex.

If run=TRUE and the input is a list of data sets, a list with the following objects:

If run=FALSE a list similar to that described above, but without the clustering results.

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
data(simData)

#Example: clustering using pam with different dimensions of pca and different
#k and whether remove negative silhouette values
#check how many and what runs user choices will imply:
checkParams <- clusterMany(simData,nPCADims=c(5,10,50),  dimReduce="PCA",
clusterFunction="pam",
ks=2:4,findBestK=c(TRUE,FALSE),removeSil=c(TRUE,FALSE),run=FALSE)
print(head(checkParams$paramMatrix))

#Now actually run it
cl <- clusterMany(simData,nPCADims=c(5,10,50),  dimReduce="PCA",
clusterFunction="pam",ks=2:4,findBestK=c(TRUE,FALSE),removeSil=c(TRUE,FALSE))
print(cl)
head(colnames(clusterMatrix(cl)))

#make names shorter for plotting
clMat <- clusterMatrix(cl)
colnames(clMat) <- gsub("TRUE", "T", colnames(clMat))
colnames(clMat) <- gsub("FALSE", "F", colnames(clMat))
colnames(clMat) <- gsub("k=NA,", "", colnames(clMat))

par(mar=c(2, 10, 1, 1))
plotClusters(clMat, axisLine=-2)


## Not run: 
#following code takes around 1+ minutes to run because of the subsampling
#that is redone each time:
system.time(clusterTrack <- clusterMany(simData, ks=2:15,
alphas=c(0.1,0.2,0.3), findBestK=c(TRUE,FALSE), sequential=c(FALSE),
subsample=c(FALSE), removeSil=c(TRUE), clusterFunction="pam",
mainClusterArgs=list(minSize=5, kRange=2:15), ncores=1, random.seed=48120))

## End(Not run)

clusterExperiment documentation built on Nov. 17, 2017, 8:35 a.m.