clusterMany: Create a matrix of clustering across values of parameters

clusterManyR Documentation

Create a matrix of clustering across values of parameters

Description

Given a range of parameters, this function will return a matrix with the clustering of the samples across the range, which can be passed to plotClusters for visualization.

Usage

## S4 method for signature 'matrixOrHDF5'
clusterMany(
  x,
  reduceMethod = "none",
  nReducedDims = NA,
  transFun = NULL,
  isCount = FALSE,
  ...
)

## S4 method for signature 'SingleCellExperiment'
clusterMany(
  x,
  ks = NA,
  clusterFunction,
  reduceMethod = "none",
  nFilterDims = defaultNDims(x, reduceMethod, type = "filterStats"),
  nReducedDims = defaultNDims(x, reduceMethod, type = "reducedDims"),
  alphas = 0.1,
  findBestK = FALSE,
  sequential = FALSE,
  removeSil = FALSE,
  subsample = FALSE,
  silCutoff = 0,
  distFunction = NA,
  betas = 0.9,
  minSizes = 1,
  transFun = NULL,
  isCount = FALSE,
  verbose = TRUE,
  parameterWarnings = FALSE,
  mainClusterArgs = NULL,
  subsampleArgs = NULL,
  seqArgs = NULL,
  whichAssay = 1,
  makeMissingDiss = if (ncol(x) < 1000) TRUE else FALSE,
  ncores = 1,
  random.seed = NULL,
  run = TRUE,
  ...
)

## S4 method for signature 'ClusterExperiment'
clusterMany(
  x,
  reduceMethod = "none",
  nFilterDims = defaultNDims(x, reduceMethod, type = "filterStats"),
  nReducedDims = defaultNDims(x, reduceMethod, type = "reducedDims"),
  eraseOld = FALSE,
  ...
)

## S4 method for signature 'SummarizedExperiment'
clusterMany(x, ...)

## S4 method for signature 'data.frame'
clusterMany(x, ...)

Arguments

x

the data matrix on which to run the clustering. Can be object of the following classes: matrix (with genes in rows), SummarizedExperiment, SingleCellExperiment or ClusterExperiment.

reduceMethod

character A character identifying what type of dimensionality reduction to perform before clustering. Options are 1) "none", 2) one of listBuiltInReducedDims() or listBuiltInFitlerStats OR 3) stored filtering or reducedDim values in the object.

nReducedDims

vector of the number of dimensions to use (when reduceMethod gives a dimensionality reduction method).

transFun

a transformation function to be applied to the data. If the transformation applied to the data creates an error or NA values, then the function will throw an error. If object is of class ClusterExperiment, the stored transformation will be used and giving this parameter will result in an error.

isCount

if transFun=NULL, then isCount=TRUE will determine the transformation as defined by function(x){log2(x+1)}, and isCount=FALSE will give a transformation function function(x){x}. Ignored if transFun=NULL. If object is of class ClusterExperiment, the stored transformation will be used and giving this parameter will result in an error.

...

For signature matrix, arguments to be passed on to mclapply (if ncores>1). For all the other signatures, arguments to be passed to the method for signature matrix.

ks

the range of k values (see details for the meaning of k for different choices of other parameters).

clusterFunction

function used for the clustering. This must be either 1) a character vector of built-in clustering techniques, or 2) a named list of ClusterFunction objects. Current functions can be found by typing listBuiltInFunctions() into the command-line.

nFilterDims

vector of the number of the most variable features to keep (when "var", "abscv", or "mad" is identified in reduceMethod).

alphas

values of alpha to be tried. Only used for clusterFunctions of type '01'. Determines tightness required in creating clusters from the dissimilarity matrix. Takes on values in [0,1]. See documentation of ClusterFunction.

findBestK

logical, whether should find best K based on average silhouette width (only used when clusterFunction of type "K").

sequential

logical whether to use the sequential strategy (see details of seqCluster). Can be used in combination with subsample=TRUE or FALSE.

removeSil

logical as to whether remove when silhouette < silCutoff (only used if clusterFunction of type "K")

subsample

logical as to whether to subsample via subsampleClustering. If TRUE, clustering in mainClustering step is done on the co-occurance between clusterings in the subsampled clustering results. If FALSE, the mainClustering step will be run directly on x/diss

silCutoff

Requirement on minimum silhouette width to be included in cluster (only for combinations where removeSil=TRUE).

distFunction

a vector of character strings that are the names of distance functions found in the global environment. See the help pages of clusterSingle for details about the required format of distance functions. Currently, this distance function must be applicable for all clusterFunction types tried. Therefore, it is not possible in clusterMany to intermix type "K" and type "01" algorithms if you also give distances to evaluate via distFunction unless all distances give 0-1 values for the distance (and hence are possible for both type "01" and "K" algorithms).

betas

values of beta to be tried in sequential steps. Only used for sequential=TRUE. Determines the similarity between two clusters required in order to deem the cluster stable. Takes on values in [0,1]. See documentation of seqCluster.

minSizes

the minimimum size required for a cluster (in the mainClustering step). Clusters smaller than this are not kept and samples are left unassigned.

verbose

logical. If TRUE it will print informative messages.

parameterWarnings

logical, as to whether warnings and comments from checking the validity of the parameter combinations should be printed.

mainClusterArgs

list of arguments to be passed for the mainClustering step, see help pages of mainClustering.

subsampleArgs

list of arguments to be passed to the subsampling step (if subsample=TRUE), see help pages of subsampleClustering.

seqArgs

list of arguments to be passed to seqCluster.

whichAssay

numeric or character specifying which assay to use. See assay for details.

makeMissingDiss

logical. Whether to calculate necessary distance matrices needed when input is not "diss". If TRUE, then when a clustering function calls for a inputType "diss", but the given matrix is of type "X", the function will calculate a distance function. A dissimilarity matrix will also be calculated if a post-processing argument like findBestK or removeSil is chosen, since these rely on calcualting silhouette widths from distances.

ncores

the number of threads

random.seed

a value to set seed before each run of clusterSingle (so that all of the runs are run on the same subsample of the data). Note, if 'random.seed' is set, argument 'ncores' should NOT be passed via subsampleArgs; instead set the argument 'ncores' of clusterMany directly (which is preferred for improving speed anyway).

run

logical. If FALSE, doesn't run clustering, but just returns matrix of parameters that will be run, for the purpose of inspection by user (with rownames equal to the names of the resulting column names of clMat object that would be returned if run=TRUE). Even if run=FALSE, however, the function will create the dimensionality reductions of the data indicated by the user input.

eraseOld

logical. Only relevant if input x is of class ClusterExperiment. If TRUE, will erase existing workflow results (clusterMany as well as mergeClusters and makeConsensus). If FALSE, existing workflow results will have "_i" added to the clusterTypes value, where i is one more than the largest such existing workflow clusterTypes.

Details

Some combinations of these parameters are not feasible. See the documentation of clusterSingle for important information on how these parameter choices interact.

While the function allows for multiple values of clusterFunction, the code does not reuse the same subsampling matrix and try different clusterFunctions on it. This is because if sequential=TRUE, different subsample clusterFunctions will create different sets of data to subsample so it is not possible; if sequential=FALSE, we have not implemented functionality for this reuse. Setting the random.seed value, however, should mean that the subsampled matrix is the same for each, but there is no gain in computational complexity (i.e. each subsampled co-occurence matrix is recalculated for each set of parameters).

The argument ks is interpreted differently for different choices of the other parameters. When/if sequential=TRUE, ks defines the argument k0 of seqCluster. Otherwise, ks values are the k values for both the mainClustering and subsampling step (i.e. assigned to the subsampleArgs and mainClusterArgs that are passed to mainClustering and subsampleClustering unless k is set appropriately in subsampleArgs. The passing of these arguments via subsampleArgs will only have an effect if 'subsample=TRUE'. Similarly, the passing of mainClusterArgs[["k"]] will only have an effect when the clusterFunction argument includes a clustering algorithm of type "K". When/if "findBestK=TRUE", ks also defines the kRange argument of mainClustering unless kRange is specified by the user via the mainClusterArgs; note this means that the default option of setting kRange that depends on the input k (see mainClustering) is not available in clusterMany, only in clusterSingle.

If the input is a ClusterExperiment object, current implementation is that existing orderSamples,coClustering or the many dendrogram slots will be retained.

If run=FALSE, the function will still calculate reduced dimensions or filter statistics if not already calculated and saved in the object. Moreover the results of these calculations will not be save. Therefore, if these steps are lengthy for large datasets it is recommended to do them before calling the function.

The given reduceMethod values must either be all precalculated filtering/dimensionality reduction stored in the appropriate location, or must all be character values giving a built-in filtering/dimensionality reduction methods to be calculated. If some of the filtering/dimensionality methods are already calculated and stored, but not all, then they will all be recalculated (and if they are not all built-in methods, this will give an error). So to save computational time with pre-calculated dimensionality reduction, the user must make sure they are all precalculated. Also, user-defined values (i.e. not built-in functions) cannot be mixed with built-in functions unless they have already been precalculated (see makeFilterStats or makeReducedDims).

Value

If run=TRUE will return a ClusterExperiment object, where the results are stored as clusterings with clusterTypes clusterMany. Depending on eraseOld argument above, this will either delete existing such objects, or change the clusterTypes of existing objects. See argument eraseOld above. Arbitrarily the first clustering is set as the primaryClusteringIndex.

If run=FALSE a list with elements:

  • paramMatrix a matrix giving the parameters of each clustering, where each column is a possible parameter set by the user and passed to clusterSingle and each row of paramMatrix corresponds to a clustering in clMat

  • mainClusterArgs a list of (possibly modified) arguments to mainClusterArgs

  • seqArgs=seqArgsa list of (possibly modified) arguments to seqArgs

  • subsampleArgsa list of (possibly modified) arguments to subsampleArgs

Examples

## Not run: 
data(simData)

#Example: clustering using pam with different dimensions of pca and different
#k and whether remove negative silhouette values
#check how many and what runs user choices will imply:
checkParams <- clusterMany(simData,reduceMethod="PCA", makeMissingDiss=TRUE,
   nReducedDims=c(5,10,50), clusterFunction="pam", isCount=FALSE,
   ks=2:4,findBestK=c(TRUE,FALSE),removeSil=c(TRUE,FALSE),run=FALSE)
print(head(checkParams$paramMatrix))

#Now actually run it
cl <- clusterMany(simData,reduceMethod="PCA", nReducedDims=c(5,10,50),  isCount=FALSE,
   clusterFunction="pam",ks=2:4,findBestK=c(TRUE,FALSE),makeMissingDiss=TRUE, 
   removeSil=c(TRUE,FALSE))
print(cl)
head(colnames(clusterMatrix(cl)))

#make names shorter for plotting
clNames <- clusterLabels(cl)
clNames <- gsub("TRUE", "T", clNames)
clNames <- gsub("FALSE", "F", clNames)
clNames <- gsub("k=NA,", "", clNames)

par(mar=c(2, 10, 1, 1))
plotClusters(cl, axisLine=-2,clusterLabels=clNames)


#following code takes around 1+ minutes to run because of the subsampling
#that is redone each time:
system.time(clusterTrack <- clusterMany(simData, ks=2:15,
    alphas=c(0.1,0.2,0.3), findBestK=c(TRUE,FALSE), sequential=c(FALSE),
    subsample=c(FALSE), removeSil=c(TRUE), clusterFunction="pam", 
    makeMissingDiss=TRUE,
     mainClusterArgs=list(minSize=5, kRange=2:15), ncores=1, random.seed=48120))

## End(Not run)


epurdom/clusterCells documentation built on April 28, 2024, 8:14 p.m.