clusterMany: Create a matrix of clustering across values of parameters
In epurdom/clusterCells: Compare Clusterings for Single-Cell Sequencing

clusterMany

R Documentation

Create a matrix of clustering across values of parameters

Description

Given a range of parameters, this function will return a matrix with the clustering of the samples across the range, which can be passed to plotClusters for visualization.

Usage

## S4 method for signature 'matrixOrHDF5'
clusterMany(
  x,
  reduceMethod = "none",
  nReducedDims = NA,
  transFun = NULL,
  isCount = FALSE,
  ...
)

## S4 method for signature 'SingleCellExperiment'
clusterMany(
  x,
  ks = NA,
  clusterFunction,
  reduceMethod = "none",
  nFilterDims = defaultNDims(x, reduceMethod, type = "filterStats"),
  nReducedDims = defaultNDims(x, reduceMethod, type = "reducedDims"),
  alphas = 0.1,
  findBestK = FALSE,
  sequential = FALSE,
  removeSil = FALSE,
  subsample = FALSE,
  silCutoff = 0,
  distFunction = NA,
  betas = 0.9,
  minSizes = 1,
  transFun = NULL,
  isCount = FALSE,
  verbose = TRUE,
  parameterWarnings = FALSE,
  mainClusterArgs = NULL,
  subsampleArgs = NULL,
  seqArgs = NULL,
  whichAssay = 1,
  makeMissingDiss = if (ncol(x) < 1000) TRUE else FALSE,
  ncores = 1,
  random.seed = NULL,
  run = TRUE,
  ...
)

## S4 method for signature 'ClusterExperiment'
clusterMany(
  x,
  reduceMethod = "none",
  nFilterDims = defaultNDims(x, reduceMethod, type = "filterStats"),
  nReducedDims = defaultNDims(x, reduceMethod, type = "reducedDims"),
  eraseOld = FALSE,
  ...
)

## S4 method for signature 'SummarizedExperiment'
clusterMany(x, ...)

## S4 method for signature 'data.frame'
clusterMany(x, ...)

Arguments

`x`	the data matrix on which to run the clustering. Can be an object of any of the following classes: matrix (with genes in rows), `SummarizedExperiment`, `SingleCellExperiment` or `ClusterExperiment`.
`reduceMethod`	a character vector identifying what type of dimensionality reduction to perform before clustering. Options are 1) "none", 2) one of `listBuiltInReducedDims()` or `listBuiltInFilterStats()`, or 3) stored filtering or reducedDim values in the object.
`nReducedDims`	vector of the number of dimensions to use (when `reduceMethod` gives a dimensionality reduction method).
`transFun`	a transformation function to be applied to the data. If the transformation applied to the data creates an error or NA values, then the function will throw an error. If object is of class `ClusterExperiment`, the stored transformation will be used and giving this parameter will result in an error.
`isCount`	if `transFun=NULL`, then `isCount=TRUE` will set the transformation to `function(x){log2(x+1)}`, and `isCount=FALSE` will give the identity transformation `function(x){x}`. Ignored if `transFun` is not NULL. If object is of class `ClusterExperiment`, the stored transformation will be used and giving this parameter will result in an error.
`...`	For signature `matrix`, arguments to be passed on to mclapply (if ncores>1). For all the other signatures, arguments to be passed to the method for signature `matrix`.
`ks`	the range of k values (see details for the meaning of `k` for different choices of other parameters).
`clusterFunction`	function used for the clustering. This must be either 1) a character vector of built-in clustering techniques, or 2) a named list of `ClusterFunction` objects. Current functions can be found by typing `listBuiltInFunctions()` into the command-line.
`nFilterDims`	vector of the number of the most variable features to keep (when "var", "abscv", or "mad" is identified in `reduceMethod`).
`alphas`	values of alpha to be tried. Only used for clusterFunctions of type '01'. Determines tightness required in creating clusters from the dissimilarity matrix. Takes on values in [0,1]. See documentation of `ClusterFunction`.
`findBestK`	logical, whether to find the best K based on average silhouette width (only used when clusterFunction is of type "K").
`sequential`	logical whether to use the sequential strategy (see details of `seqCluster`). Can be used in combination with `subsample=TRUE` or `FALSE`.
`removeSil`	logical as to whether to exclude from clusters those samples with silhouette width < `silCutoff` (only used if clusterFunction is of type "K").
`subsample`	logical as to whether to subsample via `subsampleClustering`. If TRUE, clustering in the mainClustering step is done on the co-occurrence between clusterings in the subsampled clustering results. If FALSE, the mainClustering step will be run directly on `x`/`diss`.
`silCutoff`	Requirement on minimum silhouette width to be included in cluster (only for combinations where removeSil=TRUE).
`distFunction`	a vector of character strings that are the names of distance functions found in the global environment. See the help pages of `clusterSingle` for details about the required format of distance functions. Currently, this distance function must be applicable for all clusterFunction types tried. Therefore, it is not possible in `clusterMany` to intermix type "K" and type "01" algorithms if you also give distances to evaluate via `distFunction` unless all distances give 0-1 values for the distance (and hence are possible for both type "01" and "K" algorithms).
`betas`	values of `beta` to be tried in sequential steps. Only used for `sequential=TRUE`. Determines the similarity between two clusters required in order to deem the cluster stable. Takes on values in [0,1]. See documentation of `seqCluster`.
`minSizes`	the minimum size required for a cluster (in the `mainClustering` step). Clusters smaller than this are not kept and samples are left unassigned.
`verbose`	logical. If TRUE it will print informative messages.
`parameterWarnings`	logical, as to whether warnings and comments from checking the validity of the parameter combinations should be printed.
`mainClusterArgs`	list of arguments to be passed for the mainClustering step, see help pages of `mainClustering`.
`subsampleArgs`	list of arguments to be passed to the subsampling step (if `subsample=TRUE`), see help pages of `subsampleClustering`.
`seqArgs`	list of arguments to be passed to `seqCluster`.
`whichAssay`	numeric or character specifying which assay to use. See `assay` for details.
`makeMissingDiss`	logical. Whether to calculate any needed distance matrices when the input is not "diss". If TRUE, then when a clustering function calls for an inputType "diss" but the given matrix is of type "X", the function will calculate the distance matrix. A dissimilarity matrix will also be calculated if a post-processing argument like `findBestK` or `removeSil` is chosen, since these rely on calculating silhouette widths from distances.
`ncores`	the number of threads
`random.seed`	a value to set seed before each run of clusterSingle (so that all of the runs are run on the same subsample of the data). Note, if 'random.seed' is set, argument 'ncores' should NOT be passed via subsampleArgs; instead set the argument 'ncores' of clusterMany directly (which is preferred for improving speed anyway).
`run`	logical. If FALSE, doesn't run the clustering, but just returns the matrix of parameters that would be run, for inspection by the user (its rownames are the column names that the clMat object would have if `run=TRUE`). Even if `run=FALSE`, however, the function will create the dimensionality reductions of the data indicated by the user input.
`eraseOld`	logical. Only relevant if input `x` is of class `ClusterExperiment`. If TRUE, will erase existing workflow results (clusterMany as well as mergeClusters and makeConsensus). If FALSE, existing workflow results will have "`_i`" added to the clusterTypes value, where `i` is one more than the largest such existing workflow clusterTypes.
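
The rule relating `transFun` and `isCount` described above can be sketched in base R. This is only an illustration of the documented behavior, not the package's code; the helper name `pickTransFun` is hypothetical.

```r
# Hypothetical helper (not part of the package): mirrors the documented rule
# for choosing the transformation from `transFun` and `isCount`.
pickTransFun <- function(transFun = NULL, isCount = FALSE) {
  if (!is.null(transFun)) {
    return(transFun)            # a user-supplied function takes precedence
  }
  if (isCount) {
    function(x) log2(x + 1)     # documented default for count data
  } else {
    function(x) x               # identity transformation otherwise
  }
}

counts <- matrix(c(0, 1, 3, 7), nrow = 2)
logged <- pickTransFun(isCount = TRUE)(counts)    # log2(counts + 1)
same   <- pickTransFun(isCount = FALSE)(counts)   # unchanged
```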

Details

Some combinations of these parameters are not feasible. See the documentation of clusterSingle for important information on how these parameter choices interact.

While the function allows for multiple values of clusterFunction, the code does not reuse the same subsampling matrix and try different clusterFunctions on it. If sequential=TRUE, reuse is not possible because different subsample clusterFunctions will create different sets of data to subsample; if sequential=FALSE, this reuse has simply not been implemented. Setting the random.seed value, however, should mean that the subsampled matrix is the same for each run, but there is no saving in computation (i.e. each subsampled co-occurrence matrix is recalculated for each set of parameters).
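
Why a fixed seed gives every run the same subsample can be illustrated with a small base R sketch. It does not call the package; `drawSubsample`, `nSamples`, and `proportion` are hypothetical names standing in for a single subsampling draw.

```r
# Hypothetical stand-in for one subsampling draw; not the package's code.
drawSubsample <- function(nSamples, proportion, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)   # reset the RNG before the draw
  sort(sample.int(nSamples, size = floor(proportion * nSamples)))
}

# Two "runs" with different clustering parameters but the same seed
runA <- drawSubsample(100, 0.5, seed = 48120)
runB <- drawSubsample(100, 0.5, seed = 48120)
identical(runA, runB)   # TRUE: both runs see the same subsample
```

Each call still performs its own draw, which matches the point above: the seed makes the subsamples identical but does not avoid recomputing them.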

The argument ks is interpreted differently for different choices of the other parameters. If sequential=TRUE, ks defines the argument k0 of seqCluster. Otherwise, ks values are the k values for both the mainClustering and subsampling steps (i.e. assigned to the subsampleArgs and mainClusterArgs that are passed to mainClustering and subsampleClustering), unless k is set appropriately in subsampleArgs. Passing these arguments via subsampleArgs only has an effect if subsample=TRUE. Similarly, passing mainClusterArgs[["k"]] only has an effect when the clusterFunction argument includes a clustering algorithm of type "K". If findBestK=TRUE, ks also defines the kRange argument of mainClustering, unless kRange is specified by the user via mainClusterArgs; note that this means the default kRange, which depends on the input k (see mainClustering), is not available in clusterMany, only in clusterSingle.

If the input is a ClusterExperiment object, the current implementation retains any existing orderSamples, coClustering, and dendrogram slots.

If run=FALSE, the function will still calculate reduced dimensions or filter statistics if they are not already calculated and saved in the object. Moreover, the results of these calculations will not be saved. Therefore, if these steps are lengthy for large datasets, it is recommended to do them before calling the function.

The given reduceMethod values must either all be precalculated filtering/dimensionality reductions stored in the appropriate location, or all be character values giving built-in filtering/dimensionality reduction methods to be calculated. If some, but not all, of the filtering/dimensionality reduction methods are already calculated and stored, then they will all be recalculated (and if they are not all built-in methods, this will give an error). So to save computational time with precalculated dimensionality reduction, the user must make sure all of them are precalculated. Also, user-defined values (i.e. not built-in functions) cannot be mixed with built-in functions unless they have already been precalculated (see makeFilterStats or makeReducedDims).
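
To make concrete what a filter statistic and a dimensionality reduction compute, here is a base R sketch on simulated data. It is an assumption-laden illustration using `apply` and `prcomp`, not the package's implementation, and it does not store anything in an object the way makeFilterStats or makeReducedDims would.

```r
# Base R sketch only; the package's own calculations may differ in detail.
set.seed(1)
x <- matrix(rnorm(200 * 30), nrow = 200)   # 200 features (rows) x 30 samples

# Variance filter: keep the nFilterDims most variable features
nFilterDims <- 50
featureVar <- apply(x, 1, var)
keep <- order(featureVar, decreasing = TRUE)[seq_len(nFilterDims)]
xFiltered <- x[keep, , drop = FALSE]

# PCA reduction: keep the first nReducedDims principal components
nReducedDims <- 5
pca <- prcomp(t(x))                        # prcomp expects samples in rows
xReduced <- pca$x[, seq_len(nReducedDims), drop = FALSE]

dim(xFiltered)   # features x samples after filtering
dim(xReduced)    # samples x components after reduction
```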

Value

If run=TRUE, returns a ClusterExperiment object in which the results are stored as clusterings with clusterTypes clusterMany. Depending on the eraseOld argument above, existing clusterings of this type are either deleted or have their clusterTypes changed. The first clustering is arbitrarily set as the primaryClusteringIndex.

If run=FALSE a list with elements:

`paramMatrix`	a matrix giving the parameters of each clustering, where each column is a possible parameter set by the user and passed to clusterSingle and each row of paramMatrix corresponds to a clustering in clMat
`mainClusterArgs`	a list of (possibly modified) arguments to mainClusterArgs
`seqArgs`	a list of (possibly modified) arguments to seqArgs
`subsampleArgs`	a list of (possibly modified) arguments to subsampleArgs
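
As a rough guide to the size of the parameter matrix, the full cross of the supplied values can be built with base R's `expand.grid`. This is a sketch only: the variable name `paramGrid` is hypothetical, and the matrix actually returned may have fewer rows if the function drops infeasible or redundant combinations.

```r
# Upper bound on the number of runs: the full cross of the values supplied.
# Base R sketch only; not the package's code.
paramGrid <- expand.grid(
  ks = 2:4,
  nReducedDims = c(5, 10, 50),
  findBestK = c(TRUE, FALSE),
  removeSil = c(TRUE, FALSE),
  clusterFunction = "pam",
  stringsAsFactors = FALSE
)
nrow(paramGrid)   # 3 * 3 * 2 * 2 = 36 combinations
head(paramGrid)
```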

Examples

## Not run: 
data(simData)

#Example: clustering using pam with different dimensions of pca and different
#k and whether remove negative silhouette values
#check how many and what runs user choices will imply:
checkParams <- clusterMany(simData,reduceMethod="PCA", makeMissingDiss=TRUE,
   nReducedDims=c(5,10,50), clusterFunction="pam", isCount=FALSE,
   ks=2:4,findBestK=c(TRUE,FALSE),removeSil=c(TRUE,FALSE),run=FALSE)
print(head(checkParams$paramMatrix))

#Now actually run it
cl <- clusterMany(simData,reduceMethod="PCA", nReducedDims=c(5,10,50),  isCount=FALSE,
   clusterFunction="pam",ks=2:4,findBestK=c(TRUE,FALSE),makeMissingDiss=TRUE, 
   removeSil=c(TRUE,FALSE))
print(cl)
head(colnames(clusterMatrix(cl)))

#make names shorter for plotting
clNames <- clusterLabels(cl)
clNames <- gsub("TRUE", "T", clNames)
clNames <- gsub("FALSE", "F", clNames)
clNames <- gsub("k=NA,", "", clNames)

par(mar=c(2, 10, 1, 1))
plotClusters(cl, axisLine=-2,clusterLabels=clNames)


#following code takes around 1+ minutes to run because of the subsampling
#that is redone each time:
system.time(clusterTrack <- clusterMany(simData, ks=2:15,
    alphas=c(0.1,0.2,0.3), findBestK=c(TRUE,FALSE), sequential=c(FALSE),
    subsample=c(FALSE), removeSil=c(TRUE), clusterFunction="pam", 
    makeMissingDiss=TRUE,
     mainClusterArgs=list(minSize=5, kRange=2:15), ncores=1, random.seed=48120))

## End(Not run)

epurdom/clusterCells documentation built on April 28, 2024, 8:14 p.m.