seqCluster: Program for sequentially clustering, removing cluster, and...

View source: R/seqCluster.R

seqClusterR Documentation

Program for sequentially clustering, removing cluster, and starting again.

Description

Given a data matrix, this function will call clustering routines, and sequentially remove best clusters, and iterate to find clusters.

Usage

seqCluster(
  inputMatrix,
  inputType,
  k0,
  subsample = TRUE,
  beta,
  top.can = 0.01,
  remain.n = 30,
  k.min = 3,
  k.max = k0 + 10,
  verbose = TRUE,
  subsampleArgs = NULL,
  mainClusterArgs = NULL,
  warnings = FALSE
)

Arguments

inputMatrix

numerical matrix on which to run the clustering or a SummarizedExperiment, SingleCellExperiment, or ClusterExperiment object.

inputType

a character vector defining what type of input is given in the inputMatrix argument. Must consist of values "diss","X", or "cat" (see details). "X" and "cat" should be indicate matrices with features in the row and samples in the column; "cat" corresponds to the features being numerical integers corresponding to categories, while "X" are continuous valued features. "diss" corresponds to an inputMatrix that is a NxN dissimilarity matrix. "cat" is largely used internally for clustering of sets of clusterings.

k0

the value of K at the first iteration of sequential algorithm, see details below or vignette.

subsample

logical as to whether to subsample via subsampleClustering to get the distance matrix at each iteration; otherwise the distance matrix is set by arguments to mainClustering.

beta

value between 0 and 1 to decide how stable clustership membership has to be before 'finding' and removing the cluster.

top.can

only the top.can clusters from mainClustering (ranked by 'orderBy' argument given to mainClustering) will be compared pairwise for stability. Can be either an integer value, identifying the absolute number of clusters, or a value between 0 and 1, meaning to keep all clusters with at least this proportion of the remaining samples in the cluster. Making this either a very big integer or equal to 0 will effectively remove this parameter and all pairwise comparisons of all clusters found will be considered; this might result in smaller clusters being found. If top.can is between 0 and 1, then there is still a hard threshold of at least 5 samples in a cluster to be considered as a cluster.

remain.n

when only this number of samples are left (i.e. not yet clustered) then algorithm will stop.

k.min

each iteration of sequential detection of clustering will decrease the beginning K of subsampling, but not lower than k.min.

k.max

algorithm will stop if K in iteration is increased beyond this point.

verbose

whether the algorithm should print out information as to its progress.

subsampleArgs

list of arguments to be passed to subsampleClustering.

mainClusterArgs

list of arguments to be passed to mainClustering).

warnings

logical. Whether to print out the many possible warnings and messages regarding checking the internal consistency of the parameters.

Details

seqCluster is not meant to be called by the user. It is only an exported function so as to be able to clearly document the arguments for seqCluster which can be passed via the argument seqArgs in functions like clusterSingle and clusterMany.

This code is adapted from the sequential protion of the code of the tightClust package of Tseng and Wong. At each iteration of the algorithm it finds a set of samples that constitute a homogeneous cluster and remove them, and iterate again to find the next set of samples that form a cluster.

In each iteration, to determine the next set of homogeneous set of samples, the algorithm will iteratively cluster the current set of samples for a series of increasing values of the parameter $K$, starting at a value kinit and increasing by 1 at each iteration, until a sufficiently homogeneous set of clusters is found. For the first set of homogeneous samples, kinit is set to the argument $k0$, and for iteration, kinit is increased internally.

Depending on the value of subsample how the value of $K$ is used differs. If subsample=TRUE, $K$ is the k sent to the cluster function clusterFunction sent to subsampleClustering via subsampleArgs; then mainClustering is run on the result of the co-occurance matrix from subsampleClustering with the ClusterFunction object defined in the argument clusterFunction set via mainClusterArgs. The number of clusters actually resulting from this run of mainClustering may not be equal to the $K$ sent to the clustering done in subsampleClustering. If subsample=FALSE, mainClustering is called directly on the data to determine the clusters and $K$ set by seqCluster for this iteration determines the parameter of the clustering done by mainClustering. Specifically, the argument clusterFunction defines the clustering of the mainClustering step and k is sent to that ClusterFunction object. This means that if subsample=FALSE, the clusterFunction must be of algorithmType "K".

In either setting of subsample, the resulting clusters from mainClustering for a particular $K$ will be compared to clusters found in the previous iteration of $K-1$. For computational (and other?) convenience, only the first top.can clusters of each iteration will be compared to the first top.can clusters of previous iteration for similarity (where top.can currently refers to ordering by size, so first top.can largest clusters.

If there is no cluster of the first top.can in the current iteration $K$ that has overlap similarity > beta to any in the previous iteration, then the algorithm will move to the next iteration, increasing to $K+1$.

If, however, of these clusters there is a cluster in the current iteration $K$ that has overlap similarity > beta to a cluster in the previous iteration $K-1$, then the cluster with the largest such similarity will be identified as a homogenous set of samples and the samples in it will be removed and designated as such. The algorithm will then start again to determine the next set of homogenous samples, but without these samples. Furthermore, in this case (i.e. a cluster was found and removed), the value of kinit will be be reset to kinit-1; i.e. the range of increasing $K$ that will be iterated over to find a set of homogenous samples will start off one value less than was the case for the previous set of homogeneous samples. If kinit-1<k.min, then kinit will be set to k.min.

If there are less than remain.n samples left after finding a cluster and removing its samples, the algorithm will stop, as subsampling is deamed to no longer be appropriate. If the K has to be increased to beyond k.max without finding any pair of clusters with overlap > beta, then the algorithm will stop. Any samples not found as part of a homogenous set of clusters at that point will be classified as unclustered (given a value of -1)

Certain combinations of inputs to mainClusterArgs and subsampleArgs are not allowed. See clusterSingle for these explanations.

Value

A list with values

  • clustering a vector of length equal to nrows(x) giving the integer-valued cluster ids for each sample. The integer values are assigned in the order that the clusters were found. "-1" indicates the sample was not clustered.

  • clusterInfo if clusters were successfully found, a matrix of information regarding the algorithm behavior for each cluster (the starting and stopping K for each cluster, and the number of iterations for each cluster).

  • whyStop a character string explaining what triggered the algorithm to stop.

References

Tseng and Wong (2005), "Tight Clustering: A Resampling-Based Approach for Identifying Stable and Tight Patterns in Data", Biometrics, 61:10-16.

See Also

tight.clust, clusterSingle,mainClustering,subsampleClustering

Examples

## Not run: 
data(simData)

set.seed(12908)
clustSeqHier <- seqCluster(simData, inputType="X", k0=5, subsample=TRUE,
   beta=0.8, subsampleArgs=list(resamp.n=100,
   samp.p=0.7, clusterFunction="kmeans", clusterArgs=list(nstart=10)),
   mainClusterArgs=list(minSize=5,clusterFunction="hierarchical01",
   clusterArgs=list(alpha=0.1)))

## End(Not run)

epurdom/clusterCells documentation built on April 28, 2024, 8:14 p.m.