mbkmeans: Mini-Batch k-means for large single cell sequencing data


This is an implementation of the mini-batch k-means algorithm of Sculley (2010) for large single-cell sequencing data. For SingleCellExperiment objects, the dimensionality reduction results stored in the reducedDim() slot are used as input by default.

mbkmeans(x, ...)

## S4 method for signature 'SummarizedExperiment'
mbkmeans(x, whichAssay = 1, ...)

## S4 method for signature 'SingleCellExperiment'
mbkmeans(x, reduceMethod = "PCA", whichAssay = 1, ...)

## S4 method for signature 'LinearEmbeddingMatrix'
mbkmeans(x, ...)

## S4 method for signature 'ANY'
mbkmeans(
  x,
  clusters,
  batch_size = min(500, NCOL(x)),
  max_iters = 100,
  num_init = 1,
  init_fraction = batch_size/NCOL(x),
  initializer = "kmeans++",
  compute_labels = TRUE,
  calc_wcss = FALSE,
  early_stop_iter = 10,
  verbose = FALSE,
  CENTROIDS = NULL,
  tol = 1e-04,
  BPPARAM = BiocParallel::SerialParam(),
  ...
)

`x`	The object on which to run mini-batch k-means. It can be a matrix-like object (e.g., matrix, Matrix, DelayedMatrix, HDF5Matrix) with genes in the rows and samples in the columns. Specialized methods are defined for SummarizedExperiment and SingleCellExperiment.
`...`	passed to 'blockApply'.
`whichAssay`	The assay to use as input to mini-batch k-means. If x is a SingleCellExperiment, this is ignored unless `reduceMethod = NA`.
`reduceMethod`	Name of dimensionality reduction results to use as input to mini-batch k-means. Set to NA to use the full matrix.
`clusters`	the number of clusters
`batch_size`	the size of the mini batches. By default, it is the smaller of 500 and the number of observations.
`max_iters`	the maximum number of clustering iterations
`num_init`	number of times the algorithm will be run with different centroid seeds
`init_fraction`	proportion of data to use for the initialization centroids (applies if initializer is kmeans++). Should be a number between 0.0 and 1.0. By default, it uses the relative batch size.
`initializer`	the method of initialization. One of kmeans++ and random. See details for more information.
`compute_labels`	logical indicating whether to compute the final cluster labels.
`calc_wcss`	logical indicating whether the per-cluster WCSS is computed. Ignored if 'compute_labels = FALSE'.
`early_stop_iter`	the number of additional iterations to run after the best within-cluster sum of squared errors has been found
`verbose`	either TRUE or FALSE, indicating whether progress is printed during clustering
`CENTROIDS`	a matrix of initial cluster centroids. Its number of rows should equal the number of clusters and its number of columns should equal the number of columns of the data.
`tol`	a numeric convergence tolerance. If, at an iteration after the first and before max_iters, the squared norm of the centroids is smaller than 'tol', then k-means is considered to have converged.
`BPPARAM`	See the 'BiocParallel' package. Only the label assignment is done in parallel.
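
The interaction between `reduceMethod` and `whichAssay` can be illustrated with a short sketch. The object names (`counts`, `pca`, `sce`) and the simulated data are assumptions made for this illustration; the only claim taken from the argument descriptions above is that a SingleCellExperiment is clustered on the named reduced dimension unless `reduceMethod = NA`.

```r
library(mbkmeans)
library(SingleCellExperiment)

set.seed(123)
# Simulated data (illustrative only): 50 genes in rows, 100 cells in columns.
counts <- matrix(rnorm(50 * 100), nrow = 50)

# Store a PCA of the cells (one row per cell) under the name "PCA".
pca <- prcomp(t(counts))$x[, 1:5]
sce <- SingleCellExperiment(
  assays = list(counts = counts),
  reducedDims = list(PCA = pca)
)

# Default behaviour: cluster on the reducedDim named by reduceMethod.
res_pca <- mbkmeans(sce, clusters = 2, reduceMethod = "PCA")

# reduceMethod = NA: cluster on the full assay selected by whichAssay.
res_full <- mbkmeans(sce, clusters = 2, reduceMethod = NA, whichAssay = 1)
```

Clustering on a handful of principal components is usually much cheaper than clustering on the full gene-by-cell matrix, which is presumably why it is the default for this class.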

The implementation is largely based on the MiniBatchKmeans function of the ClusterR package. The contribution of this package is to provide support for on-disk data representations such as HDF5, through the use of DelayedMatrix and HDF5Matrix objects, as well as for sparse data representation through the classes of the Matrix package. We also provide high-level methods for objects of class SummarizedExperiment, SingleCellExperiment, and LinearEmbeddingMatrix.
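
As a rough illustration of the matrix representations mentioned above, the following sketch passes the same simulated data to `mbkmeans` as a sparse matrix from the Matrix package and as a DelayedMatrix. The data and object names are made up for the example. An HDF5Matrix would be handled in the same way, but is omitted here to avoid assuming that an HDF5 backend is installed.

```r
library(mbkmeans)
library(Matrix)
library(DelayedArray)

set.seed(1)
# Simulated count-like data (illustrative only): 20 genes x 200 cells,
# stored as doubles.
dense <- matrix(as.numeric(rpois(20 * 200, lambda = 1)), nrow = 20)

# The same data in a sparse and in a delayed representation.
sparse  <- Matrix(dense, sparse = TRUE)
delayed <- DelayedArray(dense)

# The call is identical regardless of how the data are stored.
res_sparse  <- mbkmeans(sparse, clusters = 3)
res_delayed <- mbkmeans(delayed, clusters = 3)
```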

This function performs k-means clustering using mini batches.
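
To give an intuition for what "using mini batches" means, here is a minimal base-R sketch of the update rule described by Sculley (2010). It is not the code used by this package: the function name `mini_batch_sketch` is hypothetical, observations are in rows for simplicity, and initialization is purely random.

```r
# Hypothetical helper, for illustration only (observations in rows).
mini_batch_sketch <- function(data, k, batch_size = 50, iters = 100) {
  batch_size <- min(batch_size, nrow(data))
  # Random initialization: pick k distinct observations as centroids.
  centroids <- data[sample(nrow(data), k), , drop = FALSE]
  counts <- numeric(k)  # number of points each centroid has absorbed
  for (i in seq_len(iters)) {
    # Draw a mini batch and find the nearest centroid of each point.
    batch <- data[sample(nrow(data), batch_size), , drop = FALSE]
    d <- as.matrix(dist(rbind(centroids, batch)))[-(1:k), 1:k, drop = FALSE]
    nearest <- max.col(-d, ties.method = "first")
    # Move each assigned centroid towards the point, with a per-centroid
    # step size that shrinks as the centroid absorbs more points.
    for (j in seq_len(batch_size)) {
      cj <- nearest[j]
      counts[cj] <- counts[cj] + 1
      eta <- 1 / counts[cj]
      centroids[cj, ] <- (1 - eta) * centroids[cj, ] + eta * batch[j, ]
    }
  }
  centroids
}

set.seed(1)
# Two well-separated simulated groups in two dimensions.
toy <- rbind(matrix(rnorm(200, mean = 0, sd = 0.3), ncol = 2),
             matrix(rnorm(200, mean = 5, sd = 0.3), ncol = 2))
toy_centroids <- mini_batch_sketch(toy, k = 2)
```

Because each iteration touches only `batch_size` observations, the cost per iteration does not grow with the size of the full dataset, which is what makes the approach attractive for large or on-disk data.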

kmeans++: kmeans++ initialization. References: http://theory.stanford.edu/~sergei/papers/kMeansPP-soda.pdf and http://stackoverflow.com/questions/5466323/how-exactly-does-k-means-work

random: random selection of data rows as initial centroids
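
The difference between the two initializers can be sketched in base R. The function below is a simplified version of kmeans++ seeding written for this illustration (the name `kmeanspp_seed` is hypothetical, and observations are in rows); it is not the package's internal implementation. Random initialization corresponds to replacing the weighted draw with a uniform one.

```r
# Hypothetical helper, for illustration only (observations in rows).
# Assumes the data contain at least k distinct points.
kmeanspp_seed <- function(data, k) {
  n <- nrow(data)
  chosen <- integer(k)
  # First centre: drawn uniformly at random.
  chosen[1] <- sample.int(n, 1)
  # Squared distance from every point to its nearest chosen centre.
  d2 <- rowSums(sweep(data, 2, data[chosen[1], ])^2)
  for (i in seq_len(k - 1) + 1) {
    # Later centres: drawn with probability proportional to d2, so
    # points far from all current centres are favoured.
    chosen[i] <- sample.int(n, 1, prob = d2)
    d2 <- pmin(d2, rowSums(sweep(data, 2, data[chosen[i], ])^2))
  }
  data[chosen, , drop = FALSE]
}

set.seed(2)
pts <- matrix(rnorm(300), ncol = 3)
seeds <- kmeanspp_seed(pts, k = 4)
```

With the package itself, the choice is made through the `initializer` argument, for example `initializer = "random"`.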

A list with the following elements: centroids, WCSS_per_cluster, best_initialization, iters_per_initialization.
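
A short sketch of how the returned list might be inspected follows. The simulated matrix and the argument values are assumptions chosen for illustration; only the element names listed above are taken from this page, and `names()` can be used to check what a particular call actually returns.

```r
library(mbkmeans)

set.seed(42)
# Simulated data (illustrative only): 20 features x 100 samples.
x <- matrix(rnorm(2000), nrow = 20)
res <- mbkmeans(x, clusters = 3, num_init = 2, calc_wcss = TRUE)

names(res)                    # elements present in the result
res$centroids                 # the cluster centroids
res$WCSS_per_cluster          # per-cluster within-cluster sum of squares
res$best_initialization       # which of the num_init runs was retained
res$iters_per_initialization  # iterations used by each initialization
```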


Lampros Mouselimis and Yuwei Ni

Sculley. Web-Scale K-Means Clustering. WWW 2010, April 26–30, 2010, Raleigh, North Carolina, USA. ACM 978-1-60558-799-8/10/04.

https://github.com/mlampros/ClusterR

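library(mbkmeans)
library(SummarizedExperiment)
# SummarizedExperiment input: clusters the first assay (whichAssay = 1)
se <- SummarizedExperiment(matrix(rnorm(100), ncol = 10))
mbkmeans(se, clusters = 2)

library(SingleCellExperiment)
# SingleCellExperiment input: reduceMethod = NA uses the full assay matrix
sce <- SingleCellExperiment(matrix(rnorm(100), ncol = 10))
mbkmeans(sce, clusters = 2, reduceMethod = NA)

# Plain matrix input
x <- matrix(rnorm(100), ncol = 10)
mbkmeans(x, clusters = 3)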