kmeans_clustering: K-means clustering

View source: R/k-means_clustering.R

kmeans_clustering    R Documentation

K-means clustering

Description

Perform k-means clustering on Cell-ID data.

Usage

kmeans_clustering(
  x,
  k = 10,
  max_iter = 100,
  resume = FALSE,
  label_col = "k",
  var_cats = NULL,
  custom_vars = NULL,
  plot_progress = FALSE,
  return_list = FALSE
)

Arguments

x

cell.data object or a cell.data data.frame

k

either a positive integer setting the desired number of clusters, or a data.frame of ucid and t.frame pairs specifying the rows in x to be used as starting centroids.

max_iter

The maximum number of iterations allowed.

resume

logical. If TRUE, the algorithm resumes clustering from pre-assigned clusters found in the column named k (the default) or in the column named by the label_col argument.

label_col

optional string specifying the column containing pre-defined clusters used when resume is set to TRUE. This overrides the default column k.

var_cats

optional character vector specifying whether pre-defined sets of morphological (morpho) and/or fluorescence (fluor) variables should be included for clustering. If no value is given and custom_vars is empty, this defaults to morpho.

custom_vars

optional character vector specifying custom variables to be included for clustering. These are added to any variable sets specified by var_cats.

Details

K-means clusters data by assigning each row to the nearest cluster, based on the row's Euclidean distance to the center (centroid) of each cluster. After all rows have been assigned, centroid positions are updated by calculating the column means of the rows assigned to each cluster. Row assignment and centroid updates are repeated until the algorithm converges, i.e., no rows are re-assigned after the centroid positions have been updated.
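
The iteration described above can be sketched in a few lines of base R. This is an illustrative sketch only, not the package's implementation: the function name simple_kmeans, its arguments, and the toy matrix are all hypothetical.

```r
# Minimal k-means sketch (hypothetical helper, not part of the package).
# m:         numeric matrix, one row per observation
# centroids: numeric matrix of starting centroids, one row per cluster
simple_kmeans <- function(m, centroids, max_iter = 100) {
  m <- as.matrix(m)
  k <- nrow(centroids)
  cluster <- rep(NA_integer_, nrow(m))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every row to every centroid.
    d2 <- matrix(0, nrow(m), k)
    for (j in seq_len(k)) {
      d2[, j] <- rowSums(sweep(m, 2, centroids[j, ], "-")^2)
    }
    # Assign each row to its nearest centroid.
    new_cluster <- max.col(-d2, ties.method = "first")
    # Converged: no row changed cluster after the last centroid update.
    if (identical(new_cluster, cluster)) break
    cluster <- new_cluster
    # Update each centroid to the column means of its assigned rows.
    for (j in seq_len(k)) {
      members <- cluster == j
      if (any(members)) {
        centroids[j, ] <- colMeans(m[members, , drop = FALSE])
      }
    }
  }
  # Euclidean distance of each row to its own (final) centroid.
  dist <- sqrt(rowSums((m - centroids[cluster, , drop = FALSE])^2))
  list(cluster = cluster, centroids = centroids, dist = dist, iterations = iter)
}

# Toy data: two well-separated groups of three points each.
m <- rbind(c(0, 0), c(0, 1), c(1, 0), c(10, 10), c(10, 11), c(11, 10))
res <- simple_kmeans(m, centroids = m[c(1, 4), , drop = FALSE])
res$cluster
```

The cluster and dist elements of the returned list play the same role as the k and k.dist columns described under Value.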

The number of clusters is set by the parameter k. Clustering can be either completely unsupervised (k is a single number giving the desired number of clusters) or semi-supervised (k is a data.frame of ucid and t.frame pairs defining which rows/cells to use as starting centroids). If unsupervised, starting centroids are chosen randomly by sampling k rows from the data. Semi-supervised clustering can also be achieved by indicating a column of pre-defined labels assigned to a subset of rows; these labels are then used to calculate the positions of the starting centroids.
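
The three ways of obtaining starting centroids can be illustrated with base R on a toy data.frame. This sketch only mirrors the description above and makes no claim about how the package does it internally; the variable columns v1 and v2 and all object names are made up for the example.

```r
# Toy data; v1 and v2 are hypothetical clustering variables.
d <- data.frame(
  ucid    = c(1, 2, 3, 4, 5, 6),
  t.frame = 0,
  v1      = c(0, 0, 1, 10, 10, 11),
  v2      = c(0, 1, 0, 10, 11, 10),
  k       = c(1, NA, NA, 2, 2, NA)  # labels known for a subset of rows only
)
vars <- c("v1", "v2")

# (a) Unsupervised: sample k = 2 rows at random as starting centroids.
set.seed(1)
start_random <- as.matrix(d[sample(nrow(d), 2), vars])

# (b) Semi-supervised: use the rows identified by ucid / t.frame pairs.
picks <- data.frame(ucid = c(1, 4), t.frame = c(0, 0))
idx <- match(paste(picks$ucid, picks$t.frame), paste(d$ucid, d$t.frame))
start_picked <- as.matrix(d[idx, vars])

# (c) Semi-supervised: average the pre-labelled rows within each label.
labelled <- d[!is.na(d$k), ]
label_means <- aggregate(labelled[vars], by = list(k = labelled$k), FUN = mean)
start_labels <- as.matrix(label_means[, vars])
```

Each of the three matrices has one row per cluster and one column per clustering variable, so any of them could serve as the starting point for the assignment and update iterations.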

Note that this algorithm is not guaranteed to find the global optimum; the result can depend on the starting centroids.
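
One common way to reduce the effect of an unlucky start is to repeat the clustering from several random starts and keep the run with the smallest total within-cluster sum of squares. The sketch below uses stats::kmeans from base R on simulated data to show the idea. It is an assumption-laden illustration: this page does not say whether kmeans_clustering performs restarts itself.

```r
set.seed(42)
# Simulated data: three groups of 30 points in two dimensions.
m <- rbind(
  matrix(rnorm(60, mean = 0, sd = 0.5), ncol = 2),
  matrix(rnorm(60, mean = 4, sd = 0.5), ncol = 2),
  cbind(rnorm(30, mean = 0, sd = 0.5), rnorm(30, mean = 4, sd = 0.5))
)

# Run the assign/update iteration (Lloyd's algorithm) from 20 random starts.
runs <- lapply(1:20, function(i) {
  kmeans(m, centers = 3, iter.max = 100, algorithm = "Lloyd")
})

# Total within-cluster sum of squares of each run; lower is better.
wss <- vapply(runs, function(r) r$tot.withinss, numeric(1))
best <- runs[[which.min(wss)]]
range(wss)
```

If the two values printed by range(wss) differ, at least one run stopped at a poorer local optimum than the best one.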

Value

Depending on the type of x, either a cell.data object or a cell.data data.frame, with appended columns k and k.dist giving the assigned cluster and the Euclidean distance to that cluster's centroid, respectively.


gerbeldo/tidycell documentation built on Aug. 15, 2022, 2:35 p.m.