kmeans_clustering: K-means clustering
In gerbeldo/tidycell: CellID Data Analysis Inside the Tidyverse

Description

Perform k-means clustering on Cell-ID data.

Usage

kmeans_clustering(
  x,
  k = 10,
  max_iter = 100,
  resume = FALSE,
  label_col = "k",
  var_cats = NULL,
  custom_vars = NULL,
  plot_progress = FALSE,
  return_list = FALSE
)

Arguments

`x`	cell.data object or a cell.data data.frame
`k`	either a positive integer setting the desired number of clusters, or a data.frame of `ucid` and `t.frame` pairs specifying the rows in `x` to be used as starting centroids.
`max_iter`	The maximum number of iterations allowed.
`resume`	logical. If `TRUE`, clustering resumes from the pre-assigned clusters found in the column named `k` (default), or in the column named by the `label_col` argument.
`label_col`	optional string specifying the column containing pre-defined clusters used when `resume` is set to `TRUE`. This overrides the default column `k`.
`var_cats`	optional character vector specifying whether pre-defined sets of morphological (`morpho`) and/or fluorescence (`fluor`) variables should be included for clustering. If no value is given and `custom_vars` is empty, this defaults to `morpho`.
`custom_vars`	optional character vector specifying custom variables to be included for clustering. These are added to any variable sets specified by `var_cats`.

Details

K-means clusters data by assigning each row to the cluster whose center (centroid) is nearest in Euclidean distance. After all rows have been assigned, each centroid is moved to the column means of the rows assigned to its cluster. Row assignment and centroid updates are repeated until the algorithm converges, i.e., no rows are re-assigned after the centroid positions have been updated, or until `max_iter` iterations have been run.
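
The assign/update loop described above can be sketched in base R as follows. This is only an illustration under simplifying assumptions (a plain numeric matrix, starting centroids supplied by the caller), not the tidycell implementation; `simple_kmeans` and its arguments are hypothetical names.

```r
# Illustrative k-means loop; `simple_kmeans` is a hypothetical helper,
# not part of tidycell.
simple_kmeans <- function(m, centroids, max_iter = 100) {
  m <- as.matrix(m)
  centroids <- as.matrix(centroids)
  n_clusters <- nrow(centroids)
  labels <- rep(NA_integer_, nrow(m))
  for (iter in seq_len(max_iter)) {
    # Squared Euclidean distance from every row to every centroid (n x k)
    d2 <- matrix(
      vapply(seq_len(n_clusters),
             function(j) rowSums(sweep(m, 2, centroids[j, ])^2),
             numeric(nrow(m))),
      nrow = nrow(m)
    )
    # Assign each row to its nearest centroid
    new_labels <- max.col(-d2, ties.method = "first")
    # Converged: no row changed cluster since the last centroid update
    if (identical(new_labels, labels)) break
    labels <- new_labels
    # Move each centroid to the column means of its assigned rows
    for (j in seq_len(n_clusters)) {
      if (any(labels == j)) {
        centroids[j, ] <- colMeans(m[labels == j, , drop = FALSE])
      }
    }
  }
  # Distance of each row to the centroid of its assigned cluster
  k_dist <- sqrt(rowSums((m - centroids[labels, , drop = FALSE])^2))
  list(k = labels, k.dist = k_dist, centroids = centroids)
}

# Usage on a toy matrix, with rows 1 and 3 as starting centroids
m <- rbind(c(0, 0), c(0, 1), c(10, 10), c(10, 11))
res <- simple_kmeans(m, centroids = m[c(1, 3), ])
```

The returned `k` and `k.dist` elements mirror the columns described under Value, but the list layout itself is an assumption of this sketch.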

The number of clusters is defined by the parameter `k`, and clustering can be either completely unsupervised (`k` is a number that only sets the desired number of clusters) or semi-supervised (`k` is a data.frame of `ucid` and `t.frame` pairs defining which rows/cells to use as starting centroids). If unsupervised, starting centroids are chosen randomly by sampling `k` rows from `x`. Semi-supervised clustering can also be achieved by setting `resume = TRUE` and pointing `label_col` at a column of pre-defined labels assigned to a subset of rows, which are then used to calculate the positions of the starting centroids.
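
The three ways of obtaining starting centroids can be illustrated on a toy data.frame. This is a sketch in base R, assuming placeholder variable columns `v1` and `v2` (not Cell-ID variable names); how tidycell performs these steps internally is not shown here.

```r
# Toy data; `v1`/`v2` are placeholder variable columns.
cells <- data.frame(
  ucid    = c(1, 1, 2, 2, 3, 3),
  t.frame = c(0, 1, 0, 1, 0, 1),
  v1      = c(0.1, 0.2, 5.0, 5.1, 9.8, 9.9),
  v2      = c(1.0, 1.1, 4.9, 5.2, 0.2, 0.1)
)
vars <- c("v1", "v2")

# (a) Unsupervised: sample k rows at random as starting centroids.
set.seed(1)
k <- 3
start_random <- as.matrix(cells[sample(nrow(cells), k), vars])

# (b) Semi-supervised: pick the rows matching ucid / t.frame pairs.
seeds <- data.frame(ucid = c(1, 2, 3), t.frame = c(0, 0, 0))
picked <- merge(seeds, cells, by = c("ucid", "t.frame"))
start_seeded <- as.matrix(picked[, vars])

# (c) Semi-supervised: average the rows that carry a pre-defined label.
cells$k <- c(1, 1, 2, 2, NA, NA)  # only a subset of rows is labelled
labelled <- cells[!is.na(cells$k), ]
start_labels <- as.matrix(
  aggregate(labelled[vars], by = list(k = labelled$k), FUN = mean)[, vars]
)
```

In case (c) the number of clusters follows from the number of distinct labels, which is an assumption of this sketch rather than documented behavior.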

Note that this algorithm is not guaranteed to find the global optimum: it converges to a local optimum that depends on the starting centroids.
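
One common way to reduce the impact of the starting centroids is to repeat the clustering from several random starts and keep the run with the lowest total within-cluster sum of squares. The sketch below uses base R's `stats::kmeans()` on simulated data purely to illustrate the idea; it is a stand-in for, not a use of, `kmeans_clustering()`.

```r
# Simulated data: three groups of 50 points in two dimensions
set.seed(42)
m <- rbind(
  matrix(rnorm(100, mean = 0), ncol = 2),
  matrix(rnorm(100, mean = 4), ncol = 2),
  matrix(rnorm(100, mean = 8), ncol = 2)
)

# Ten independent runs, each from one random set of starting centroids
runs <- lapply(1:10, function(i) {
  kmeans(m, centers = 3, iter.max = 100, nstart = 1)
})

# Total within-cluster sum of squares of each run (lower is better)
wss <- vapply(runs, function(r) r$tot.withinss, numeric(1))

# Keep the best run
best <- runs[[which.min(wss)]]
```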

Value

Depending on the type of `x`, either a cell.data object or a cell.data data.frame with appended columns `k` and `k.dist`, giving the assigned cluster and the Euclidean distance to that cluster's centroid, respectively.

gerbeldo/tidycell documentation built on Aug. 15, 2022, 2:35 p.m.