TGL_kmeans: kmeans++ with return value similar to R kmeans
In tanaylab/tglkmeans: Efficient Implementation of K-Means++ Algorithm

Description

Clusters observations with k-means using kmeans++ seeding, and returns a list structured similarly to the return value of R's kmeans.

Usage

TGL_kmeans(
  df,
  k,
  metric = "euclid",
  max_iter = 40,
  min_delta = 0.0001,
  verbose = FALSE,
  keep_log = FALSE,
  id_column = FALSE,
  reorder_func = "hclust",
  hclust_intra_clusters = FALSE,
  seed = NULL,
  use_cpp_random = FALSE
)

Arguments

`df`	a data frame or a matrix. Each row is a single observation and each column is a dimension. The first column can contain an ID for each observation (if `id_column` is TRUE); otherwise the rownames are used.
`k`	number of clusters. Note that in some cases the algorithm might return fewer clusters than k.
`metric`	distance metric for kmeans++ seeding. Can be 'euclid', 'pearson' or 'spearman'.
`max_iter`	maximal number of iterations
`min_delta`	minimal change in assignments (fraction out of all observations) to continue iterating
`verbose`	display algorithm messages
`keep_log`	keep algorithm messages in 'log' field
`id_column`	`df`'s first column contains the observation id
`reorder_func`	function used to reorder the clusters. It is applied to each center, and the clusters are ordered by the result. E.g. `reorder_func = mean` would calculate the mean of each center and then reorder the clusters accordingly. If `reorder_func = hclust`, the centers are ordered by hclust of the euclidean distance of the correlation matrix, i.e. `hclust(dist(cor(t(centers))))`. If NULL, no reordering is done.
`hclust_intra_clusters`	run hierarchical clustering within each cluster and return an ordering of the observations.
`seed`	seed for the c++ random number generator
`use_cpp_random`	use the c++ random number generator instead of R's. This should be used only for backwards compatibility, as from version 0.4.0 onwards the default random number generator was changed to R's.
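
The two reordering modes described for `reorder_func` can be sketched in base R. This is only an illustration of the ordering logic on a made-up `centers` matrix (rows are clusters, columns are dimensions); it is not the package's internal code, and the toy values and variable names are assumptions.

```r
# Toy centers: 4 clusters (rows) by 3 dimensions (columns). Values are made up.
centers <- rbind(
    c1 = c(5.0, 5.2, 4.9),
    c2 = c(1.0, 1.1, 0.9),
    c3 = c(3.0, 2.8, 3.1),
    c4 = c(2.0, 2.1, 1.9)
)

# reorder_func = mean: summarise each center with one number,
# then order the clusters by that number.
center_means <- apply(centers, 1, mean)
order_by_mean <- order(center_means)
rownames(centers)[order_by_mean]
# "c2" "c4" "c3" "c1"

# reorder_func = hclust: the expression quoted in the argument description.
# cor(t(centers)) correlates clusters with each other across dimensions.
hc <- hclust(dist(cor(t(centers))))
order_by_hclust <- hc$order
order_by_hclust
```

With a summary function such as `mean`, the order follows a single number per cluster; with hclust, clusters whose centers have similar correlation profiles end up next to each other.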

Value

list with the following components:

cluster:: A vector of integers (from ‘1:k’) indicating the cluster to which each point is allocated.
centers:: A matrix of cluster centers.
size:: The number of points in each cluster.
log:: messages from the algorithm run (only if keep_log = TRUE).
order:: A vector of integers with the new ordering of the observations (only if hclust_intra_clusters = TRUE).
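
For comparison, the components named `cluster`, `centers` and `size` also exist in the list returned by base R's `stats::kmeans`. The sketch below uses only `stats::kmeans` on made-up data to show those three components; it does not call `TGL_kmeans`, and the exact shape of the `TGL_kmeans` components should be checked against an actual run.

```r
# Two well-separated toy groups in 2 dimensions (made-up data)
set.seed(1)
x <- rbind(
    matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
    matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2)
)

# Base R k-means, used here only as the reference for the return structure
base_km <- stats::kmeans(x, centers = 2)

# The three components that the Value list above shares with stats::kmeans
shared <- c("cluster", "centers", "size")
shared %in% names(base_km)
str(base_km[shared])
```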

Examples



# create 5 clusters normally distributed around 1:5
d <- simulate_data(
    n = 100,
    sd = 0.3,
    nclust = 5,
    dims = 2,
    add_true_clust = FALSE,
    id_column = FALSE
)

head(d)

# cluster
km <- TGL_kmeans(d, k = 5, metric = "euclid", verbose = TRUE)
names(km)
km$centers
head(km$cluster)
km$size

tanaylab/tglkmeans documentation built on Jan. 24, 2025, 7:23 a.m.