initialize_clusters: Cluster Initialization using a Heuristic Method

View source: R/utils.R

initialize_clusters  R Documentation

Cluster Initialization using a Heuristic Method

Description

Initializes cluster memberships and component parameters for the EM algorithm, using either a heuristic clustering method or user-defined labels.

Usage

initialize_clusters(
  X,
  G,
  init_method = c("kmedoids", "kmeans", "hierarchical", "mclust", "manual"),
  clusters = NULL
)

Arguments

X

An n x d matrix or data frame, where n is the number of observations and d is the number of columns (variables). Alternatively, X can be a vector of n observations.

G

The number of clusters, which must be at least 1. If G = 1, the user-defined clusters argument is ignored.

init_method

(optional) A string specifying the method used to initialize the EM algorithm. "kmedoids" clustering is used by default. Alternative methods include "kmeans", "hierarchical", "mclust", and "manual". When "manual" is chosen, a vector clusters of length n must be specified. When G = 1 and "kmedoids" clustering is used, the medoid is returned rather than the sample mean.

clusters

A numeric vector of length n that specifies the user-defined initial cluster memberships when init_method is set to "manual". This argument is NULL by default and is ignored when any other initialization method is chosen.
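
The difference between a medoid and a sample mean, noted under init_method for G = 1, can be illustrated with a minimal base R sketch. This is not the package's internal code; it assumes Euclidean distance and defines the medoid as the observation with the smallest total distance to all other observations. The names sample_mean, medoid_index, and medoid are made up for illustration.

```r
# Illustrative only: base R, not the package's internal code.
X <- as.matrix(iris[1:4])

# Sample mean: averages each column; usually not an actual observation.
sample_mean <- colMeans(X)

# Medoid (assumed definition): the observation with the smallest total
# Euclidean distance to all other observations.
D <- as.matrix(dist(X))
medoid_index <- which.min(rowSums(D))
medoid <- X[medoid_index, ]

sample_mean
medoid
```

Unlike the sample mean, the medoid is always one of the rows of X, which makes it less sensitive to outlying observations.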

Details

Available heuristic methods include k-medoids clustering, k-means clustering, and hierarchical clustering. Alternatively, the user can supply pre-specified cluster memberships, which makes other initialization methods possible. If the data set contains missing values, only observations with complete records are used to initialize clusters. In that case, except when G = 1, the returned cluster memberships are set to NULL because they describe only the complete records rather than the full data set.
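
The handling of missing values described above can be sketched in base R as follows. This is an assumption-laden illustration, not the package's implementation: it uses stats::kmeans as the heuristic, injects a few artificial missing values into iris, and uses hypothetical variable names (complete_rows, X_complete, km).

```r
# Illustrative sketch in base R; not the package's internal code.
set.seed(1234)

X <- as.matrix(iris[1:4])
X[c(3, 50, 120), 2] <- NA   # artificial missing values, for illustration only

G <- 3
complete_rows <- complete.cases(X)
X_complete <- X[complete_rows, , drop = FALSE]

# Run a heuristic (here k-means) on the complete records only.
km <- kmeans(X_complete, centers = G, nstart = 10)

# The labels cover only the complete records, so they cannot be mapped
# one-to-one onto all n rows of X.
length(km$cluster)   # number of complete records
nrow(X)              # n

# Mimic the documented behaviour: NULL memberships when X is incomplete.
clusters <- if (all(complete_rows)) km$cluster else NULL
```

The component parameters can still be estimated from the complete records; only the membership vector is dropped, since its length no longer matches n.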

Value

A list with the following slots:

pi

Component mixing proportions.

mu

A G by d matrix where each row is a component mean vector.

Sigma

An array of G covariance matrices, where each d by d matrix is the covariance matrix of one component.

clusters

A numeric vector with values from 1 to G indicating initial cluster memberships if X is a complete data set; NULL otherwise.
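
As a rough illustration of how the slots above relate to a membership vector, the following base R sketch derives mixing proportions, means, and covariances from a set of labels. It is not the package's code: the names pi_hat, mu_hat, and Sigma_hat are hypothetical, and the d x d x G array layout is an assumption rather than something stated in this page.

```r
# Illustrative sketch in base R; not the package's internal code.
X <- as.matrix(iris[1:4])
labels <- as.numeric(iris$Species)   # stand-in for initial memberships
G <- 3
d <- ncol(X)

# Mixing proportions: share of observations in each cluster.
pi_hat <- as.numeric(table(factor(labels, levels = 1:G))) / nrow(X)

# Component means: one row per cluster.
mu_hat <- t(sapply(1:G, function(g) colMeans(X[labels == g, , drop = FALSE])))

# Component covariances, stored here as a d x d x G array (assumed layout).
Sigma_hat <- array(NA_real_, dim = c(d, d, G))
for (g in 1:G) {
  Sigma_hat[, , g] <- cov(X[labels == g, , drop = FALSE])
}

dim(mu_hat)      # G by d
dim(Sigma_hat)   # d by d by G under the assumed layout
```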

References

Everitt, B., Landau, S., Leese, M., and Stahl, D. (2011). Cluster Analysis. John Wiley & Sons.

Kaufman, L. and Rousseeuw, P. J. (2009). Finding Groups in Data: An Introduction to Cluster Analysis, volume 344. John Wiley & Sons.

Hartigan, J. A. and Wong, M. A. (1979). Algorithm AS 136: A K-means clustering algorithm. Applied Statistics, 28, 100-108. doi: 10.2307/2346830.

Examples


#++++ Initialization using a heuristic method ++++#

library(MixtureMissing)

set.seed(1234)

init <- initialize_clusters(iris[1:4], G = 3)
init <- initialize_clusters(iris[1:4], G = 3, init_method = 'kmeans')
init <- initialize_clusters(iris[1:4], G = 3, init_method = 'hierarchical')

#++++ Initialization using user-defined labels ++++#

init <- initialize_clusters(iris[1:4], G = 3, init_method = 'manual',
                            clusters = as.numeric(iris$Species))

#++++ Initial parameters and pairwise scatterplot showing the mapping ++++#

init$pi
init$mu
init$Sigma
init$clusters

pairs(iris[1:4], col = init$clusters, pch = 16)


MixtureMissing documentation built on Oct. 16, 2024, 1:09 a.m.