KMeansClustering: K-Means Clustering

KMeansClustering R Documentation

K-Means Clustering

Description

This function can be used to train a k-means clustering machine learning model in R.

Details

K-means clustering is a widely used unsupervised machine learning algorithm. This function performs unsupervised clustering or labelling based on the k-means clustering (KMC) algorithm. The package imports the mini-batch k-means function from the ClusterR package, which is written in C++ and is therefore computationally very fast.
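To make the mini-batch idea concrete, here is a minimal base R sketch of a mini-batch k-means update loop. It is an illustration only, not the ClusterR implementation: the function name `mini_batch_kmeans`, the plain random initialization, and the per-centroid learning rate `1 / count` are assumptions chosen for brevity.

```r
# Illustrative mini-batch k-means in base R (hypothetical helper, not the ClusterR code).
mini_batch_kmeans <- function(X, k, batch_size = 10, max_iter = 100, seed = 1) {
  set.seed(seed)
  X <- as.matrix(X)
  # Assumption: start from k rows picked at random (simpler than kmeans++).
  centroids <- X[sample(nrow(X), k), , drop = FALSE]
  counts <- numeric(k)  # number of points each centroid has absorbed so far
  for (iter in seq_len(max_iter)) {
    # Draw one mini-batch of rows.
    batch <- X[sample(nrow(X), batch_size), , drop = FALSE]
    # Squared Euclidean distance from every batch point to every centroid.
    d <- matrix(0, nrow(batch), k)
    for (j in seq_len(k)) {
      d[, j] <- rowSums(sweep(batch, 2, centroids[j, ])^2)
    }
    nearest <- apply(d, 1, which.min)
    # Move each assigned centroid toward its points with a shrinking step size.
    for (i in seq_len(nrow(batch))) {
      j <- nearest[i]
      counts[j] <- counts[j] + 1
      eta <- 1 / counts[j]
      centroids[j, ] <- (1 - eta) * centroids[j, ] + eta * batch[i, ]
    }
  }
  centroids
}

set.seed(42)
toy <- rbind(matrix(rnorm(200, mean = -3), ncol = 2),
             matrix(rnorm(200, mean = 3), ncol = 2))
centers <- mini_batch_kmeans(toy, k = 2, batch_size = 20)
print(centers)
```

Because each iteration touches only `batch_size` rows, the cost per iteration does not grow with the size of the data set, which is the main reason mini-batch variants scale well.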

Public fields

clusters

indicates the number of clusters; this is a hyperparameter and must be tuned.

b_size

indicates the size of the mini-batches used while fitting the model.

num_rep

indicates the number of times the algorithm is run, each time with different randomly chosen centroid seeds.

max_iterations

indicates the maximum number of epochs performed for clustering.

init_fraction

indicates the fraction of the data used to initialize the random centroid points; it applies if initializer is set to kmeans++. It shall be of type float in the range 0 to 1.

initializer

indicates the method used to initialize the centroids. It can take the values kmeans++, optimal_init, or quantile_init; usually kmeans++ is used.

early_stop_iterations

indicates the number of iterations for which the algorithm continues to run after finding the best within-cluster sum of squared error.

verbose

indicates whether progress is printed to the console. It shall be logical, either TRUE or FALSE.

centroids

is a matrix of initial cluster centroids. The number of columns shall equal the number of features in the data, and the number of rows shall equal the number of centroids (clusters).

tolerance

shall be a floating-point number. If the iteration number is > 1 and < max_iterations, and the tolerance is greater than the squared norm of the centroids, the k-means clustering algorithm has converged.

tolerance_optimal_init

is the tolerance value for the optimal_init initializer; a greater value indicates well-separated clusters.

seed

shall be an integer value used to seed the random number generator.

model

this is used internally by superml.

max_clusters

this can be either a single numeric value or a contiguous or non-contiguous numeric vector specifying the search space for the number of clusters.
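The convergence rule described for the tolerance field can be sketched in a few lines of base R. This is a hedged illustration: the helper name `has_converged` is hypothetical, and it assumes that "the squared norm of the centroids" refers to the squared norm of the change in the centroids between two consecutive iterations.

```r
# Hypothetical helper illustrating the tolerance-based convergence test.
# Assumption: the compared quantity is the squared shift of the centroids
# between the previous and the current iteration.
has_converged <- function(old_centroids, new_centroids, iteration,
                          max_iterations = 100, tolerance = 1e-4) {
  shift <- sum((new_centroids - old_centroids)^2)
  iteration > 1 && iteration < max_iterations && shift < tolerance
}

old <- rbind(c(0, 0), c(5, 5))
tiny_move <- has_converged(old, old + 1e-6, iteration = 5)   # centroids barely moved
big_move <- has_converged(old, old + 1, iteration = 5)       # centroids still moving
first_iter <- has_converged(old, old + 1e-6, iteration = 1)  # never stops at iteration 1
print(c(tiny_move, big_move, first_iter))
```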

Active bindings

verbose

indicates whether progress is printed to the console. It shall be logical, either TRUE or FALSE.

Methods

Public methods


Method new()

Usage
KMeansClustering$new(
  clusters,
  b_size = 10,
  num_rep = 1,
  max_iterations = 100,
  init_fraction = 1,
  initializer = "kmeans++",
  early_stop_iterations = 10,
  verbose = FALSE,
  centroids = NULL,
  tolerance = 1e-04,
  tolerance_optimal_init = 0.3,
  seed = 1,
  max_clusters = NA
)
Arguments
clusters

It shall be of type numeric, indicating the number of clusters.

b_size

It shall be of type numeric, indicating the mini-batch size passed to the mini-batch k-means C++ implementation.

num_rep

It shall be of type integer, indicating the number of times the algorithm is run, each time with different randomly chosen centroid seeds.

max_iterations

It shall be of type integer, indicating the maximum number of iterations to be performed.

init_fraction

It shall be of type float in the range 0 to 1, indicating the fraction of the data used to initialize the random centroid points; it applies if initializer is set to kmeans++.

initializer

It shall be of type character, indicating the initializer for the centroids; the most widely used is kmeans++.

early_stop_iterations

It shall be of type integer, indicating the number of iterations for which the algorithm continues to run after the best within-cluster sum of squared error has been achieved.

verbose

It shall be of type logical, either TRUE or FALSE, indicating whether progress shall be printed to the console during calculations.

centroids

It shall be a matrix with entries of type integer or float, indicating the initial cluster centroids.

tolerance

It shall be of type float. If the iteration number is > 1 and < max_iterations, and the tolerance is greater than the squared norm of the centroids, the k-means clustering algorithm has converged.

tolerance_optimal_init

It shall be of type float. tolerance_optimal_init is the tolerance value for the optimal_init initializer; a greater value indicates well-separated clusters.

seed

It shall be of type integer, indicating the seed for the random number generator.

max_clusters

max_clusters can be either a single numeric value or a contiguous or non-contiguous numeric vector specifying the search space for the number of clusters.
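The early_stop_iterations argument can be read as a patience counter: keep iterating for a fixed number of steps after the best within-cluster sum of squared error, then stop. The base R sketch below illustrates that reading on a made-up error sequence. The helper name `early_stop_index` and the example values are hypothetical, and the package's internal bookkeeping may differ.

```r
# Hypothetical helper: return the iteration at which a patience-based
# early stop would trigger, given a sequence of within-cluster errors.
early_stop_index <- function(wcss, early_stop_iterations = 10) {
  best <- Inf
  since_best <- 0
  for (i in seq_along(wcss)) {
    if (wcss[i] < best) {
      best <- wcss[i]       # new best error, reset the patience counter
      since_best <- 0
    } else {
      since_best <- since_best + 1
      if (since_best >= early_stop_iterations) return(i)
    }
  }
  length(wcss)              # never ran out of patience
}

# Made-up error values, for illustration only.
wcss <- c(10, 8, 7, 7.5, 7.2, 7.1, 6, 6.5)
stop_short <- early_stop_index(wcss, early_stop_iterations = 3)
stop_long <- early_stop_index(wcss, early_stop_iterations = 10)
print(c(stop_short, stop_long))
```

With a patience of 3 the loop stops before it reaches the later, lower error at position 7, which shows why a very small early_stop_iterations value can cut the search short.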

Details

Create a new KMeansClustering object.

Returns

A KMeansClustering object.

Examples
data_set <- rbind(replicate(30, rnorm(1e4, 3)),
             replicate(30, rnorm(1e4, -1)),
             replicate(30, rnorm(1e4, 5)))
km <- KMeansClustering$new(clusters=2, b_size=30, max_clusters=6)

Method fit()

Usage
KMeansClustering$fit(X_data, y = NULL, find_optimal = FALSE)
Arguments
X_data

X_data shall be either a data.frame or a matrix containing the features of interest.

y

y is set to NULL; it is kept here only for consistency with the general superml interface, in which every X is mapped to a y.

find_optimal

find_optimal shall be logical; it indicates whether to search for the optimal number of clusters automatically.

Details

This function fits the KMeansClustering model.

Returns

NULL
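As a rough picture of what searching over a range of cluster counts involves, the sketch below compares the total within-cluster sum of squares for several candidate values of k. It is only an analogy: it uses `stats::kmeans` from base R rather than the mini-batch routine, the `candidate_k` vector merely plays the role of a max_clusters search space, and the selection criterion the package applies when find_optimal = TRUE may be different.

```r
# Illustrative search over candidate cluster counts using base R's kmeans().
set.seed(1)
toy <- rbind(matrix(rnorm(200, mean = -4), ncol = 2),
             matrix(rnorm(200, mean = 0), ncol = 2),
             matrix(rnorm(200, mean = 4), ncol = 2))

candidate_k <- 2:6  # hypothetical search space for the number of clusters
wcss <- sapply(candidate_k, function(k) {
  # nstart = 5 reruns kmeans from several random starts and keeps the best fit.
  kmeans(toy, centers = k, nstart = 5)$tot.withinss
})
names(wcss) <- candidate_k
print(round(wcss, 1))
```

The error usually drops sharply up to the true number of groups and flattens afterwards, so a visible "elbow" in these values is a common informal guide for choosing k.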

Examples
data_set <- rbind(replicate(30, rnorm(1e4, 3)),
             replicate(30, rnorm(1e4, -1)),
             replicate(30, rnorm(1e4, 5)))
km <- KMeansClustering$new(clusters=2, b_size=30, max_clusters=6)
km$fit(data_set, find_optimal = FALSE)

Method predict()

Usage
KMeansClustering$predict(X_data)
Arguments
X_data

It shall be an R data.frame or matrix.

Details

Returns the prediction on the provided data.

Returns

A vector containing the predictions.
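For k-means models, a prediction is typically the index of the centroid closest to each row. The base R sketch below shows that assignment step under this assumption. The helper `assign_clusters` and the example centroids are hypothetical and are not taken from the package.

```r
# Hypothetical helper: label each row of X with the index of its nearest centroid.
assign_clusters <- function(X, centroids) {
  X <- as.matrix(X)
  d <- matrix(0, nrow(X), nrow(centroids))
  for (j in seq_len(nrow(centroids))) {
    # Squared Euclidean distance from every row of X to centroid j.
    d[, j] <- rowSums(sweep(X, 2, centroids[j, ])^2)
  }
  apply(d, 1, which.min)
}

centroids <- rbind(c(0, 0), c(10, 10))          # made-up centroids
new_points <- rbind(c(1, -1), c(9, 11), c(4, 4))
labels <- assign_clusters(new_points, centroids)
print(labels)
```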

Examples
data_set <- rbind(replicate(30, rnorm(1e4, 2)),
             replicate(30, rnorm(1e4, -1)),
             replicate(30, rnorm(1e4, 5)))
km <- KMeansClustering$new(clusters=2, b_size=30, max_clusters=6)
km$fit(data_set, find_optimal = FALSE)
preds <- km$predict(data_set)

Method clone()

The objects of this class are cloneable with this method.

Usage
KMeansClustering$clone(deep = FALSE)
Arguments
deep

Whether to make a deep clone.

Examples


## ------------------------------------------------
## Method `KMeansClustering$new`
## ------------------------------------------------

data_set <- rbind(replicate(30, rnorm(1e4, 3)),
             replicate(30, rnorm(1e4, -1)),
             replicate(30, rnorm(1e4, 5)))
km <- KMeansClustering$new(clusters=2, b_size=30, max_clusters=6)

## ------------------------------------------------
## Method `KMeansClustering$fit`
## ------------------------------------------------

data_set <- rbind(replicate(30, rnorm(1e4, 3)),
             replicate(30, rnorm(1e4, -1)),
             replicate(30, rnorm(1e4, 5)))
km <- KMeansClustering$new(clusters=2, b_size=30, max_clusters=6)
km$fit(data_set, find_optimal = FALSE)

## ------------------------------------------------
## Method `KMeansClustering$predict`
## ------------------------------------------------

data_set <- rbind(replicate(30, rnorm(1e4, 2)),
             replicate(30, rnorm(1e4, -1)),
             replicate(30, rnorm(1e4, 5)))
km <- KMeansClustering$new(clusters=2, b_size=30, max_clusters=6)
km$fit(data_set, find_optimal = FALSE)
preds <- km$predict(data_set)

MalikShahidSultan/machinelearning documentation built on May 9, 2022, 8:32 p.m.