nroKmeans | R Documentation |
K-means clustering for multi-dimensional data.
nroKmeans(data, k = 3, subsample = NULL, balance = 0, message = NULL)
data | A data frame or a matrix.
k | Number of centroids.
subsample | Number of randomly selected rows used during a single training cycle.
balance | Penalty parameter for size difference between clusters.
message | If positive, progress information is printed at the specified interval in seconds.
The K centroids are determined by Lloyd's algorithm with Euclidean distances or by using 1 - Pearson correlation as the distance measure.
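For readers unfamiliar with Lloyd's algorithm, the following base R code is a minimal sketch of the idea, not the Numero implementation. The names lloyd_sketch, dist_euclid and dist_pearson are hypothetical helpers introduced only for illustration. The sketch alternates between assigning each row to its nearest centroid and moving each centroid to the mean of its assigned rows.

```r
# Hypothetical helpers for illustration; none of these are part of Numero.

# Euclidean distance from every row of x to every centroid (n x k matrix).
dist_euclid <- function(x, centroids) {
  sapply(seq_len(nrow(centroids)), function(j) {
    sqrt(rowSums(sweep(x, 2, centroids[j, ])^2))
  })
}

# 1 - Pearson correlation between every row of x and every centroid.
# Only meaningful when the data have several columns.
dist_pearson <- function(x, centroids) {
  1 - cor(t(x), t(centroids))
}

lloyd_sketch <- function(x, k, dist_fun = dist_euclid, iters = 20) {
  x <- as.matrix(x)
  # Start from k distinct, randomly chosen rows.
  centroids <- x[sample(nrow(x), k), , drop = FALSE]
  for (i in seq_len(iters)) {
    # Assignment step: index of the nearest centroid for each row.
    d <- dist_fun(x, centroids)
    labels <- max.col(-d, ties.method = "first")
    # Update step: move each centroid to the mean of its members.
    for (j in seq_len(k)) {
      members <- x[labels == j, , drop = FALSE]
      if (nrow(members) > 0) centroids[j, ] <- colMeans(members)
    }
  }
  list(centroids = centroids, labels = labels)
}

# Toy data: two well-separated groups of 20 points each.
set.seed(1)
x <- rbind(matrix(rnorm(40, mean = 0, sd = 0.1), ncol = 2),
           matrix(rnorm(40, mean = 5, sd = 0.1), ncol = 2))
fit <- lloyd_sketch(x, k = 2)
print(table(fit$labels))
```

The distance function is passed as an argument so that the two measures mentioned above can be swapped without changing the loop.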
If subsample is less than the number of data rows, a random subset of the specified size is used for each training cycle. By default, subsample is set automatically depending on the size of the dataset.
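As a rough illustration of per-cycle subsampling, the base R sketch below draws the row indices for one training cycle. The helper name cycle_rows is hypothetical, and the rule Numero uses to choose the default subsample size is not reproduced here.

```r
# Hypothetical helper for illustration; not part of Numero.
# Returns the row indices used in one training cycle.
cycle_rows <- function(n, subsample) {
  if (subsample >= n) {
    # Subsample covers the whole dataset: use every row.
    seq_len(n)
  } else {
    # Otherwise draw a fresh random subset without replacement.
    sample.int(n, subsample)
  }
}

set.seed(1)
rows_small <- cycle_rows(n = 100, subsample = 10)
rows_all <- cycle_rows(n = 5, subsample = 10)
print(rows_small)
print(rows_all)
```

Calling the helper once per cycle gives a different subset each time, so every row can still influence the centroids over many cycles.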
If balance = 0.0, the algorithm is applied with no balancing; if balance = 1.0, all the clusters are forced to be of equal size. Intermediate values are permitted. Note that if subsampling is applied, balancing may become less accurate.
A list with named elements: centroids is a matrix of the main results, layout contains the best-matching centroid labels and model residuals for each usable data point, and history is the chronological record of training errors. The subsampling parameter that was used during training is stored in the element subsample.
# Import data.
fname <- system.file("extdata", "finndiane.txt", package = "Numero")
dataset <- read.delim(file = fname)
# Prepare training data.
trvars <- c("CHOL", "HDL2C", "TG", "CREAT", "uALB")
trdata <- scale.default(dataset[,trvars])
# Unbalanced K-means clustering.
km0 <- nroKmeans(data = trdata, k = 5, balance = 0.0)
print(table(km0$layout$BMC))
print(km0$centroids)
# Balanced K-means clustering.
km1 <- nroKmeans(data = trdata, k = 5, balance = 1.0)
print(table(km1$layout$BMC))
print(km1$centroids)