Create optimal k-means labels for a dataset.
kmean_samesize(df, k, nstart = 25)
df | dataset for which to generate k-means labels
k | number of desired clusters
nstart | number of k-means iterations used to create the centroids
1. Run the k-means algorithm on the numeric or integer variables of the dataset df.
2. Calculate the Euclidean distance between each row of df and each of the k centroids (x1,y1,z1,...), (x2,y2,z2,...), ..., (xk,yk,zk,...). This yields k sets of distances, one per centroid, so cbind.data.frame is used to combine them into a single data frame (euclidean_distance) in which each set is a variable with the same length as the original df.
3. Variables used:
cardinality_sample: number of elements to be assigned to each cluster (label).
ctrl_clstr_no_elmnts: counter used to control and balance the cluster sizes.
label: the resulting balanced label.
4. For each iteration (each df row):
4.1. Find the index of the minimum Euclidean distance in that row of the data frame from step 2. This index is defined as bestcluster. Add 1 to the element of ctrl_clstr_no_elmnts at that index, which counts one more observation intended for that cluster (one of the k clusters).
example:
iteration 1 - bestcluster = 2
ctrl_clstr_no_elmnts = 0 1 0 0 0 0 | row 1 of dataset euclidean_distance
iteration 2 - bestcluster = 2
ctrl_clstr_no_elmnts = 0 2 0 0 0 0 | row 2 of dataset euclidean_distance
iteration 3 - bestcluster = 3
ctrl_clstr_no_elmnts = 0 2 1 0 0 0 | row 3 of dataset euclidean_distance
iteration 4 - bestcluster = 2
ctrl_clstr_no_elmnts = 0 3 1 0 0 0 | row 4 of dataset euclidean_distance
4.2. Use an if statement to control and balance the size of each cluster based on ctrl_clstr_no_elmnts. If the element of ctrl_clstr_no_elmnts at index bestcluster is greater than cardinality_sample, set the variable at index bestcluster of the data frame from step 2 to NA. That cluster then no longer affects the next iterations, and only the columns of euclidean_distance whose ctrl_clstr_no_elmnts count is less than cardinality_sample remain available.
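The steps above can be sketched in R as follows. This is a minimal illustration under stated assumptions, not the package's source code: the function name balanced_labels_sketch is hypothetical, cardinality_sample is assumed to be the row count divided by k and rounded up, and a cluster is closed as soon as it reaches that size (the exact boundary comparison used by kmean_samesize may differ).

```r
# Hypothetical sketch of the balanced-label procedure described above.
balanced_labels_sketch <- function(df, k, nstart = 25) {
  # Step 1: k-means on the numeric (including integer) columns only.
  num <- as.matrix(df[, vapply(df, is.numeric, logical(1)), drop = FALSE])
  km <- stats::kmeans(num, centers = k, nstart = nstart)

  # Step 2: one column of Euclidean distances per centroid.
  euclidean_distance <- vapply(
    seq_len(k),
    function(j) sqrt(rowSums(sweep(num, 2, km$centers[j, ])^2)),
    numeric(nrow(num))
  )

  # Step 3: control variables (the rounding rule is an assumption).
  cardinality_sample <- ceiling(nrow(num) / k)
  ctrl_clstr_no_elmnts <- integer(k)
  label <- integer(nrow(num))

  # Step 4: give each row its nearest cluster that still has room.
  for (i in seq_len(nrow(num))) {
    bestcluster <- which.min(euclidean_distance[i, ])  # ignores NA columns
    ctrl_clstr_no_elmnts[bestcluster] <- ctrl_clstr_no_elmnts[bestcluster] + 1L
    label[i] <- bestcluster
    if (ctrl_clstr_no_elmnts[bestcluster] >= cardinality_sample) {
      euclidean_distance[, bestcluster] <- NA  # cluster is full
    }
  }
  label
}

# Usage: iris has 150 rows, so with k = 3 every cluster receives 50 rows.
table(balanced_labels_sketch(iris, k = 3))
```

Because rows are processed in order, early rows get their nearest centroid while later rows may be pushed to a more distant cluster once the nearest one is full.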
A balanced k-means label for each observation of the dataset.
The k-means algorithm assigns a label to each observation, but the resulting clusters do not necessarily contain the same number of observations. This function is used to obtain balanced clusters.
The k-means algorithm assigns the centroids randomly, based on the iterations of nstart. As a result, the order of the labels may differ each time this function runs.
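If reproducible labels are needed, a common approach in R is to fix the random seed before the call. The sketch below demonstrates this with stats::kmeans directly. Whether it carries over to kmean_samesize depends on the assumption that the function draws its randomness from R's standard random number generator.

```r
# Numeric columns of a built-in dataset, used only for illustration.
x <- as.matrix(iris[, 1:4])

set.seed(42)
first <- stats::kmeans(x, centers = 3, nstart = 25)$cluster

set.seed(42)
second <- stats::kmeans(x, centers = 3, nstart = 25)$cluster

# The same seed gives the same random starts, hence the same labels.
identical(first, second)
```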
It is recommended that the size of the population be divisible by the number of clusters k, so that each cluster has an equal number of elements.
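A quick way to check this condition before calling the function is integer division with remainder. The helper name check_divisible below is hypothetical and only illustrates the arithmetic. It does not describe how kmean_samesize handles a remainder.

```r
# Hypothetical helper: report the per-cluster size and any leftover rows.
check_divisible <- function(n, k) {
  list(
    per_cluster = n %/% k,  # rows each cluster can receive equally
    leftover = n %% k       # rows that prevent a perfectly equal split
  )
}

check_divisible(150, 3)  # per_cluster = 50, leftover = 0: equal clusters possible
check_divisible(103, 4)  # per_cluster = 25, leftover = 3: sizes cannot all match
```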
Eduardo Trujillo