kmean_samesize: kmean_samesize
In 1Edtrujillo1/udeploy: Back-end & Front-end functions

Create balanced (same-size) k-means labels for a dataset.

kmean_samesize(df, k, nstart = 25)

`df`	dataset for which the k-means labels are generated
`k`	number of desired clusters
`nstart`	number of random starts k-means uses when creating the centroids

1. Run the k-means algorithm on the numeric or integer variables of the dataset df.
2. Calculate the Euclidean distance between each row (point) of df and each centroid.

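(x1,y1,z1,...),(x2,y2,z2,...),...,(xk,yk,zk,...) are the k centroids.
This yields k distance vectors, one per centroid. We therefore use cbind.data.frame to combine them into a single data frame in which each column holds the distances to one centroid and has as many entries as the original df has rows.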
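
The distance step can be illustrated with a minimal sketch. It is not the package source: the toy data, the seed, and the column names `centroid_1`, `centroid_2` are assumptions made for illustration, and only base R and `stats::kmeans` are used.

```r
# Toy data: six observations, two numeric variables (illustrative values only).
df <- data.frame(x = c(1, 2, 1.5, 8, 9, 8.5),
                 y = c(1, 1.5, 2, 8, 9, 9.5))
k <- 2

set.seed(1)  # assumed seed, only to make the random starts reproducible
fit <- stats::kmeans(df, centers = k, nstart = 25)

# One distance vector per centroid, each as long as df has rows.
dist_list <- lapply(seq_len(k), function(j) {
  centroid <- fit$centers[j, ]
  # sweep() subtracts the centroid from every row; then take the row norm.
  sqrt(rowSums(sweep(as.matrix(df), 2, centroid)^2))
})

# Bind the k vectors: rows = observations, columns = centroids.
euclidean_distance <- do.call(cbind.data.frame, dist_list)
names(euclidean_distance) <- paste0("centroid_", seq_len(k))

dim(euclidean_distance)
```

Each row of `euclidean_distance` can then be scanned for its smallest entry to find the closest centroid.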

3. Define the control variables:
cardinality_sample: number of elements assigned to each cluster (label).
ctrl_clstr_no_elmnts: counter of the elements already assigned to each cluster; it is used to keep the cluster sizes balanced.
label: the resulting balanced label.

4. For each iteration (each row of df):
4.1. Find the index of the minimum Euclidean distance in that row of the distance data frame of step 2. This index is bestcluster. Add 1 to the entry of ctrl_clstr_no_elmnts at that index, so the counter of that cluster increases by one.

4.2. Use an if statement to balance the cluster sizes based on ctrl_clstr_no_elmnts. If the entry of ctrl_clstr_no_elmnts at index bestcluster is greater than cardinality_sample, set the column of the distance data frame at index bestcluster to NA. The full cluster then cannot be chosen in later iterations, and only the columns of euclidean_distance whose ctrl_clstr_no_elmnts count is below cardinality_sample remain available.
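
The loop described above might look roughly like the following sketch. `balance_labels` is a hypothetical helper written for this illustration, not the function shipped in the package. It assumes the capacity is `ceiling(n / k)` and closes a cluster as soon as it reaches that capacity; the exact comparison in the package may differ.

```r
# Hypothetical helper (not the package source): greedy balanced assignment.
balance_labels <- function(euclidean_distance) {
  d <- as.matrix(euclidean_distance)
  n <- nrow(d)
  k <- ncol(d)
  cardinality_sample <- ceiling(n / k)   # assumed capacity per cluster
  ctrl_clstr_no_elmnts <- integer(k)     # running size of each cluster
  label <- integer(n)

  for (i in seq_len(n)) {
    # which.min() ignores NA, so a full cluster is never selected.
    bestcluster <- which.min(d[i, ])
    label[i] <- bestcluster
    ctrl_clstr_no_elmnts[bestcluster] <- ctrl_clstr_no_elmnts[bestcluster] + 1L
    # Once a cluster is full, blank its column for all remaining rows.
    if (ctrl_clstr_no_elmnts[bestcluster] >= cardinality_sample) {
      d[, bestcluster] <- NA
    }
  }
  label
}

# Toy distances: every row is closer to the first centroid.
euclidean_distance <- data.frame(
  c1 = c(0.1, 0.2, 0.3, 0.4),
  c2 = c(0.9, 0.8, 0.7, 0.6)
)
label <- balance_labels(euclidean_distance)
table(label)
```

Because rows are processed in order, the result of a greedy pass like this depends on the row order of the data: early rows get their closest centroid, later rows take whatever clusters still have room.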

A balanced k-means label for each observation of the dataset.

The k-means algorithm assigns a label to each observation, but the resulting clusters do not necessarily contain the same number of observations. This function exists to balance them.
The k-means algorithm chooses its initial centroids at random, with nstart controlling the number of random starts. As a result, the labels may come out in a different order each time this function runs.
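
If a stable label order is needed, one common approach is to fix the random seed before clustering. The sketch below shows the idea with `stats::kmeans` directly; the toy data and the seed value are assumptions, and whether seeding is sufficient for kmean_samesize itself depends on its implementation.

```r
# Toy data (illustrative values only).
df <- data.frame(x = c(1, 2, 1.5, 8, 9, 8.5),
                 y = c(1, 1.5, 2, 8, 9, 9.5))

# Same seed before each call, so both runs draw the same random starts.
set.seed(42)
first_run <- stats::kmeans(df, centers = 2, nstart = 25)$cluster

set.seed(42)
second_run <- stats::kmeans(df, centers = 2, nstart = 25)$cluster

identical(first_run, second_run)
```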
It is recommended that the number of observations be divisible by the number of clusters k, so that every cluster has the same number of elements.
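
A quick way to check divisibility before clustering is sketched below. The values of `n` and `k` are made up, and the `sizes` line only illustrates how unequal the clusters must be when there is a remainder; it does not describe how the package distributes leftover rows.

```r
n <- 10L  # number of observations (illustrative)
k <- 3L   # number of clusters (illustrative)

divisible <- n %% k == 0   # FALSE here: 10 is not a multiple of 3

# Most even split possible: the first (n %% k) clusters take one extra row.
sizes <- rep(n %/% k, k) + (seq_len(k) <= n %% k)
sizes
```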

Eduardo Trujillo

1Edtrujillo1/udeploy documentation built on July 13, 2021, 9:12 p.m.