kmean_samesize: kmean_samesize


Description

Create balanced (same-size) k-means labels for a dataset.

Usage

kmean_samesize(df, k, nstart = 25)

Arguments

df

dataset for which to generate k-means labels

k

number of desired clusters

nstart

number of k-means runs (random starts) used to create the centroids.

Details

  1. Run the k-means algorithm on the numeric or integer variables of the dataset df.

  2. Calculate the Euclidean distance between each row of df and each centroid.

  3. Variables used:

  4. For each iteration (each row of df):

     4.1. Find the index of the minimum Euclidean distance in that row of df2, defined as bestcluster, and add 1 to the entry of ctrl_clstr_no_elmnts at that index, so that the entry counts the observations assigned to that cluster (one entry per cluster, 1 to k).

     4.2. Use an if statement to control and balance the size of each cluster based on ctrl_clstr_no_elmnts. If the entry of ctrl_clstr_no_elmnts at index bestcluster is greater than cardinality_sample, set the column at index bestcluster of df2 to NA so that it does not affect the next iteration. Later rows then only consider the columns of euclidean_distance where ctrl_clstr_no_elmnts is less than cardinality_sample.
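The steps above can be sketched roughly as follows. This is a minimal illustration and not the package's source: the function name balanced_kmeans_sketch is hypothetical, cardinality_sample is assumed to be ceiling(nrow(df) / k), and a cluster is closed as soon as it reaches that cap (the text above says "greater than", so the real function may allow one more element per cluster).

```r
# Hypothetical sketch of the procedure in Details; not the udeploy source.
balanced_kmeans_sketch <- function(df, k, nstart = 25) {
  # Step 1: k-means on the numeric (including integer) columns only.
  num <- as.matrix(df[, vapply(df, is.numeric, logical(1)), drop = FALSE])
  km <- stats::kmeans(num, centers = k, nstart = nstart)

  # Step 2: Euclidean distance from every row to every centroid (n x k matrix).
  df2 <- vapply(seq_len(k), function(j) {
    sqrt(rowSums(sweep(num, 2, km$centers[j, ])^2))
  }, numeric(nrow(num)))

  # Step 3: bookkeeping variables (names follow the Details text).
  cardinality_sample <- ceiling(nrow(num) / k)  # assumed cap per cluster
  ctrl_clstr_no_elmnts <- integer(k)            # elements assigned so far
  labels <- integer(nrow(num))

  # Step 4: assign rows one by one, in data order.
  for (i in seq_len(nrow(num))) {
    # 4.1: nearest centroid still open (which.min ignores NA entries).
    bestcluster <- which.min(df2[i, ])
    labels[i] <- bestcluster
    ctrl_clstr_no_elmnts[bestcluster] <- ctrl_clstr_no_elmnts[bestcluster] + 1L

    # 4.2: once a cluster is full, blank its column for the remaining rows.
    if (ctrl_clstr_no_elmnts[bestcluster] >= cardinality_sample) {
      df2[, bestcluster] <- NA
    }
  }
  labels
}

# Example on a built-in dataset: the factor column Species is ignored.
set.seed(1)
labels <- balanced_kmeans_sketch(iris, k = 3)
table(labels)
```

Because rows are assigned greedily in data order, rows that appear late may be pushed to a farther centroid once their nearest cluster is full; shuffling the rows beforehand changes which observations are affected.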

Value

balanced k-means label for each observation of the dataset

Note

Author(s)

Eduardo Trujillo


1Edtrujillo1/udeploy documentation built on July 13, 2021, 9:12 p.m.