clusterKmeans | R Documentation |
Clusters a whole data.frame automatically with k-means, which groups data points into distinct, non-overlapping subgroups. If needed, one hot encoding is applied to categorical values automatically. Considerations: scale or standardize the data before applying k-means. K-means also assumes roughly spherical clusters and does not work well when clusters have other shapes, such as elliptical ones.
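The preprocessing mentioned above can be sketched in base R. This is only an illustration under stated assumptions, not the package's internal code: `model.matrix()` stands in for the one hot encoding step, `scale()` for the normalization step, and the object names (`X`, `X_scaled`, `fit`) are hypothetical.

```r
# Minimal base-R sketch (assumption: NOT how clusterKmeans() is implemented)
data("iris")

# One hot encode the factor column. "~ . - 1" drops the intercept, so the
# factor Species gets one indicator column per level.
X <- model.matrix(~ . - 1, data = iris)

# Standardize every column to mean 0 and standard deviation 1 so that no
# single variable dominates the Euclidean distances used by k-means
X_scaled <- scale(X)

# Fit k-means on the encoded, standardized matrix
set.seed(123)
fit <- kmeans(X_scaled, centers = 3, nstart = 10)
table(fit$cluster)
```

Skipping the scaling step would let variables measured on larger scales dominate the distance computation, which is why standardizing first is recommended.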
clusterKmeans(
df,
k = NULL,
wss_var = 0,
limit = 15,
drop_na = TRUE,
ignore = NULL,
ohse = TRUE,
norm = TRUE,
algorithm = c("Hartigan-Wong", "Lloyd", "Forgy", "MacQueen"),
dim_red = "PCA",
comb = c(1, 2),
seed = 123,
quiet = FALSE,
...
)
df |
Dataframe |
k |
Integer. Number of clusters |
wss_var |
Numeric. Used to pick k automatically based on WSS variance (see the examples). |
limit |
Integer. How many clusters should be considered? |
drop_na |
Boolean. Should NA rows be removed? |
ignore |
Character vector. Names of columns to ignore. |
ohse |
Boolean. Do you wish to automatically run one hot encoding to non-numerical columns? |
norm |
Boolean. Should the data be normalized? |
algorithm |
Character. May be abbreviated. Note that "Lloyd" and "Forgy" are alternative names for one algorithm. |
dim_red |
Character. Select dimensionality reduction technique.
Pass any of: |
comb |
Vector. Which columns do you wish to plot? Select which two variables by name or column position. |
seed |
Numeric. Seed for reproducibility |
quiet |
Boolean. Keep quiet? If not, print messages. |
... |
Additional parameters to pass sub-functions. |
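The effect of seed and algorithm can be illustrated with stats::kmeans() directly. This sketch assumes both arguments are handed to stats::kmeans(), which this page does not state explicitly; `run_once()` is a hypothetical helper written for the illustration.

```r
# Sketch with base R only (assumption: clusterKmeans() relies on stats::kmeans())
data("iris")
X <- scale(iris[, 1:4])

# Hypothetical helper: fix the seed, then fit with the chosen algorithm
run_once <- function(seed, algorithm) {
  set.seed(seed)
  kmeans(X, centers = 3, algorithm = algorithm, iter.max = 50)
}

# Same seed and same algorithm give identical cluster assignments
a <- run_once(123, "Hartigan-Wong")
b <- run_once(123, "Hartigan-Wong")
identical(a$cluster, b$cluster)

# Algorithm names may be abbreviated: "Mac" is matched to "MacQueen"
m <- run_once(123, "Mac")
table(m$cluster)
```

K-means starts from randomly chosen centers, so without a fixed seed two runs may label or split the clusters differently.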
List. If no k is provided, contains nclusters and nclusters_plot to determine optimal k given their WSS (Within Groups Sum of Squares). If k is provided, additionally we get:
df: data.frame with original df plus cluster column
clusters: integer which is the same as k
fit: kmeans object used to fit clusters
means: data.frame with means and counts for each cluster
correlations: plot with correlations grouped by clusters
PCA: list with PCA results (when dim_red="PCA")
tSNE: list with t-SNE results (when dim_red="tSNE")
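To make the WSS criterion concrete, the sketch below computes the total within-cluster sum of squares for k = 1 up to a limit with stats::kmeans(), then applies a simple threshold on the relative improvement. The threshold rule and the names `rel_drop`, `threshold` and `k_auto` are assumptions made for illustration; the exact rule behind wss_var is not described on this page and may differ.

```r
# Sketch of a WSS ("elbow") search in base R (assumption: illustrative only)
data("iris")
X <- scale(iris[, 1:4])
limit <- 10

# Total within-cluster sum of squares for k = 1, ..., limit
set.seed(123)
wss <- sapply(seq_len(limit), function(k) {
  kmeans(X, centers = k, nstart = 10)$tot.withinss
})

# rel_drop[i] is the relative WSS reduction when going from i to i + 1 clusters
rel_drop <- -diff(wss) / wss[-limit]

# Hypothetical rule: keep the first k for which adding one more cluster
# reduces WSS by less than the threshold
threshold <- 0.05
below <- which(rel_drop < threshold)
k_auto <- if (length(below) > 0) below[1] else limit
k_auto
```

Plotting `wss` against `seq_len(limit)` gives the familiar elbow curve: WSS falls quickly at first and then flattens once extra clusters stop capturing real structure.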
Other Clusters: clusterOptimalK(), clusterVisualK(), reduce_pca(), reduce_tsne()
Sys.unsetenv("LARES_FONT") # Temporal
data("iris")
df <- subset(iris, select = c(-Species))
# If the dataset has more than 5 columns, consider reducing its dimensionality
# with reduce_pca() or reduce_tsne() first
# Find optimal k
check_k <- clusterKmeans(df, limit = 10)
check_k$nclusters_plot
# Or pick k automatically based on WSS variance
check_k <- clusterKmeans(df, wss_var = 0.05, limit = 10)
# You can also use our other functions:
# clusterOptimalK(df) and clusterVisualK(df)
# Run with selected k
clusters <- clusterKmeans(df, k = 3)
names(clusters)
# Cross-Correlations for each cluster
plot(clusters$correlations)
# PCA Results (when dim_red = "PCA")
plot(clusters$PCA$plot_explained)
plot(clusters$PCA$plot)
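For readers who want to see what the means and PCA components summarize, a rough base R equivalent is sketched here. It is an approximation under stated assumptions, not the package's code: the column `n` and the objects `means`, `pca` and `explained` are hypothetical names, and the actual returned objects may be structured differently.

```r
# Rough base-R equivalent of the "means" and "PCA" outputs (assumption:
# illustrative only, the package's own objects may differ)
data("iris")
df <- subset(iris, select = c(-Species))

# Fit k-means on standardized data and attach the cluster labels
set.seed(123)
fit <- kmeans(scale(df), centers = 3, nstart = 10)
df$cluster <- fit$cluster

# Mean of every variable per cluster, plus the number of rows in each cluster
means <- aggregate(. ~ cluster, data = df, FUN = mean)
means$n <- as.vector(table(df$cluster))
means

# PCA on the original variables; share of variance explained per component
pca <- prcomp(df[, names(df) != "cluster"], scale. = TRUE)
explained <- pca$sdev^2 / sum(pca$sdev^2)
round(explained, 3)
```

The first two columns of `pca$x` can then be plotted and colored by `df$cluster` to inspect how well the clusters separate in two dimensions.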