distantia_cluster_kmeans: K-Means Clustering of Dissimilarity Analysis Data Frames

View source: R/distantia_cluster_kmeans.R

distantia_cluster_kmeans    R Documentation

K-Means Clustering of Dissimilarity Analysis Data Frames

Description

This function combines the dissimilarity scores computed by distantia(), the K-means clustering method implemented in stats::kmeans(), and the clustering optimization method implemented in utils_cluster_kmeans_optimizer() to group time series with similar features.

When clusters = NULL, the function utils_cluster_kmeans_optimizer() runs a parallelized grid search to find the number of clusters that maximizes the mean silhouette width of the clustering solution (see utils_cluster_silhouette()).
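
The idea behind a silhouette-based search can be illustrated with a minimal sketch. It does not call the package's optimizer. It assumes a toy distance matrix d built from random points, treats the rows of d as feature vectors for stats::kmeans() (which may differ from the package internals), and scores each candidate with cluster::silhouette(). The names xy, d, candidate_k, and mean_sil are illustrative only.

```r
# Illustrative only: a sequential silhouette-based search for k.
library(cluster)  # recommended package, provides silhouette()

set.seed(1)

# toy data (assumption): three well-separated groups of 2D points
xy <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.3), ncol = 2),
  matrix(rnorm(20, mean = 10, sd = 0.3), ncol = 2)
)

# square distance matrix, standing in for a psi distance matrix
d <- as.matrix(stats::dist(xy))

# candidate numbers of clusters to evaluate
candidate_k <- 2:6

# mean silhouette width of the K-means solution for each candidate
mean_sil <- vapply(
  X = candidate_k,
  FUN = function(k) {
    km <- stats::kmeans(x = d, centers = k, nstart = 10)
    sil <- cluster::silhouette(x = km$cluster, dist = stats::as.dist(d))
    mean(sil[, "sil_width"])
  },
  FUN.VALUE = numeric(1)
)

# number of clusters with the highest mean silhouette width
best_k <- candidate_k[which.max(mean_sil)]
best_k
```

The sketch runs the candidates one after another; the package documentation states that its own search is parallelized.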

This function supports a parallelization setup via future::plan(), and progress bars provided by the package progressr.
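
A possible session setup is sketched below. It assumes the packages future and progressr are installed; the choice of two workers is arbitrary, and the call to distantia_cluster_kmeans() is left commented out because it needs a distantia_df object such as the one built in the Examples.

```r
# Sketch only: parallelization and progress-bar setup (assumed workflow).
if (
  requireNamespace("future", quietly = TRUE) &&
  requireNamespace("progressr", quietly = TRUE)
) {

  # run parallelized steps on two background R sessions
  future::plan(future::multisession, workers = 2)

  # show progress for the code wrapped in with_progress()
  # progressr::with_progress({
  #   distantia_kmeans <- distantia_cluster_kmeans(df = distantia_df)
  # })

  # restore sequential processing when done
  future::plan(future::sequential)
}
```

For small inputs the overhead of starting workers may outweigh any speed gain, so a sequential plan can be the better choice.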

Usage

distantia_cluster_kmeans(df = NULL, clusters = NULL, seed = 1)

Arguments

df

(required, data frame) Output of distantia(), distantia_ls(), distantia_dtw(), or distantia_time_delay(). Default: NULL

clusters

(optional, integer) Number of groups to generate. If NULL (default), utils_cluster_kmeans_optimizer() is used to find the number of clusters that maximizes the mean silhouette width of the clustering solution (see utils_cluster_silhouette()). Default: NULL

seed

(optional, integer) Random seed to be used during the K-means computation. Default: 1
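
Why a seed matters can be shown with base R alone: stats::kmeans() picks random starting centers, so fixing the seed makes the result repeatable. The helper run_kmeans() below is hypothetical and assumes the seed is applied with set.seed() right before stats::kmeans(); the package may handle it differently.

```r
# Illustrative only: same seed, same K-means solution.

# toy feature matrix (assumption): 30 random points in two dimensions
set.seed(42)
x <- matrix(rnorm(60), ncol = 2)

# hypothetical helper: fix the seed, then run K-means
run_kmeans <- function(x, centers, seed) {
  set.seed(seed)
  stats::kmeans(x = x, centers = centers)
}

a <- run_kmeans(x, centers = 3, seed = 1)
b <- run_kmeans(x, centers = 3, seed = 1)

# TRUE when both runs assign every point to the same cluster
identical(a$cluster, b$cluster)
```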

Value

list:

  • cluster_object: kmeans object for further analyses and custom plotting.

  • clusters: integer, number of clusters.

  • silhouette_width: mean silhouette width of the clustering solution.

  • df: data frame with time series names, their cluster label, and their individual silhouette width scores.

  • d: psi distance matrix used for clustering.

  • optimization: only if clusters = NULL, data frame with optimization results from utils_cluster_kmeans_optimizer().

See Also

Other distantia_support: distantia_aggregate(), distantia_boxplot(), distantia_cluster_hclust(), distantia_matrix(), distantia_model_frame(), distantia_spatial(), distantia_stats(), distantia_time_delay(), utils_block_size(), utils_cluster_hclust_optimizer(), utils_cluster_kmeans_optimizer(), utils_cluster_silhouette()

Examples


#weekly covid prevalence in California
tsl <- tsl_initialize(
  x = covid_prevalence,
  name_column = "name",
  time_column = "time"
)

#subset 10 elements to accelerate example execution
tsl <- tsl_subset(
  tsl = tsl,
  names = 1:10
)

if(interactive()){
  #plotting first three time series
  tsl_plot(
    tsl = tsl[1:3],
    guide_columns = 3
  )
}

#dissimilarity analysis
distantia_df <- distantia(
  tsl = tsl,
  lock_step = TRUE
)

#k-means clustering
#automated number of clusters
distantia_kmeans <- distantia_cluster_kmeans(
  df = distantia_df,
  clusters = NULL
)

#names of the output object
names(distantia_kmeans)

#kmeans object
distantia_kmeans$cluster_object

#distance matrix used for clustering
distantia_kmeans$d

#number of clusters
distantia_kmeans$clusters

#clustering data frame
#group label in column "cluster"
distantia_kmeans$df

#mean silhouette width of the clustering solution
distantia_kmeans$silhouette_width

#kmeans plot
# factoextra::fviz_cluster(
#   object = distantia_kmeans$cluster_object,
#   data = distantia_kmeans$d,
#   repel = TRUE
# )
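
The examples above let the function choose the number of clusters. A variant with a fixed number of clusters might look like the sketch below. It uses only the arguments documented in Usage, repeats the setup so it can run on its own, and is skipped when the distantia package is not installed. The value clusters = 3 is arbitrary and the name kmeans_fixed is illustrative.

```r
# Sketch only: runs when the distantia package is installed.
if (requireNamespace("distantia", quietly = TRUE)) {

  library(distantia)

  #same data preparation as in the examples above
  tsl <- tsl_initialize(
    x = covid_prevalence,
    name_column = "name",
    time_column = "time"
  )

  tsl <- tsl_subset(
    tsl = tsl,
    names = 1:10
  )

  distantia_df <- distantia(
    tsl = tsl,
    lock_step = TRUE
  )

  #fixed number of clusters (arbitrary choice) and explicit seed
  kmeans_fixed <- distantia_cluster_kmeans(
    df = distantia_df,
    clusters = 3,
    seed = 1
  )

  #number of clusters in the solution
  print(kmeans_fixed$clusters)

  #cluster label of each time series
  print(kmeans_fixed$df)
}
```

According to the Value section, the optimization element is only returned when clusters = NULL, so it is not expected in kmeans_fixed.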

distantia documentation built on April 4, 2025, 5:42 a.m.