find_optimal_k | R Documentation |
Finds the optimal K value for a supported clustering algorithm, based on a variety of clustering validation measures. Returns a dataframe with information about the clusters and a plot of the chosen validation measure against the K values.
find_optimal_k(df_aggregated, clustering = 'k-nn', min_k = 2, max_k = 10, use_cache = TRUE, save_table = TRUE, file_name = NULL, output_directory = "~")
df_aggregated |
A dataframe that has been aggregated using either aggregate_sequences() or pre_aggregated() from approxmapR. This dataframe will have exactly 3 columns: id, period, and event. |
clustering |
The type of clustering algorithm to be used. Currently, K-Nearest Neighbors (clustering = 'k-nn') and K-Medoids (clustering = 'k-medoids') are supported. |
min_k |
The smallest K value to evaluate. Defaults to 2. |
max_k |
The largest K value to evaluate. Defaults to 10. |
use_cache |
A boolean value indicating whether or not to use the cached distance matrix. |
save_table |
A boolean value indicating whether to save the table as a CSV file. Defaults to TRUE. |
file_name |
The file name for the table that is being saved. If nothing is specified, a default file name is used. |
output_directory |
The path where the exports should be placed. A folder named "approxmap_results" is created for them. |
Returns a dataframe with the K value, the number of clusters, the size of the clusters, the average silhouette width and its 95% confidence interval, the average distance between clusters, the ratio of average distance within clusters to average distance between clusters, and the sum of average distances within clusters. Additionally, a plot is returned which shows the validation measure of choice against its corresponding K value.
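To make one of the validation measures above concrete, the following is a minimal sketch of how an average silhouette width can be computed over a range of K values. It is not the approxmapR implementation: it assumes K-Medoids via pam() from the cluster package, uses simulated 2-D points in place of sequence distances, and all variable names (points, dist_matrix, validation, best_k) are hypothetical.

```r
library(cluster)  # provides pam(); a recommended package bundled with most R installs

set.seed(42)

# Toy data standing in for aggregated sequences (assumption):
# three well-separated groups of 20 points each.
points <- rbind(
  matrix(rnorm(40, mean = 0,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 5,  sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 10, sd = 0.3), ncol = 2)
)

# Compute the distance matrix once and reuse it for every K,
# which is the idea behind caching a distance matrix.
dist_matrix <- dist(points)

k_values <- 2:6

# For each K, run K-Medoids on the distances and record the
# average silhouette width that pam() reports.
avg_sil_width <- sapply(k_values, function(k) {
  fit <- pam(dist_matrix, k = k, diss = TRUE)
  fit$silinfo$avg.width
})

validation <- data.frame(k = k_values, avg_silhouette_width = avg_sil_width)

# The K with the largest average silhouette width is the candidate optimum.
best_k <- validation$k[which.max(validation$avg_silhouette_width)]

# Plot the validation measure against K.
plot(validation$k, validation$avg_silhouette_width, type = "b",
     xlab = "K", ylab = "Average silhouette width")
```

Because the toy data contains three clearly separated groups, the silhouette curve should peak at K = 3. On real sequence data the peak is usually less pronounced, which is why it helps to inspect several measures together.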
data("mvad")

mvad %>%
  aggregate_sequences(format = "%Y-%m-%d",
                      unit = "month",
                      n_units = 1,
                      summary_stats = FALSE) %>%
  find_optimal_k(clustering = 'k-nn',
                 min_k = 2,
                 max_k = 10,
                 use_cache = TRUE)