find_optimal_k: Clustering Functions
In ilangurudev/approxmapR: ApproxMAP (APPROXimate Multiple Alignment Pattern)

find_optimal_k

R Documentation

Clustering Functions

Description

A function which allows one to find the optimal K value to be used in a supported clustering algorithm based on a variety of clustering validation measures. Returns a dataframe with information regarding the clusters and a plot of the stated validation measure and K values.

Usage

  find_optimal_k(df_aggregated, clustering = 'k-nn', min_k = 2, max_k = 10,
                              use_cache = TRUE, save_table = TRUE, file_name = NULL , output_directory = "~")

Arguments

`df_aggregated`	A dataframe that has been aggregated using either aggregate_sequences(), or pre_aggregated() from approxmapR. This dataframe will have exactly 3 columns: id, period, and event.
`clustering`	The type of clustering algorithm to be used. Currently, K-Nearest Neighbors (clustering = 'k-nn') and K-Medoids (clustering = 'k-medoids') are supported.
`mink_k`	The starting K value.
`max_k`	The ending K value.
`use_cache`	A boolean value to indicate weather or not to use the cached distance matrix.
`save_table`	Default value is TRUE which will save the table as a CSV file.
`file_name`	Allows user to specify the file name for the table that is being saved, if nothing is specified then a default file name is used.
`output_directory`	The path to where the exports should be placed. This creates a folder with the name of "approxmap_results".

Value

Returns a dataframe with the K value, number of clusters, size of clusters, average silhouette width and it's 95 average distance between clusters, average distance within clusters / average distance between clusters, and the sum of average distance within clusters. Additionally, a plot is returned which show the validation measure of choice and it's corresponding K value.

Examples

  data("mvad")

  mvad %>%
    aggregate_sequences(format = "%Y-%m-%d",
                        unit = "month",
                        n_units = 1,
                        summary_stats = FALSE) %>%
    find_optimal_k(clustering = 'k-nn', min_k = 2, max_k = 10,
                   use_cache = TRUE)

ilangurudev/approxmapR documentation built on March 22, 2022, 1:15 p.m.