find_optimal_k: Clustering Functions

View source: R/4_cluster.R

find_optimal_k R Documentation

Clustering Functions

Description

Finds the optimal K value for a supported clustering algorithm, based on a variety of clustering validation measures. Returns a dataframe with information about the clusters at each K value, and a plot of the chosen validation measure against the K values.
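The general idea of choosing K from a validation measure can be sketched outside the package. The example below is a minimal sketch, not approxmapR's implementation: it assumes toy numeric data, uses pam() from the cluster package as a stand-in K-Medoids algorithm, and uses average silhouette width as the only measure. The helper name sweep_k and the toy data are hypothetical.

```r
library(cluster)  # pam(); a recommended package distributed with R

# Toy data: three well-separated groups of 20 points (assumption, not package data)
set.seed(42)
toy <- rbind(
  matrix(rnorm(40, mean = 0, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 3, sd = 0.3), ncol = 2),
  matrix(rnorm(40, mean = 6, sd = 0.3), ncol = 2)
)
toy_dist <- dist(toy)

# Hypothetical helper: average silhouette width for each K in min_k..max_k
sweep_k <- function(d, min_k = 2, max_k = 10) {
  k_values <- min_k:max_k
  avg_sil <- vapply(k_values, function(k) {
    fit <- pam(d, k = k, diss = TRUE)  # K-Medoids on a precomputed distance object
    fit$silinfo$avg.width              # average silhouette width for this K
  }, numeric(1))
  data.frame(k = k_values, avg_silhouette_width = avg_sil)
}

k_table <- sweep_k(toy_dist, min_k = 2, max_k = 6)

# Pick the K with the largest average silhouette width
best_k <- k_table$k[which.max(k_table$avg_silhouette_width)]
```

Plotting k_table$avg_silhouette_width against k_table$k gives a chart of the same kind as the one described here, with the peak marking the candidate K.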

Usage

  find_optimal_k(df_aggregated, clustering = 'k-nn', min_k = 2, max_k = 10,
                 use_cache = TRUE, save_table = TRUE, file_name = NULL,
                 output_directory = "~")

Arguments

df_aggregated

A dataframe that has been aggregated using either aggregate_sequences() or pre_aggregated() from approxmapR. This dataframe has exactly 3 columns: id, period, and event.

clustering

The type of clustering algorithm to be used. Currently, K-Nearest Neighbors (clustering = 'k-nn') and K-Medoids (clustering = 'k-medoids') are supported.

min_k

The starting K value.

max_k

The ending K value.

use_cache

A boolean value indicating whether or not to use the cached distance matrix.

save_table

A boolean value indicating whether to save the table as a CSV file. The default is TRUE.

file_name

The file name for the table that is being saved. If nothing is specified, a default file name is used.

output_directory

The path where the exports should be placed. A folder named "approxmap_results" is created.
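How file_name and output_directory might combine into an export path can be sketched as follows. This is an illustration under stated assumptions, not the package's code: the helper build_export_path and the fallback name "optimal_k.csv" are hypothetical, and the sketch assumes the "approxmap_results" folder is created inside output_directory.

```r
# Hypothetical helper: build the CSV path from output_directory and file_name
build_export_path <- function(output_directory = "~", file_name = NULL) {
  # path.expand() resolves a leading "~" to the user's home directory
  results_dir <- file.path(path.expand(output_directory), "approxmap_results")
  if (is.null(file_name)) {
    file_name <- "optimal_k.csv"  # assumed fallback, not the package's actual default
  }
  file.path(results_dir, file_name)
}

# Write a small table under a temporary directory instead of the home directory
export_path <- build_export_path(tempdir(), "k_table.csv")
dir.create(dirname(export_path), recursive = TRUE, showWarnings = FALSE)
write.csv(data.frame(k = 2:4), export_path, row.names = FALSE)
```

Using tempdir() here keeps the sketch from writing into the home directory, which is what the default output_directory = "~" points to.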

Value

Returns a dataframe with the K value, number of clusters, size of clusters, average silhouette width and its 95% confidence interval, average distance between clusters, the ratio of average distance within clusters to average distance between clusters, and the sum of average distance within clusters. Additionally, a plot is returned which shows the validation measure of choice and its corresponding K value.
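The distance-based measures listed above can be illustrated with base R. The sketch below uses one common definition (mean pairwise distance over same-cluster pairs versus different-cluster pairs); whether find_optimal_k computes them in exactly this way is an assumption. The toy points and labels are hypothetical.

```r
# Toy data: two tight groups of 10 points with known cluster labels (assumption)
set.seed(1)
pts <- rbind(
  matrix(rnorm(20, mean = 0, sd = 0.2), ncol = 2),
  matrix(rnorm(20, mean = 5, sd = 0.2), ncol = 2)
)
labels <- rep(c(1, 2), each = 10)

d <- as.matrix(dist(pts))

# TRUE where both points of a pair carry the same cluster label
same_cluster <- outer(labels, labels, "==")

# Upper triangle only: each pair counted once, diagonal excluded
pair <- upper.tri(d)

avg_within  <- mean(d[pair & same_cluster])   # average distance within clusters
avg_between <- mean(d[pair & !same_cluster])  # average distance between clusters
wb_ratio    <- avg_within / avg_between       # smaller means tighter, better-separated clusters
```

A ratio well below 1 indicates that points sit much closer to members of their own cluster than to members of other clusters.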

Examples

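  library(approxmapR)
  library(dplyr)  # provides the %>% pipe

  data("mvad")

  # Aggregate events into one-month periods, then evaluate K = 2 through 10
  mvad %>%
    aggregate_sequences(format = "%Y-%m-%d",
                        unit = "month",
                        n_units = 1,
                        summary_stats = FALSE) %>%
    find_optimal_k(clustering = 'k-nn', min_k = 2, max_k = 10,
                   use_cache = TRUE)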

ilangurudev/approxmapR documentation built on March 22, 2022, 1:15 p.m.