knitr::opts_chunk$set(echo = TRUE)

Getting started

The first thing that needs to be completed is installing ApproxmapR if it is not already done.

Installing ApproxmapR

At this time ApproxmapR is not on CRAN and must be installed using the GitHub version. This can be done with the following code:

library(devtools)
devtools::install_github("pinformatics/approxmapR")

The above code will install the latest version.

Loading Library

Now to load the required libraries.

library(approxmapR)
library(kableExtra)
library(knitr)
options(knitr.table.format = "html")
library(tidyverse)

Loading Data Frame

ApproxmapR comes with a few example data sets that can be loaded. For this example, the "demo1" data set will be used. The data needs to have minor manipulation to get it ready.

data("demo1")
demo1 <- data.frame(do.call("rbind", strsplit(as.character(demo1$id.date.item), ",")))
names(demo1) <- c("id", "period", "event")

head(demo1)

The Process

As can be seen above, the data set has 3 columns: "id", "period", and "event". The "id" column is an identifier, the "period" is a date variable which represents the period (a grouping variable), and "event" which is the event that occurred. These three column names and order are specific and required.

Aggregating the Data

There are two functions available to aggregate the data, pre_aggregated() and aggregate_sequences(). The pre_aggregated() function is used on data were the "period" column is a single identifier (ex. A, B, C, etc) while the aggregate_sequences() function is used on datetime or date data and will aggregate by "day", "week", "month", "6 months", and "year".

agg_df <-  demo1 %>% aggregate_sequences(format = "%m/%d/%Y",
                                         unit = "month",
                                         n_units = 1,
                                         summary_stats = FALSE)

head(agg_df)

For documentation use ??aggregate_sequences() or ??pre_aggregated(). The -n_units- is specifying how many -units- to be considered within the period. The above indicates the events are to be aggregated based on the month and to considered 1 month as a period. If -n_units- were to equal 2, then that would indicate to aggregate based on the month and to considered 2 months as a period.

Finding Optimal K Value

This step is not required but is highly encouraged. By finding the optimal K value, one can proceed with a piece of mind knowing that the best clustering has been achieved. Currently two clustering algorithms are supported, K-NN and K-Medoids.

The -find_optimal_k()- function will output a data frame with all the tested k values, the number of clusters produced, the size of each cluster, all of the supported cluster validation measures, and a plot of K values by validation measure. For more information use ??find_optimal_k.

A special note about the -use_cache- parameter within the -find_optimal_k()- function. This parameter will indicate that the distance matrix calculated should be one that has already been calculated; if it does not yet exist then the distance matrix will be calculated and stored as an environment variable. For those interested, this environment variable can be accessed by using:

env_dm$distance_matrix

This feature is beneficial when use massive data sets as distance calculation is expensive in terms of processing.

Example: Optimal K using K-Medoids

ktable <- agg_df %>% find_optimal_k(clustering = 'k-medoids',
                                    min_k = 2, max_k = 8,
                                    save_table = TRUE,
                                    use_cache = TRUE)

One can either view the ktable object and manually search for the optimal value based on the preferred cluster validation measure, or one can visualize the table. By default, the -plot_ktable()- function will plot the silhouette index and save the graph to the default location; if desired, one can specify where the graph saves to by using the output_directory parameter.

plot_ktable(ktable)

Based on the average silhouette measure, K = 6 should be used for the K-Medoids algorithm. We can further assess the fit of clusters, if using the silhouette index, by taking a look at the silhouette object which is stored within the ktable object. Additionally, we can visualize this graphically as well.

In this example, to review the silhouette object use with the optimal K use:

ktable$silhouette_object[[5]]

By visualizing the information graphically, the function produces a silhouette plot for the individual sequence as well as the cluster. For more information on this function use ??plot_silhouette.

plot_silhouette(ktable$silhouette_object[[5]])

Clustering

Currently there are two clustering algorithms supported, K-NN and K-Medoids, and each have their own separate clustering function. To run the K-NN clustering algorithm use -cluster_knn()-; to run the K-Medoids algorithm use -cluster_kmedoids()-.

clustered_kmed <- agg_df %>% cluster_kmedoids(k = 6, use_cache = TRUE)

The Clustered Data Frame

The resulting data frame will show 3 columns: "cluster", "df_sequences", and "n". Cluster is the cluster number, df_sequences contains a data frame which has the original id and the sequence resulting from that id, and n is the number of sequences within the cluster.

clustered_kmed

Reading the first row one sees that it is cluster #5 which has 3 sequences within the cluster. Let's take a quick look at the df_sequences column for more intuition.

clustered_kmed$df_sequences[[1]]

Now to take a look at the sequences themselves.

clustered_kmed$df_sequences[[1]]$sequence

For a bit more intuition on the sequences themselves a closer look at the period and events where id = 5.

agg_df %>% filter(id == 5)

Notice how events B, I, and J are within the same set - as well as events K and L are in the same set, but different than B, I, and J. This is because events B, I, and J occurred during the first period, while events K and L occurred during a different period.

Pulling the Pattern

Next step is to pull the pattern, this is completed by specifying a strength threshold. This pulls the events within a weighted sequence that is shared by the majority of sequences within a cluster; for more information please see Kum, H. C., Pei, J, Wang, W., and Duncan, D. (2002). ApproxMAP: Approximate Mining of Consensus Sequential Patterns. Technical Report TR02-031, UNC-CH, 2002. Published in Mining Sequential Patterns from Large Data Sets. Springer Series: The Kluwer International Series on Advances in Database Systems, Vol. 28. Wang, W. & Yang, J. pp. 138-160.

patterns_kmed <- clustered_kmed %>%
                    filter_pattern(threshold = 0.3,
                                   pattern_name = 'consensus')

Generating Reports

The final step is to generate a report based on the threshold.

patterns_kmed %>% generate_reports(end_filename_with = "_kmed")


ilangurudev/approxmapR documentation built on March 22, 2022, 1:15 p.m.