library(kableExtra) library(knitr) options(knitr.table.format = "html")
Approxmap is an algorithm used for exploratory data analysis of sequential data. When one has longitudinal data and wants to find out the underlying patterns, one can use approxmap. approxmapR
aims to provide a consistent and tidy API for using the algorithm in R. This vignette aims to demonstrate the basic workflow of approxmapR.
Installation is simple.
install.packages("devtools") devtools::install_github("pinformatics/approxmapR")
To load the package use,
library(approxmapR)
Though it is not required, it is strongly encouraged to use the tidyverse
package as well. The approxmapR
was designed with the same paradigm and hence works cohesively with tidyverse
. To install and load,
install.packages("tidyverse")
To load,
library(tidyverse)
Now that everything is installed and loaded, let's jump into the problem and the motivation for using approxmap. Let's say you have a dataset that looks like this:
data("mvad") mvad
n_people <- length(unique(mvad$id))
The base data is from the package TraMineR
(it has been tweaked a little for our problem) and provides the employment status of r n_people
individuals for each month from 1993 to 1998. Now you are interested in answering the question, "What are the general sequences of employment that people go through?". Since there are r n_people
people, it is not possible for us to decipher what the patterns are by visual inspection. Looking for exact sequences that are present in all these people will get us nowhere.
So we need to methodically formulate the solution. Let's look at some key terms necessary for doing that:
Hence, every person (id) in our dataset represents a sequence. We therefore have r n_people
sequences. Though we can extract a general pattern from all these sequences, a better idea would be to
The above steps are the crux of the approxmap algorithm.
Let us go through the sequence of steps involved in analyzing the mvad dataset using approxmap. Please note that this version of approxmap only supports unique items within an itemset.
?function_name
to get detailed help.%>%
operator (Ctrl/Cmd + Shift + M). This means you can move from one function to another seamlessly.For the algorithm to create sequences, it needs data in the form:
data("pre_agg") pre_agg
So basically we need to aggregate the dataset i.e. go from dates to aggregations (called as period) in the package. For this we use the aggregate_sequences()
function. The aggregate sequences takes in a number of parameters. format
is used to specify the date format, unit
is used to specify the unit of aggregation - day, week, month and so on, n_units
is used to specify the number of units to aggregate. So if unit is "week" and n_units is 4, 4 weeks becomes the unit of aggregation. For more information please refer to the function documentation.
The function also displays some useful statistics about the sequences.
mvad %>% aggregate_sequences(format = "%Y-%m-%d", unit = "month", n_units = 1)
approxmapR
also allows for pre-aggregated data through the pre_aggregated()
function. This function ensures all the right classes are aplied to the data before moving on to the next steps.
pre_aggregated_df %>% pre_aggregated()
The next step involves one of the more computationally intensive steps of the algorithm - clustering. To cluster we simply need to pass an aggregated dataframe and the k
parameter, which refers to the number of nearest neighbours to consider while clustering. In essense, lower the k
value, higher the number of clusters and vice-versa. Selecting the right value of k is a judgement call that is very specific to the data.
Since it is a heavy task, we have used caching to store the results. What caching does is compares the aggregated dataframe to the one in memory and if it is identical, then uses the previously computed results to cluster. For turning caching off, use the parameter use_cache=FALSE
mvad %>% aggregate_sequences(format = "%Y-%m-%d", unit = "month", n_units = 3, summary_stats=FALSE) %>% cluster_knn(k = 15)
The output of the cluster_knn function is a dataframe with 3 columns -
Now that we have clustered, the next step is to calculate a weighted sequence for each cluster. We can do this using the get_weighted_sequence()
function. However, the filter_pattern()
function automatically does this for us. So all we need to do is call the filter_pattern()
with the required threshold and an optional pattern name (default is consensus).
The threshold
parameter is used to specify the specify the proportion of sequeneces the item must have been present in.
mvad %>% aggregate_sequences(format = "%Y-%m-%d", unit = "month", n_units = 1, summary_stats=FALSE) %>% cluster_knn(k = 15) %>% filter_pattern(threshold = 0.3, pattern_name = "variation")
We can also chain multiple filter_pattern()
functions to keep adding patterns.
results <- mvad %>% aggregate_sequences(format = "%Y-%m-%d", unit = "month", n_units = 3, summary_stats=FALSE) %>% cluster_knn(k = 15) %>% filter_pattern(threshold = 0.3, pattern_name = "variation") %>% filter_pattern(threshold = 0.4, pattern_name = "consensus")
Though all the algorithmic work is done, the output is hardly readable as the objects of interest are all present as classes. We have not "prettified" the output by design because concealing it would really inhibit additional explaratory possibilities. Instead, we have a simple function that can be called for this - format_sequence()
results %>% format_sequence()
Since markdown by default limits the screen content the important output gets truncated. So I have used another paramter called kable
which can be safely ignored if it doesn't make sense.
The format_sequence also has a parameter called compare
which when TRUE lists the patterns within a cluster row-by-row. The r function View()
can be chained to opened the output dataframe in the built-in viewer but sometimes the output strings are too large to be viewed there. So we can chain the readr function write_csv()
to save the output and explore the results in a text editor or excel.
(approxmap_results <- mvad %>% aggregate_sequences(format = "%Y-%m-%d", unit = "month", n_units = 3, summary_stats=FALSE) %>% cluster_knn(k = 15) %>% filter_pattern(threshold = 0.3, pattern_name = "variation") %>% filter_pattern(threshold = 0.4, pattern_name = "consensus") %>% format_sequence(compare=TRUE))%>% kable()
approxmap_results %>% write_csv("approxmap_results.csv")
tidyverse
to fully exploit approxmapRThe output is what is called as a tibble (a supercharged dataframe) that makes it possible to do things like storing a list of tibbles (df_sequences) in a column. To inspect the say the first 2 rows, we can use standard dplyr
commands.
df_sequences <- mvad %>% aggregate_sequences(format = "%Y-%m-%d", unit = "month", n_units = 1, summary_stats=FALSE) %>% cluster_knn(k = 15) %>% top_n(2) %>% pull(df_sequences) df_sequences
To explore these sequences, we also have tidy print methods. The functional programming toolkit for R, purrr
provides an efficient means to fully exploit such outputs.
df_sequences %>% map(function(df_cluster){ df_cluster %>% mutate(sequence = map_chr(sequence, format_sequence)) })
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.