Short R Tutorial: Validating Sequence Analysis Typologies To be Used in Subsequent Regression'
In WeightedCluster: Clustering of Weighted Data

Introduction

knitr::knit_hooks$set(time_it = local({
  now <- NULL
  function(before, options) {
    if (before) {
      # record the current time before each chunk
      now <<- Sys.time()
    } else {
      # calculate the time difference after a chunk
      res <- difftime(Sys.time(), now, units = "secs")
      # return a character string to show the time
      paste("Time for this code chunk to run:", round(res,
        2), "seconds")
    }
  }
}))
knitr::opts_chunk$set(dev = "png", dev.args = list(type = "cairo-png"), time_it=TRUE)

This document provides a very quick introduction to the R code needed to estimate the quality of a typology in a subsequent regression or when the relationship between the typology and the covariate is of key interest. Readers interested in the methods and the exact interpretation of the results are referred to:

Unterlerchner, L., Studer, M. & Gomensoro, A. (2023). Back to the Features. Investigating the Relationship Between Educational Pathways and Income Using Sequence Analysis and Feature Extraction and Selection Approach. Swiss Journal of Sociology, 49(2), 417–446. https://doi.org/10.2478/sjs-2023-0021

You are kindly asked to cite the above reference if you use the methods presented in this document.

Let's start by setting the seed for reproducible results.

set.seed(1)

Creating the State Sequence Object

For this example, we use the mvad dataset. Let's start with the creation of the state sequence object.

## Loading the TraMineR library
library(TraMineR)
## Loading the data
data(mvad)

## State properties
mvad.alphabet <- c("employment", "FE", "HE", "joblessness", "school", "training")
mvad.lab <- c("employment", "further education", "higher education", "joblessness", "school", "training")
mvad.shortlab <- c("EM","FE","HE","JL","SC","TR")

## Creating the state sequence object
mvad.seq <- seqdef(mvad, 17:86, alphabet = mvad.alphabet, states = mvad.shortlab, labels = mvad.lab, xtstep = 6)

Creating the typology

We will now create a typology using cluster analysis. Readers interested in more detail are referred to the WeightedCluster library manual (also available as a vignette), which goes into the details of the creation and computation of cluster quality measures.

We start by computing dissimilarities with the seqdist function using the Hamming distance. We then use Ward clustering to create a typology of the trajectories. For this step, we recommend the use of the fastcluster library [@R-fastcluster], which considerably speed up the computations.

## Using fastcluster for hierarchical clustering
library(fastcluster)
## Distance computation
diss <- seqdist(mvad.seq, method="LCS")
## Hierarchical clustering
hc <- hclust(as.dist(diss), method="ward.D")

We can now compute several cluster quality indices using as.clustrange function from two to ten groups.

# Loading the WeightedCluster library
library(WeightedCluster)
# Computing cluster quality measures.
clustqual <- as.clustrange(hc, diss=diss, ncluster=10)
clustqual

The `clustassoc` function

In this example, we will focus on the association between father unemployment status (funemp variable) and our school-to-work trajectories. The clustassoc function provides several indicators of the quality of typology to study this association.

The function takes a clustrange object as the first argument. The diss argument specifies the distance matrix used for clustering, covar the covariate of association of interest, and weights an optional case weights vector.

cla <- clustassoc(clustqual, diss=diss, covar=mvad$funemp)
cla

The resulting object presents three indicators. The Unaccounted column shows the share of the direct association between the trajectories and the covariates that is \alert{not accounted for} by the typology. This computation are based on the discrepancy analysis framework [@StuderRitschardGabadinhoMuller2011SMR]. A low value means that the typology carries most of the information that is relevant to study the association between our covariate and the trajectories.

The Remaining column presents the share of the overall variability of the trajectories that is \alert{not accounted for} by the typology. A low value indicates that there is no variation left not explained by the typology. Warning, this is usually a very low value. The value presented in the "No clustering" row (the first) is equivalent to the pseudo-$R^2$ of a discrepancy analysis between the trajectories and the covariates.

The BIC column presents the Bayesian Information Criterion for the association between the typology and the covariate (again the lower the better). While the first column provides the most reliable information, the BIC might be useful when parcimony is of key interest.

The general idea is to select a cluster solution with low values on Unaccounted and BIC (only if relevant).

The results can be plotted to make it easier to find the minimum.

plot(cla, main="Unaccounted")

According to the plot, at least 6 groups are required. However, around one fifth of the association is left un-reproduced by the clustering. It might be interesting to compare the 5 and 6 clusters solutions to better understand the association.

seqdplot(mvad.seq, group=clustqual$clustering$cluster5, border=NA)

seqdplot(mvad.seq, group=clustqual$clustering$cluster6, border=NA)

We can notice that the 6 cluster solutions contains a new joblessness cluster, which is found to be important to study the association between father unemployment and son school-to-work trajectories.

The presented might lead to different recommendations than the usual cluster quality indices, because it focuses on a relationship with a covariate. The method often suggests a higher number of groups.