In JeffJetton/greenclust: Combine Categories Using Greenacre's Method

knitr::opts_chunk$set(
  collapse = TRUE,
  fig.path = "man/figures/README-"
)

greenclust

Count of CRAN downloads

Overview

The greenclust package implements a method of grouping/clustering the categories of a contingency table in a way that preserves as much of the original variance as possible. It is well-suited for reducing the number of levels of a categorical feature in logistic regression (or any other model having a categorical outcome), while still maintaining some degree of explanatory power.

It does this by iteratively collapsing the rows two at a time, similar to other agglomerative hierarchical clustering methods. At each step, it selects the pair of rows whose combination results in a new table with the smallest loss of chi-squared. This process is often refered to "Greenacre's Method", particularly in the SAS community, after statistician Michael J. Greenacre.

The returned object is an extended version of the hclust object used in the stats package and can be used similarly (plotted as a dendrogram, cut, etc.). Additional functions are provided in the package for automatic cutting, diagnostic plotting, and assigning derived clusters back to the source data.

Installation

The latest "official release" of greenclust is now available on CRAN and can be installed like any regular R package. Or you can install the latest development version from this GitHub repository using the devtools package:

# Install from CRAN
install.packages("greenclust")

# Install newest version (potentially still in development) from github
# install.packages("devtools")
devtools::install_github("jeffjetton/greenclust")

Examples

The greenclust() function works like hclust(), only it accepts a contingency table rather than a dissimilarity matrix. For the purposes of this example, we'll merge the categorical features of the Titanic data set into a single, monolithic category.

# Combine Titanic passenger attributes into a single category
tab <- t(as.data.frame(apply(Titanic, 4:1, FUN=sum)))

# Remove rows with all zeros (not valid for chi-squared test)
tab <- tab[apply(tab, 1, sum) > 0, ]

This gives us a contingency table with several levels, showing the total number of passengers who survived (or not) at each level:

tab

From there, we can perform our clustering:

# Create greenclust tree object from table
library(greenclust)
grc <- greenclust(tab)

# Alternatively, to show details of each step:
# grc <- greenclust(tab, verbose=TRUE)

# Result can be plotted like any standard hclust tree
plot(grc)

The "height" in this case is the reduction in r-squared. That is, the proportion of chi-squared, relative to the original uncollapsed table, that is lost when the two categories are combined at each clustering step.

Other Functions

greenplot

The package provides a diagnostic plotting function that shows the r-squared and chi-squared test p-value for each potential number of groups/clusters. This can be a useful tool when weighing the trade-off between fewer clusters and lower r-squared:

greenplot(grc)

When using this method, the customary "optimal" number of groups is found at most-significant chi-squared test (i.e., lowest p-value). This point is automatically highlighted by greenplot().

greencut

greencut() is essentially a version of cutree() that cuts a greenclust tree at the optimal level (mentioned above) by default:

greencut(grc)

Note that greencut() also includes the r-squared and p-value for that particular clustering level as vector attributes. If you want a different cut point, but would still like to have these attributes, you can override automatic selection by specifying either k (number of clusters) or h (height, or 1 - r-squared):

greencut(grc, k=3)

assign.cluster

After clustering, you'll typically want to associate the resulting cluster numbers back to the original data. For example, if we clustered the feed supplements of the chickwts data based on the number of "underweight" chicks and then cut the tree, the resulting vector would have an element for each unique category level, rather than an element for each actual observation:

chick.table <- table(chickwts$feed,
                     ifelse(chickwts$weight < 200, "Y", "N"))
chick.tree <- greenclust(chick.table)
# Use the default cut point
chick.clusters <- greencut(chick.tree)

# The resulting six-element vector shows the cluster number for each level
chick.clusters

assign.cluster() is a simple convenience function for expanding those cluster numbers back out:

chickwts$cluster <- assign.cluster(chickwts$feed, chick.clusters)

# Sample of data with new cluster numbers
chickwts[9:13, ]

# Observation counts by original level and new cluster
print(table(chickwts$feed, chickwts$cluster))