The greenclust package implements a method of grouping/clustering the categories of a contingency table in a way that preserves as much of the original variance as possible. It is well-suited for reducing the number of levels of a categorical feature in logistic regression (or any other model having a categorical outcome), while still maintaining some degree of explanatory power.
It does this by iteratively collapsing the rows two at a time, similar to other agglomerative hierarchical clustering methods. At each step, it selects the pair of rows whose combination results in a new table with the smallest loss of chi-squared. This process is often referred to as "Greenacre's Method", particularly in the SAS community, after statistician Michael J. Greenacre.
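As a rough illustration of a single merge step, the base R sketch below scores every pair of rows by the chi-squared lost when they are combined and picks the cheapest pair. The helper names (chi_sq, best_merge) and the toy table are hypothetical, and this is not the package's actual implementation; it also assumes a plain Pearson statistic with no continuity correction.

```r
# Pearson chi-squared statistic of a contingency table (no continuity correction)
chi_sq <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  sum((tab - expected)^2 / expected)
}

# Find the pair of rows whose merge loses the least chi-squared
# (hypothetical helper, for illustration only)
best_merge <- function(tab) {
  total <- chi_sq(tab)
  pairs <- combn(nrow(tab), 2)
  loss <- apply(pairs, 2, function(p) {
    # Replace the two rows with a single row holding their summed counts
    merged <- rbind(tab[-p, , drop = FALSE], colSums(tab[p, , drop = FALSE]))
    total - chi_sq(merged)
  })
  best <- pairs[, which.min(loss)]
  list(rows = rownames(tab)[best], loss = min(loss))
}

# Toy table (made-up counts): rows "a" and "b" have identical proportions
toy <- matrix(c(10, 20, 5,
                30, 60, 40),
              ncol = 2,
              dimnames = list(c("a", "b", "c"), c("yes", "no")))

step <- best_merge(toy)
step$rows  # the pair selected for merging
step$loss  # chi-squared lost by merging that pair
```

Because rows "a" and "b" share the same yes/no proportions, merging them loses essentially no chi-squared, so they are the pair selected. Repeating this step until one row remains yields the full merge sequence.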
The returned object is an extended version of the hclust object used in the stats package and can be used similarly (plotted as a dendrogram, cut, etc.). Additional functions are provided in the package for automatic cutting, diagnostic plotting, and assigning derived clusters back to the source data.
The latest "official release" of greenclust is now available on CRAN and can be installed like any regular R package. Or you can install the latest development version from this GitHub repository using the devtools package:
# Install from CRAN
install.packages("greenclust")

# Install newest version (potentially still in development) from GitHub
# install.packages("devtools")
devtools::install_github("jeffjetton/greenclust")
The greenclust() function works like hclust(), only it accepts a contingency table rather than a dissimilarity matrix. For the purposes of this example, we'll merge the categorical features of the Titanic data set into a single, monolithic category.
# Combine Titanic passenger attributes into a single category
tab <- t(as.data.frame(apply(Titanic, 4:1, FUN=sum)))

# Remove rows with all zeros (not valid for chi-squared test)
tab <- tab[apply(tab, 1, sum) > 0, ]
This gives us a contingency table with several levels, showing the total number of passengers who survived (or not) at each level:
tab
From there, we can perform our clustering:
# Create greenclust tree object from table
library(greenclust)
grc <- greenclust(tab)

# Alternatively, to show details of each step:
# grc <- greenclust(tab, verbose=TRUE)

# Result can be plotted like any standard hclust tree
plot(grc)
The "height" in this case is the reduction in r-squared: that is, the proportion of chi-squared, relative to the original uncollapsed table, that is lost when the two categories are combined at each clustering step.
The package provides a diagnostic plotting function that shows the r-squared and chi-squared test p-value for each potential number of groups/clusters. This can be a useful tool when weighing the trade-off between fewer clusters and lower r-squared:
greenplot(grc)
When using this method, the customary "optimal" number of groups is found at the most-significant chi-squared test (i.e., lowest p-value). This point is automatically highlighted by greenplot().
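To see why a collapsed table can be more significant than the original, the base R sketch below computes the chi-squared statistic, its p-value, and the r-squared for one hypothetical grouping of a made-up table. The helper chi_stats, the counts, and the grouping are all illustrative assumptions, and the sketch uses an uncorrected Pearson statistic, which may not match what the package computes internally.

```r
# Pearson chi-squared statistic and p-value (no continuity correction)
chi_stats <- function(tab) {
  expected <- outer(rowSums(tab), colSums(tab)) / sum(tab)
  stat <- sum((tab - expected)^2 / expected)
  df <- (nrow(tab) - 1) * (ncol(tab) - 1)
  c(chi_squared = stat, p_value = pchisq(stat, df, lower.tail = FALSE))
}

# Made-up four-level table
toy <- matrix(c(10, 20, 5, 8,
                30, 60, 40, 50),
              ncol = 2,
              dimnames = list(c("a", "b", "c", "d"), c("yes", "no")))

# Hypothetical two-cluster grouping of the four levels
groups <- c(a = 1, b = 1, c = 2, d = 2)
collapsed <- rowsum(toy, groups)  # sum the rows within each cluster

full <- chi_stats(toy)
cut2 <- chi_stats(collapsed)

# Proportion of the original chi-squared retained after collapsing
r_squared <- cut2[["chi_squared"]] / full[["chi_squared"]]

full[["p_value"]]
cut2[["p_value"]]
r_squared
```

In this toy case the collapsed table keeps most of the chi-squared while using fewer degrees of freedom, so its p-value is lower than that of the full table. That trade-off is what the lowest-p-value rule is looking for.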
greencut() is essentially a version of cutree() that cuts a greenclust tree at the optimal level (mentioned above) by default:
greencut(grc)
Note that greencut() also includes the r-squared and p-value for that particular clustering level as vector attributes. If you want a different cut point, but would still like to have these attributes, you can override automatic selection by specifying either k (number of clusters) or h (height, or 1 - r-squared):
greencut(grc, k=3)
After clustering, you'll typically want to associate the resulting cluster numbers back to the original data. For example, if we clustered the feed supplements of the chickwts data based on the number of "underweight" chicks and then cut the tree, the resulting vector would have an element for each unique category level, rather than an element for each actual observation:
chick.table <- table(chickwts$feed, ifelse(chickwts$weight < 200, "Y", "N"))
chick.tree <- greenclust(chick.table)

# Use the default cut point
chick.clusters <- greencut(chick.tree)

# The resulting six-element vector shows the cluster number for each level
chick.clusters
assign.cluster() is a simple convenience function for expanding those cluster numbers back out:
chickwts$cluster <- assign.cluster(chickwts$feed, chick.clusters)

# Sample of data with new cluster numbers
chickwts[9:13, ]

# Observation counts by original level and new cluster
print(table(chickwts$feed, chickwts$cluster))
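Conceptually, this expansion is a lookup by level name. The base R sketch below shows the idea on made-up data, assuming the cluster vector is named by category level (as cutree() results are). It is not the package's implementation, and the feed and clusters values here are hypothetical.

```r
# One entry per observation (made-up data)
feed <- factor(c("casein", "linseed", "casein", "soybean"))

# One entry per category level, named by level (hypothetical cluster numbers)
clusters <- c(casein = 1L, linseed = 2L, soybean = 2L)

# Look up each observation's level in the named cluster vector
expanded <- unname(clusters[as.character(feed)])
expanded
```

Converting the factor with as.character() matters here: indexing with the factor directly would use its integer codes rather than the level names.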