sidClustering.rfsrc: Clustering of Unsupervised Data

View source: R/sidClustering.rfsrc.R

Description:
Clustering of unsupervised data using SID (Mantero and Ishwaran, 2021). Also implements the artificial two-class approach of Breiman (2003).
Usage:

## S3 method for class 'rfsrc'
sidClustering(data,
              method = "sid",
              k = NULL,
              reduce = TRUE,
              ntree = 500,
              ntree.reduce = function(p, vtry){100 * p / vtry},
              fast = FALSE,
              x.no.sid = NULL,
              use.sid.for.x = TRUE,
              x.only = NULL, y.only = NULL,
              dist.sharpen = TRUE, ...)
Arguments:

data: A data frame containing the unsupervised data.

method: Clustering method. Default is "sid", which implements SID
  clustering. Use "sh", "sh1", or "sh2" for the Breiman/Shi-Horvath
  artificial two-class approach, and "unsupv" for a multivariate forest
  in which the data serve as both predictors and responses.

k: Requested number of clusters. Can be a single integer or a vector.
  If a scalar, returns a vector assigning each observation to a cluster.
  If a vector, returns a matrix with one column per requested value of k.

reduce: Logical. If TRUE (default), a variable reduction step based on
  holdout VIMP is applied before the main analysis. Accurate but
  computationally intensive. Applies only when method = "sid".

ntree: Number of trees used in the main SID clustering analysis.

ntree.reduce: Number of trees used in the holdout VIMP step during
  variable reduction. See holdout.vimp for details.

fast: Logical. If TRUE, uses the fast implementation rfsrc.fast in place
  of rfsrc. Improves speed at the cost of some accuracy.

x.no.sid: Variables to exclude from SID transformation. Can be either a
  separate data frame (not overlapping with data) or a character vector
  of variable names from data. These variables are included as features
  without SID processing. Applies only when method = "sid".

use.sid.for.x: Logical. If TRUE (default), the SID transformation is
  applied to the features; if FALSE, it is applied to the responses,
  which is slower. Applies only when method = "sid".

x.only: Character vector specifying which variables to use as features.
  Applies only when method = "unsupv".

y.only: Character vector specifying which variables to use as
  multivariate responses. Applies only when method = "unsupv".

dist.sharpen: Logical. If TRUE (default), Euclidean distance sharpening
  is applied to the forest distance matrix, improving clustering
  accuracy at additional computational cost. Applies only when
  method = "sid" or "unsupv".

...: Additional arguments passed to rfsrc (or rfsrc.fast when
  fast = TRUE).
Details:

Given an unsupervised dataset, a random forest is used to compute a distance matrix measuring dissimilarity between all pairs of observations. By default, hierarchical clustering is applied to this distance matrix, although users may apply any other clustering algorithm. See the examples below for alternative workflows.
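For instance, the forest distance can be handed to any distance-based clusterer. A minimal sketch, assuming the cluster package is available and using mtcars purely as an example:

## cluster the forest distance with PAM, or with a different linkage,
## instead of the default hierarchical clustering
library(randomForestSRC)
library(cluster)
o <- sidClustering(mtcars, k = 3)
pam.cl <- pam(as.dist(o$dist), k = 3)$clustering
hcl.cl <- cutree(hclust(as.dist(o$dist), method = "average"), k = 3)
table(pam.cl, hcl.cl)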
The default method, method = "sid", implements SID clustering (sidClustering). The algorithm begins by enhancing the original feature space using Staggered Interaction Data (SID). This transformation creates:
* SID main features: shifted and staggered versions of the original features that are made strictly positive and mutually non-overlapping in range;

* SID interaction features: pairwise multiplicative interactions formed between all SID main features.
A multivariate random forest is trained to predict SID main features using the SID interaction features as predictors. The rationale is that if a feature is informative for distinguishing clusters, it will exhibit systematic variation across the data space. Because each interaction feature is uniquely defined by the features it is formed from, node splits on interaction terms are able to capture and separate such variation, thus effectively identifying the clusters. See Mantero and Ishwaran (2021) for further details.
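To make the transformation concrete, here is a toy sketch of the SID construction for a small numeric data frame. This is an illustration only, not the package's internal implementation; the helper sidify below is hypothetical:

## toy illustration of the SID idea (not the package internals):
## shift each feature into its own strictly positive, non-overlapping
## range, then form all pairwise products as interaction features
sidify <- function(x) {
  x <- as.data.frame(scale(x))                         ## numeric features only
  p <- ncol(x)
  shifted <- lapply(x, function(xj) xj - min(xj) + 1)  ## strictly positive
  M <- max(unlist(shifted))                            ## common stagger width
  main <- sapply(seq_len(p), function(j) shifted[[j]] + (j - 1) * (M + 1))
  colnames(main) <- colnames(x)
  idx <- combn(p, 2)                                   ## all feature pairs
  inter <- apply(idx, 2, function(jk) main[, jk[1]] * main[, jk[2]])
  colnames(inter) <- apply(idx, 2, function(jk)
    paste(colnames(x)[jk], collapse = ":"))
  list(main = main, interactions = inter)
}
s <- sidify(mtcars[, c("mpg", "hp", "wt")])
head(s$interactions)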
Since SID includes all pairwise interactions, the dimensionality of the feature space grows quadratically with the number of original variables (or worse when factor variables are present). As such, the reduction step using holdout variable importance (VIMP) is strongly recommended (enabled by default). This step can be disabled using reduce = FALSE, but only when the original feature space is of manageable size.
A second approach, proposed by Breiman (2003) and refined by Shi and Horvath (2006), transforms the unsupervised task into a two-class supervised classification problem. The first class consists of the original data, while the second class is generated artificially. The goal is to separate real data from synthetic data. A proximity matrix is constructed from this supervised model, and the proximity values for the original class are extracted and converted into a distance matrix (distance = 1 - proximity) for clustering.
Artificial data can be generated using two modes: mode 1 (default) draws random values from the empirical distribution of each feature; mode 2 draws uniformly between the observed minimum and maximum of each feature.
This method is invoked by setting method = "sh", "sh1", or "sh2". Mantero and Ishwaran (2021) found that while this approach works in certain settings, it can fail when clusters exist in lower-dimensional subspaces (e.g., when defined by interactions or involving both factors and continuous variables). Among the two modes, mode 1 is generally more robust.
The third method, method = "unsupv", trains a multivariate forest using the data both as predictors and as responses. The multivariate splitting rule is applied at each node. This method is fast and simple but may be less accurate compared to SID clustering.
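Conceptually this amounts to a multivariate forest with the variables appearing on both sides of the formula. A rough hand-rolled sketch (illustrative only; not necessarily how the package implements it internally):

## sketch of the idea behind method = "unsupv": duplicate the variables
## so they act as both multivariate responses and predictors
library(randomForestSRC)
y <- mtcars
names(y) <- paste0(names(y), ".y")
f <- as.formula(paste("Multivar(", paste(names(y), collapse = ", "), ") ~ ."))
o <- rfsrc(f, data = data.frame(y, mtcars))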
The package includes a helper function sid.perf.metric for evaluating clustering performance using a normalized score; smaller values indicate better performance. See Mantero and Ishwaran (2021) for theoretical background and empirical benchmarking.
Value:

A list with the following components:

clustering: A vector or matrix assigning each observation to a cluster.
  If multiple values of k were requested, a matrix with one column per
  value of k.

rf: The trained random forest object used in the clustering procedure.
  This is typically a multivariate forest (for method = "sid" and
  "unsupv") or a two-class classification forest (for the Shi-Horvath
  approach).

dist: The distance matrix computed from the forest and used for
  clustering. For method = "sid" and "unsupv", Euclidean distance
  sharpening has been applied when dist.sharpen = TRUE.

sid: The SID-transformed data used in the clustering (applies only to
  method = "sid").
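As a quick orientation, the components can be inspected as follows (a sketch):

o <- sidClustering(mtcars, k = 3)
head(o$clustering)           ## cluster membership for each observation
print(o$rf)                  ## the underlying forest
class(o$dist)                ## forest distance used for the clustering
str(o$sid, max.level = 1)    ## SID-transformed data (method = "sid" only)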
Author(s):

Hemant Ishwaran and Udaya B. Kogalur
References:

Breiman, L. (2003). Manual on setting up, using and understanding random forest, V4.0. University of California, Berkeley, Statistics Department.

Mantero, A. and Ishwaran, H. (2021). Unsupervised random forests. Statistical Analysis and Data Mining, 14(2):144-167.

Shi, T. and Horvath, S. (2006). Unsupervised learning with random forest predictors. Journal of Computational and Graphical Statistics, 15(1):118-138.
See Also:

rfsrc, rfsrc.fast
Examples:

## ------------------------------------------------------------
## mtcars example
## ------------------------------------------------------------
## default SID method (no k specified, so clustering is returned as a
## matrix; here we use the 10-cluster solution in column 10)
o1 <- sidClustering(mtcars)
print(split(mtcars, o1$cl[, 10]))
## using the artificial two-class approach
o1.sh <- sidClustering(mtcars, method = "sh")
print(split(mtcars, o1.sh$cl[, 10]))
## ------------------------------------------------------------
## glass data set
## ------------------------------------------------------------
if (library("mlbench", logical.return = TRUE)) {
## this is a supervised problem, so we first strip the class label
data(Glass)
glass <- Glass
y <- Glass$Type
glass$Type <- NULL
## default SID call
o2 <- sidClustering(glass, k = 6)
print(table(y, o2$cl))
print(sid.perf.metric(y, o2$cl))
## compare with Shi-Horvath mode 1
o2.sh <- sidClustering(glass, method = "sh1", k = 6)
print(table(y, o2.sh$cl))
print(sid.perf.metric(y, o2.sh$cl))
## plain-vanilla unsupervised analysis
o2.un <- sidClustering(glass, method = "unsupv", k = 6)
print(table(y, o2.un$cl))
print(sid.perf.metric(y, o2.un$cl))
}
## ------------------------------------------------------------
## vowel data set
## ------------------------------------------------------------
if (library("mlbench", logical.return = TRUE) &&
library("cluster", logical.return = TRUE)) {
## strip the class label
data(Vowel)
vowel <- Vowel
y <- Vowel$Class
vowel$Class <- NULL
## SID
o3 <- sidClustering(vowel, k = 11)
print(table(y, o3$cl))
print(sid.perf.metric(y, o3$cl))
## compare to Shi-Horvath which performs poorly in
## mixed variable settings
o3.sh <- sidClustering(vowel, method = "sh1", k = 11)
print(table(y, o3.sh$cl))
print(sid.perf.metric(y, o3.sh$cl))
## Shi-Horvath improves with PAM clustering
## but still not as good as SID
o3.sh.pam <- pam(o3.sh$dist, k = 11)$clustering
print(table(y, o3.sh.pam))
print(sid.perf.metric(y, o3.sh.pam))
## plain-vanilla unsupervised analysis
o3.un <- sidClustering(vowel, method = "unsupv", k = 11)
print(table(y, o3.un$cl))
print(sid.perf.metric(y, o3.un$cl))
}
## ------------------------------------------------------------
## two-d V-shaped cluster (y=x, y=-x) sitting in 12-dimensions
## illustrates superiority of SID to Breiman/Shi-Horvath
## ------------------------------------------------------------
p <- 10
m <- 250
n <- 2 * m
std <- .2
x <- runif(n, 0, 1)
noise <- matrix(runif(n * p, 0, 1), n)
y <- rep(NA, n)
y[1:m] <- x[1:m] + rnorm(m, sd = std)
y[(m+1):n] <- -x[(m+1):n] + rnorm(m, sd = std)
vclus <- data.frame(clus = c(rep(1, m), rep(2,m)), x = x, y = y, noise)
## SID
o4 <- sidClustering(vclus[, -1], k = 2)
print(table(vclus[, 1], o4$cl))
print(sid.perf.metric(vclus[, 1], o4$cl))
## Shi-Horvath
o4.sh <- sidClustering(vclus[, -1], method = "sh1", k = 2)
print(table(vclus[, 1], o4.sh$cl))
print(sid.perf.metric(vclus[, 1], o4.sh$cl))
## plain-vanilla unsupervised analysis
o4.un <- sidClustering(vclus[, -1], method = "unsupv", k = 2)
print(table(vclus[, 1], o4.un$cl))
print(sid.perf.metric(vclus[, 1], o4.un$cl))
## ------------------------------------------------------------
## two-d V-shaped cluster using fast random forests
## ------------------------------------------------------------
o5 <- sidClustering(vclus[, -1], k = 2, fast = TRUE)
print(table(vclus[, 1], o5$cl))
print(sid.perf.metric(vclus[, 1], o5$cl))