test_hier_clusters_exact: Exact significance test for hierarchical clustering

View source: R/trunc_inf.R

test_hier_clusters_exact    R Documentation

Description

This tests the null hypothesis of no difference in means between clusters k1 and k2 at level K in a hierarchical clustering. (The K clusters are numbered as per the results of the cutree function in the stats package.)

Usage

test_hier_clusters_exact(
  X,
  link,
  hcl,
  K,
  k1,
  k2,
  iso = TRUE,
  sig = NULL,
  SigInv = NULL,
  dist = NULL
)

Arguments

X

n by p matrix containing numeric data.

link

String selecting the linkage. Supported options are "single", "average", "centroid", "ward.D", "median", and "mcquitty".

hcl

Object of the type hclust containing the hierarchical clustering of X.

K

Integer selecting the total number of clusters.

k1, k2

Integers selecting the clusters to test, as indexed by the results of cutree(hcl, K).

iso

Boolean. If TRUE, an isotropic covariance matrix model (Cov(X_i) = \sigma^2 I_p) is assumed; if FALSE, a general covariance matrix \Sigma is assumed.

sig

Optional scalar specifying \sigma, relevant if iso is TRUE.

SigInv

Optional matrix specifying \Sigma^{-1}, relevant if iso is FALSE.

dist

Optional. The SQUARED Euclidean distances between the rows of X, i.e. the distance object used by hclust.
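The cluster numbering that k1 and k2 refer to can be inspected before testing. The following is a minimal sketch using only base R (stats); the variable names labels, cluster_sizes, k1_rows and k2_rows are illustrative and not part of the package.

```r
# Simulated data (same recipe as in the Examples section)
set.seed(123)
dat <- rbind(c(-1, 0), c(0, sqrt(3)), c(1, 0))[rep(1:3, length.out = 100), ] +
  matrix(0.2 * rnorm(200), 100, 2)

# Average linkage on SQUARED Euclidean distances
hcl <- hclust(dist(dat, method = "euclidean")^2, method = "average")

# cutree() returns one integer label per row of dat, with values in 1..K
K <- 3
labels <- cutree(hcl, k = K)
cluster_sizes <- table(labels)
print(cluster_sizes)

# Rows that k1 = 1 and k2 = 2 would refer to in a call with K = 3
k1_rows <- which(labels == 1)
k2_rows <- which(labels == 2)
```

Checking the label table first helps avoid testing a pair of clusters other than the one intended, since the numbering depends on K.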

Details

In order to account for the fact that the clusters have been estimated from the data, the p-values are computed conditional on the fact that those clusters were estimated. This function computes p-values exactly via an analytic characterization of the conditioning set.
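For orientation, the p-value in the isotropic case can be sketched as below. This follows the cited reference rather than anything stated on this page, and the notation is introduced here for illustration only: C_{k1} and C_{k2} are the two clusters, \bar{x}_C is the mean of the observations in cluster C, |C| is its size, S is the conditioning set, and F(t; c, S) denotes the CDF of c times a \chi_p random variable truncated to S.

```latex
p = 1 - F\!\left(
      \left\lVert \bar{x}_{C_{k1}} - \bar{x}_{C_{k2}} \right\rVert_2 ;\;
      \sigma \sqrt{\frac{1}{|C_{k1}|} + \frac{1}{|C_{k2}|}} ,\;
      S
    \right)
```

Under this reading, the first argument of F corresponds to the stat value and S to the trunc value described under Value.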

Currently, this function supports SQUARED Euclidean distance as the measure of dissimilarity between observations, and the following six linkages: single, average, centroid, Ward, McQuitty (also known as WPGMA), and median (also known as WPGMC).

By default, this function assumes that the covariance matrix of the features is isotropic, i.e., Cov(X_i) = \sigma^2 I_p. Setting iso to FALSE instead assumes that Cov(X_i) = \Sigma. If known, \sigma can be passed in using the sig argument, or \Sigma^{-1} can be passed in using the SigInv argument; otherwise, an estimate of \sigma or \Sigma will be used.
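The two covariance models can be compared on simulated data where the noise level is known. This is a sketch, not a recommended workflow: sigma_true and SigInv_true are illustrative names, and the calls to the package are guarded so the snippet also runs where clusterpval is not installed.

```r
set.seed(123)
sigma_true <- 0.2  # noise standard deviation used to simulate the data
dat <- rbind(c(-1, 0), c(0, sqrt(3)), c(1, 0))[rep(1:3, length.out = 100), ] +
  matrix(sigma_true * rnorm(200), 100, 2)
hcl <- hclust(dist(dat, method = "euclidean")^2, method = "average")

# The same isotropic model written as an inverse covariance matrix
SigInv_true <- diag(ncol(dat)) / sigma_true^2

if (requireNamespace("clusterpval", quietly = TRUE)) {
  # Isotropic model with known sigma
  res_iso <- clusterpval::test_hier_clusters_exact(
    X = dat, link = "average", hcl = hcl, K = 3, k1 = 1, k2 = 2,
    iso = TRUE, sig = sigma_true
  )
  # General covariance model with known inverse covariance
  res_gen <- clusterpval::test_hier_clusters_exact(
    X = dat, link = "average", hcl = hcl, K = 3, k1 = 1, k2 = 2,
    iso = FALSE, SigInv = SigInv_true
  )
  print(c(iso = res_iso$pval, general = res_gen$pval))
}
```

Omitting sig (or SigInv) makes the function fall back to an estimate, as described above.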

Passing the SQUARED Euclidean distance object used by hclust via the optional dist argument improves computational efficiency for all linkages except single linkage. The speed-up may not be noticeable on small data sets but can be major on large data sets. Thank you to Jesko Wagner for suggesting and implementing this change.
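A minimal sketch of reusing the distance object: the squared distances are computed once, given to hclust, and then passed on through dist. The name sq_dist is illustrative, and the package call is guarded in case clusterpval is not installed.

```r
set.seed(123)
dat <- rbind(c(-1, 0), c(0, sqrt(3)), c(1, 0))[rep(1:3, length.out = 100), ] +
  matrix(0.2 * rnorm(200), 100, 2)

# Compute the SQUARED Euclidean distances once and reuse them
sq_dist <- dist(dat, method = "euclidean")^2
hcl <- hclust(sq_dist, method = "average")

if (requireNamespace("clusterpval", quietly = TRUE)) {
  res <- clusterpval::test_hier_clusters_exact(
    X = dat, link = "average", hcl = hcl, K = 3, k1 = 1, k2 = 2,
    dist = sq_dist  # same object that was given to hclust
  )
  print(res$pval)
}
```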

Value

stat

the test statistic: the Euclidean distance between the mean of cluster k1 and the mean of cluster k2

pval

the p-value

trunc

object of the type Intervals containing the conditioning set
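As a sanity check on the stat value, the distance between the two cluster means can be computed by hand with base R. This sketch follows the description above for the default setting; mean_k1, mean_k2 and stat_by_hand are illustrative names.

```r
set.seed(123)
dat <- rbind(c(-1, 0), c(0, sqrt(3)), c(1, 0))[rep(1:3, length.out = 100), ] +
  matrix(0.2 * rnorm(200), 100, 2)
hcl <- hclust(dist(dat, method = "euclidean")^2, method = "average")

# Cluster labels as indexed by cutree(hcl, K) with K = 3
labels <- cutree(hcl, k = 3)

# Column means of the observations in clusters k1 = 1 and k2 = 2
mean_k1 <- colMeans(dat[labels == 1, , drop = FALSE])
mean_k2 <- colMeans(dat[labels == 2, , drop = FALSE])

# Euclidean distance between the two cluster means
stat_by_hand <- sqrt(sum((mean_k1 - mean_k2)^2))
print(stat_by_hand)
```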

References

Lucy L. Gao et al. "Selective inference for hierarchical clustering".

See Also

rect_hier_clusters for visualizing clusters k1 and k2 in the dendrogram;

test_complete_hier_clusters_approx for approximate p-values for complete linkage hierarchical clustering;

test_clusters_approx for approximate p-values for a user-specified clustering function.

Examples

# Simulates a 100 x 2 data set with three clusters
set.seed(123)
dat <- rbind(c(-1, 0), c(0, sqrt(3)), c(1, 0))[rep(1:3, length.out = 100), ] +
  matrix(0.2 * rnorm(200), 100, 2)

# Average linkage hierarchical clustering
hcl <- hclust(dist(dat, method="euclidean")^2, method="average")

# plot dendrograms with the 1st and 2nd clusters (cut at the third split) 
# displayed in blue and orange 
plot(hcl)
rect_hier_clusters(hcl, k=3, which=1:2, border=c("blue", "orange"))

# tests for a difference in means between the blue and orange clusters
test_hier_clusters_exact(X=dat, link="average", hcl=hcl, K=3, k1=1, k2=2)


lucylgao/clusterpval documentation built on July 4, 2023, 4:40 p.m.