test_hier_clusters_exact | R Documentation |
This tests the null hypothesis of no difference in means between clusters k1 and k2 at level K in a hierarchical clustering. (The K clusters are numbered as per the results of the cutree function in the stats package.)
test_hier_clusters_exact(
X,
link,
hcl,
K,
k1,
k2,
iso = TRUE,
sig = NULL,
SigInv = NULL,
dist = NULL
)
X |
Data matrix, with one observation per row and one feature per column. |
link |
String selecting the linkage. Supported options correspond to the six linkages listed below; for example, "average". |
hcl |
Object of the type hclust containing the hierarchical clustering of X. |
K |
Integer selecting the total number of clusters. |
k1, k2 |
Integers selecting the clusters to test, as indexed by the results of cutree. |
iso |
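Boolean. If TRUE (the default), an isotropic covariance matrix is assumed; if FALSE, a general covariance matrix is assumed. |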
sig |
Optional scalar specifying \sigma, used when iso is TRUE. |
SigInv |
Optional matrix specifying \Sigma^{-1}, used when iso is FALSE. |
dist |
The SQUARED Euclidean distances of matrix X |
In order to account for the fact that the clusters have been estimated from the data, the p-values are computed conditional on the fact that those clusters were estimated. This function computes p-values exactly via an analytic characterization of the conditioning set.
Currently, this function supports SQUARED Euclidean distance as a measure of dissimilarity between observations, and the following six linkages: single, average, centroid, Ward, McQuitty (also known as WPGMA), and median (also known as WPGMC).
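As a rough illustration of the setup described above, the following sketch builds a hierarchical clustering from squared Euclidean distances and shows how cutree numbers the resulting clusters. The toy data and object names are assumptions made for this example only, and it uses base R (stats) functions alone.

```r
# Toy data (illustrative only): 3 well-separated groups in 2 dimensions
set.seed(1)
centers <- rbind(c(-5, 0), c(0, 5), c(5, 0))
X <- centers[rep(1:3, each = 10), ] + matrix(rnorm(60, sd = 0.3), 30, 2)

# Dissimilarity: SQUARED Euclidean distance between observations
d2 <- dist(X, method = "euclidean")^2

# Average linkage, one of the six linkages listed above
hcl <- hclust(d2, method = "average")

# Cluster labels at level K = 3; cutree numbers clusters by the order
# in which their first observation appears in the data
labels <- cutree(hcl, k = 3)
table(labels)
```

The integers in labels are the values that k1 and k2 refer to.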
By default, this function assumes that the covariance matrix of the features is isotropic, i.e. Cov(X_i) = \sigma^2 I_p. Setting iso to FALSE instead assumes that Cov(X_i) = \Sigma. If known, \sigma can be passed in using the sig argument, or \Sigma^{-1} can be passed in using the SigInv argument; otherwise, an estimate of \sigma or \Sigma will be used.
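If you prefer to supply \sigma or \Sigma^{-1} yourself, values of the right shape can be computed in base R. The sketch below is only one possible approach: the pooled estimator of \sigma and the use of the sample covariance are assumptions chosen for illustration, not necessarily the estimates this function uses internally, and the data are simulated without any cluster structure.

```r
set.seed(123)
n <- 100
p <- 2
X <- matrix(rnorm(n * p, sd = 0.2), n, p)

# Isotropic model: one possible estimate of sigma, pooling the
# column-centered entries (an assumed estimator, for illustration)
Xc <- scale(X, center = TRUE, scale = FALSE)
sig_hat <- sqrt(sum(Xc^2) / (n * p - p))

# General model: sample covariance matrix and its inverse
Sigma_hat <- cov(X)
SigInv_hat <- solve(Sigma_hat)

# These could then be supplied as, e.g.:
# test_hier_clusters_exact(X, link, hcl, K, k1, k2, sig = sig_hat)
# test_hier_clusters_exact(X, link, hcl, K, k1, k2, iso = FALSE,
#                          SigInv = SigInv_hat)
```

Note that sig expects a scalar on the scale of \sigma (not \sigma^2), while SigInv expects a p x p matrix.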
Note that passing the SQUARED Euclidean distance object used by hclust via the optional dist argument improves computational efficiency for all linkages except single linkage. This may not lead to noticeable speed-ups in small data sets but leads to major speed-ups in large data sets. Thank you to Jesko Wagner for suggesting and implementing this change.
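A minimal sketch of this pattern: compute the squared distances once, use them for hclust, and pass the same object through dist. The simulated data are an assumption for illustration, and the call to test_hier_clusters_exact is guarded so that the snippet also runs when the function is not loaded.

```r
set.seed(123)
dat <- matrix(rnorm(200), 100, 2)

# Compute the squared Euclidean distances once ...
d2 <- dist(dat, method = "euclidean")^2

# ... use them for the clustering ...
hcl <- hclust(d2, method = "average")

# ... and hand the same object to the test via the dist argument
if (exists("test_hier_clusters_exact")) {
  res <- test_hier_clusters_exact(X = dat, link = "average", hcl = hcl,
                                  K = 3, k1 = 1, k2 = 2, dist = d2)
  print(res$pval)
}
```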
stat |
the test statistic: the Euclidean distance between the mean of cluster k1 and the mean of cluster k2 |
pval |
the p-value |
trunc |
object of the type |
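To make the test statistic concrete, the following sketch computes the Euclidean distance between two cluster means by hand, using base R only. It reuses the simulated data from the example on this page; the variable names are illustrative, and the comparison with the returned stat is an expectation based on the description above rather than a verified result.

```r
set.seed(123)
dat <- rbind(c(-1, 0), c(0, sqrt(3)), c(1, 0))[rep(1:3, length = 100), ] +
  matrix(0.2 * rnorm(200), 100, 2)
hcl <- hclust(dist(dat, method = "euclidean")^2, method = "average")

# Cluster labels at level K = 3, as numbered by cutree
cl <- cutree(hcl, k = 3)

# Means of clusters k1 = 1 and k2 = 2
mean1 <- colMeans(dat[cl == 1, , drop = FALSE])
mean2 <- colMeans(dat[cl == 2, , drop = FALSE])

# Euclidean distance between the two cluster means
stat_by_hand <- sqrt(sum((mean1 - mean2)^2))
stat_by_hand
```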
Lucy L. Gao et al. "Selective inference for hierarchical clustering".
rect_hier_clusters for visualizing clusters k1 and k2 in the dendrogram;
test_complete_hier_clusters_approx for approximate p-values for complete linkage hierarchical clustering;
test_clusters_approx for approximate p-values for a user-specified clustering function.
# Simulates a 100 x 2 data set with three clusters
set.seed(123)
dat <- rbind(c(-1, 0), c(0, sqrt(3)), c(1, 0))[rep(1:3, length=100), ] +
matrix(0.2*rnorm(200), 100, 2)
# Average linkage hierarchical clustering
hcl <- hclust(dist(dat, method="euclidean")^2, method="average")
# plot dendrograms with the 1st and 2nd clusters (cut at the third split)
# displayed in blue and orange
plot(hcl)
rect_hier_clusters(hcl, k=3, which=1:2, border=c("blue", "orange"))
# tests for a difference in means between the blue and orange clusters
test_hier_clusters_exact(X=dat, link="average", hcl=hcl, K=3, k1=1, k2=2)