| test_complete_hier_clusters_approx | R Documentation |
This tests the null hypothesis of no difference in means between
clusters k1 and k2 at level K in a complete
linkage hierarchical clustering. (The K clusters are numbered as per
the results of the cutree function in the stats package.)
test_complete_hier_clusters_approx(
X,
hcl,
K,
k1,
k2,
iso = TRUE,
sig = NULL,
SigInv = NULL,
ndraws = 2000
)
X |
|
hcl |
An object of the type |
K |
Integer selecting the total number of clusters. |
k1, k2 |
Integers selecting the clusters to test. |
iso |
Boolean. If |
sig |
Optional scalar specifying |
SigInv |
Optional matrix specifying |
ndraws |
Integer selecting the number of importance samples, default of 2000. |
Important note: Before calling hclust and this function, make sure to
load the package fastcluster. This is because the p-value approximation
procedure requires running hierarchical clustering on a large number of simulated
data sets, and the version of hclust in the fastcluster package
is much faster than the version of hclust in stats.
In order to account for the fact that the clusters have been estimated from the data, the p-values are computed conditional on the fact that those clusters were estimated. This function approximates p-values via importance sampling.
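As a rough illustration of what importance sampling with `ndraws` samples and an accompanying standard error looks like, the sketch below estimates a small normal tail probability. It is a generic toy problem, not the procedure this function uses internally; the target event, the proposal distribution, and the names `shift`, `est`, and `se` are all assumptions made for the example.

```r
# Generic importance-sampling sketch (NOT the package's internal procedure).
set.seed(1)
ndraws <- 2000   # same value as the default of the ndraws argument
shift  <- 4      # hypothetical proposal mean, chosen for this toy problem

# Draw from the proposal N(shift, 1), which puts mass in the tail of interest
x <- rnorm(ndraws, mean = shift)

# Importance weights: target density N(0, 1) over proposal density N(shift, 1)
w <- dnorm(x) / dnorm(x, mean = shift)

# Weighted indicator of the tail event {Z > 4}
h <- w * (x > 4)

est   <- mean(h)                 # importance-sampling estimate of P(Z > 4)
se    <- sd(h) / sqrt(ndraws)    # Monte Carlo standard error of that estimate
truth <- pnorm(4, lower.tail = FALSE)

print(c(estimate = est, std_error = se, truth = truth))
```

Increasing `ndraws` shrinks the standard error at rate one over the square root of the number of draws, which is the usual trade-off between run time and precision for this kind of approximation.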
Currently, this function supports squared Euclidean distance as a measure of dissimilarity between observations. (Note that complete linkage is invariant under monotone transformations of the measure of dissimilarity between observations, so unsquared Euclidean distance would produce the same hierarchical clustering.)
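The invariance noted above can be checked directly. The following sketch uses only `hclust` and `cutree` from stats on simulated data (the data set and the choice of three clusters are assumptions for illustration) and compares the trees built from squared and unsquared Euclidean distances.

```r
# Compare complete linkage on squared vs. unsquared Euclidean distances.
set.seed(1)
dat <- matrix(rnorm(200), 100, 2)
d   <- dist(dat, method = "euclidean")

hcl_sq   <- stats::hclust(d^2, method = "complete")  # squared distances
hcl_unsq <- stats::hclust(d,   method = "complete")  # unsquared distances

# Same sequence of merges, and hence the same cluster labels at any level
same_merges   <- identical(hcl_sq$merge, hcl_unsq$merge)
same_clusters <- identical(cutree(hcl_sq, k = 3), cutree(hcl_unsq, k = 3))

# Only the merge heights differ: they are squared along with the distances
heights_match <- isTRUE(all.equal(hcl_sq$height, hcl_unsq$height^2))

print(c(same_merges = same_merges,
        same_clusters = same_clusters,
        heights_match = heights_match))
```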
By default, this function assumes that the covariance matrix of the features is isotropic,
i.e. Cov(X_i) = \sigma^2 I_p. Setting iso to FALSE instead assumes that
Cov(X_i) = \Sigma. If known, \sigma can be passed in using the sig argument
or \Sigma^{-1} can be passed in the SigInv argument; otherwise, an
estimate of \sigma or \Sigma will be used.
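When no known value is available but an explicit plug-in is preferred, the quantities for `sig` and `SigInv` can be computed beforehand. The sketch below shows one plausible choice of estimates using base R; the particular estimators and the names `sig_hat` and `SigInv_hat` are assumptions for illustration and may differ from what the function computes by default.

```r
# Plug-in estimates that could be passed as sig or SigInv (illustrative only).
set.seed(1)
dat <- matrix(rnorm(200), 100, 2)
n <- nrow(dat)
p <- ncol(dat)

# Isotropic case: pooled estimate of sigma from column-centred data
centred <- scale(dat, center = TRUE, scale = FALSE)
sig_hat <- sqrt(sum(centred^2) / (n * p - p))

# General case: inverse of the sample covariance matrix
SigInv_hat <- solve(cov(dat))

print(sig_hat)
print(SigInv_hat)

# These could then be supplied explicitly (not run here; hcl as in the examples):
# test_complete_hier_clusters_approx(X = dat, hcl = hcl, K = 3, k1 = 1, k2 = 2,
#                                    sig = sig_hat)
# test_complete_hier_clusters_approx(X = dat, hcl = hcl, K = 3, k1 = 1, k2 = 2,
#                                    iso = FALSE, SigInv = SigInv_hat)
```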
stat |
the test statistic: the Euclidean distance between the mean of cluster |
pval |
the approximate p-value |
stderr |
estimated standard error of the p-value estimate |
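For orientation, a distance between cluster means of the kind reported in `stat` can be reproduced by hand from the `cutree` labels. This is a minimal sketch on simulated data, using `hclust` from stats so that it runs without extra packages; the name `stat_by_hand` is hypothetical, and the result is intended to mirror the description of `stat` rather than to document the function's internals.

```r
# Distance between the means of clusters k1 and k2, computed by hand.
set.seed(1)
dat <- matrix(rnorm(200), 100, 2)
hcl <- stats::hclust(dist(dat, method = "euclidean")^2, method = "complete")

# Cluster labels at level K = 3, numbered as cutree numbers them
labels <- cutree(hcl, k = 3)
k1 <- 1
k2 <- 2

# Feature-wise means of the two clusters being compared
mean_k1 <- colMeans(dat[labels == k1, , drop = FALSE])
mean_k2 <- colMeans(dat[labels == k2, , drop = FALSE])

# Euclidean distance between the two cluster means
stat_by_hand <- sqrt(sum((mean_k1 - mean_k2)^2))
print(stat_by_hand)
```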
Lucy L. Gao et al. "Selective inference for hierarchical clustering".
rect_hier_clusters for visualizing clusters k1 and k2 in the dendrogram;
test_hier_clusters_exact for exact p-values for hierarchical clustering with other linkages;
test_clusters_approx for approximate p-values for a user-specified clustering function.
# Simulates a 100 x 2 data set with no clusters
set.seed(1)
dat <- matrix(rnorm(200), 100, 2)
# Complete linkage hierarchical clustering
library(fastcluster)
hcl <- hclust(dist(dat, method="euclidean")^2, method="complete")
# plot dendrograms with the 1st and 2nd clusters (cut at the third level)
# displayed in blue and orange
plot(hcl)
rect_hier_clusters(hcl, k=3, which=1:2, border=c("blue", "orange"))
# Monte Carlo test for a difference in means between the blue and orange clusters
test_complete_hier_clusters_approx(X=dat, hcl=hcl, K=3, k1=1, k2=2, ndraws=1000)