test_complete_hier_clusters_approx | R Documentation |
This tests the null hypothesis of no difference in means between clusters k1 and k2 at level K in a complete linkage hierarchical clustering. (The K clusters are numbered as per the results of the cutree function in the stats package.)
test_complete_hier_clusters_approx(
X,
hcl,
K,
k1,
k2,
iso = TRUE,
sig = NULL,
SigInv = NULL,
ndraws = 2000
)
X |
|
hcl |
An object of the type |
K |
Integer selecting the total number of clusters. |
k1, k2 |
Integers selecting the clusters to test. |
iso |
Boolean. If |
sig |
Optional scalar specifying |
SigInv |
Optional matrix specifying |
ndraws |
Integer selecting the number of importance samples, default of 2000. |
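As an illustration of how K, k1 and k2 relate to cutree, the following minimal sketch uses only base R (stats::hclust rather than fastcluster) and the same simulated data as the examples on this page. The variable names labels, mean_k1, mean_k2 and mean_diff are introduced here for illustration only; the last quantity is a plain Euclidean distance between the two cluster means, computed by hand and not taken from the package.

```r
# Same simulated data as in the examples: 100 x 2, no true clusters
set.seed(1)
dat <- matrix(rnorm(200), 100, 2)

# Complete linkage on squared Euclidean distance
hcl <- stats::hclust(dist(dat, method = "euclidean")^2, method = "complete")

# Cut the dendrogram into K = 3 clusters; cutree returns one label per row
K <- 3
labels <- cutree(hcl, k = K)
table(labels)

# k1 and k2 refer to these labels, e.g. clusters 1 and 2
k1 <- 1
k2 <- 2
mean_k1 <- colMeans(dat[labels == k1, , drop = FALSE])
mean_k2 <- colMeans(dat[labels == k2, , drop = FALSE])

# Hand-computed Euclidean distance between the two cluster means
mean_diff <- sqrt(sum((mean_k1 - mean_k2)^2))
mean_diff
```

cutree numbers clusters in order of first appearance in the data, so the first observation always belongs to cluster 1.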
Important note: Before calling hclust and this function, make sure to load the package fastcluster. This is because the p-value approximation procedure requires running hierarchical clustering on a large number of simulated data sets, and the version of hclust in the fastcluster package is much faster than the version of hclust in stats.
In order to account for the fact that the clusters have been estimated from the data, the p-values are computed conditional on the fact that those clusters were estimated. This function approximates p-values via importance sampling.
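For readers unfamiliar with importance sampling, here is a generic sketch of the idea in base R. It is not the procedure this function uses: it estimates a simple Gaussian tail probability, and the proposal distribution, the shift value and all variable names are assumptions chosen for illustration. It does show how the number of draws (compare ndraws) yields both an estimate and a standard error (compare stderr).

```r
# Generic importance sampling sketch (NOT the package's procedure):
# estimate P(Z > 4) for Z ~ N(0, 1) using draws from a shifted proposal N(4, 1)
set.seed(1)
ndraws <- 2000
shift <- 4

# Draw from the proposal, which puts most of its mass near the rare region
x <- rnorm(ndraws, mean = shift)

# Importance weights: target density divided by proposal density
w <- dnorm(x) / dnorm(x, mean = shift)

# Weighted indicator of the event of interest
terms <- w * (x > 4)

est <- mean(terms)              # importance sampling estimate
se <- sd(terms) / sqrt(ndraws)  # estimated standard error of the estimate

# Closed-form value, available here only because the toy problem is simple
exact <- pnorm(4, lower.tail = FALSE)
c(estimate = est, stderr = se, exact = exact)
```

Increasing ndraws reduces the standard error at a rate of roughly one over the square root of the number of draws, at the cost of more computation.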
Currently, this function supports squared Euclidean distance as a measure of dissimilarity between observations. (Note that complete linkage is invariant under monotone transformations of the measure of dissimilarity between observations, so unsquared Euclidean distance would produce the same hierarchical clustering.)
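The invariance noted above can be checked empirically. The sketch below uses only base R (stats::hclust) and the simulated data from the examples; the variable names are illustrative. It clusters the same data on unsquared and squared Euclidean distance and compares the resulting partitions.

```r
# Same simulated data as in the examples
set.seed(1)
dat <- matrix(rnorm(200), 100, 2)
d <- dist(dat, method = "euclidean")

# Complete linkage on unsquared and on squared Euclidean distance
hcl_unsq <- stats::hclust(d, method = "complete")
hcl_sq <- stats::hclust(d^2, method = "complete")

# The cluster assignments at K = 3 should agree
labels_unsq <- cutree(hcl_unsq, k = 3)
labels_sq <- cutree(hcl_sq, k = 3)
same_partition <- identical(labels_unsq, labels_sq)

# Merge heights differ only by the squaring transformation
heights_match <- isTRUE(all.equal(hcl_unsq$height^2, hcl_sq$height))

same_partition
heights_match
```

Only the dendrogram heights change between the two fits, so the clusters being tested are the same either way. The hcl object passed to this function should still be built on squared Euclidean distance, as in the examples.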
By default, this function assumes that the covariance matrix of the features is isotropic, i.e. Cov(X_i) = \sigma^2 I_p. Setting iso to FALSE instead assumes that Cov(X_i) = \Sigma. If known, \sigma can be passed in using the sig argument, or \Sigma^{-1} can be passed in using the SigInv argument; otherwise, an estimate of \sigma or \Sigma will be used.
stat |
the test statistic: the Euclidean distance between the mean of cluster |
pval |
the approximate p-value |
stderr |
estimated standard error of the p-value estimate |
Lucy L. Gao et al. "Selective inference for hierarchical clustering".
rect_hier_clusters for visualizing clusters k1 and k2 in the dendrogram;
test_hier_clusters_exact for exact p-values for hierarchical clustering with other linkages;
test_clusters_approx for approximate p-values for a user-specified clustering function.
# Simulates a 100 x 2 data set with no clusters
set.seed(1)
dat <- matrix(rnorm(200), 100, 2)
# Complete linkage hierarchical clustering
library(fastcluster)
hcl <- hclust(dist(dat, method="euclidean")^2, method="complete")
# Plot the dendrogram with the 1st and 2nd clusters (when cut into 3 clusters)
# outlined in blue and orange
plot(hcl)
rect_hier_clusters(hcl, k=3, which=1:2, border=c("blue", "orange"))
# Monte Carlo test for a difference in means between the blue and orange clusters
test_complete_hier_clusters_approx(X=dat, hcl=hcl, K=3, k1=1, k2=2, ndraws=1000)