test_complete_hier_clusters_approx: Monte Carlo significance test for complete linkage hierarchical clustering

View source: R/trunc_inf.R

test_complete_hier_clusters_approx {clusterpval}    R Documentation

Monte Carlo significance test for complete linkage hierarchical clustering

Description

This tests the null hypothesis of no difference in means between clusters k1 and k2 at level K in a complete linkage hierarchical clustering. (The K clusters are numbered as per the results of the cutree function in the stats package.)

Usage

test_complete_hier_clusters_approx(
  X,
  hcl,
  K,
  k1,
  k2,
  iso = TRUE,
  sig = NULL,
  SigInv = NULL,
  ndraws = 2000
)

Arguments

X

n by p matrix containing numeric data.

hcl

An object of the type hclust containing the hierarchical clustering of X.

K

Integer selecting the total number of clusters.

k1, k2

Integers selecting the clusters to test.

iso

Boolean. If TRUE, an isotropic covariance matrix model (Cov(X_i) = \sigma^2 I_p) is assumed; if FALSE, a general covariance matrix model (Cov(X_i) = \Sigma) is assumed.

sig

Optional scalar specifying \sigma, relevant if iso is TRUE.

SigInv

Optional matrix specifying \Sigma^{-1}, relevant if iso is FALSE.

ndraws

Integer selecting the number of importance samples, default of 2000.

Details

Important note: Before calling hclust and this function, make sure to load the package fastcluster. This is because the p-value approximation procedure requires running hierarchical clustering on a large number of simulated data sets, and the version of hclust in the fastcluster package is much faster than the version of hclust in stats.
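As a rough, standalone illustration of this speed difference (not part of the package), the two hclust implementations can be timed on the same simulated dissimilarity matrix; the data size below is chosen arbitrarily:

## Illustrative timing comparison of stats::hclust vs. fastcluster::hclust
library(fastcluster)
set.seed(123)
Xbig <- matrix(rnorm(2000 * 2), 2000, 2)
dbig <- dist(Xbig, method = "euclidean")^2
system.time(stats::hclust(dbig, method = "complete"))
system.time(fastcluster::hclust(dbig, method = "complete"))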

To account for the fact that the clusters were estimated from the data, the p-values are computed conditional on the event that those clusters were estimated. This function approximates the resulting p-values via importance sampling.
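Because the p-value is approximated by simulation, its precision is controlled by ndraws: the reported standard error typically shrinks as ndraws grows, at the cost of clustering more simulated data sets. A minimal sketch (assuming, per the Value section below, that the result exposes stderr as a list component):

## Sketch: more importance samples give a more precise p-value estimate
library(fastcluster)
set.seed(1)
dat <- matrix(rnorm(200), 100, 2)
hcl <- hclust(dist(dat, method = "euclidean")^2, method = "complete")
res_small <- test_complete_hier_clusters_approx(X = dat, hcl = hcl, K = 3,
                                                k1 = 1, k2 = 2, ndraws = 500)
res_large <- test_complete_hier_clusters_approx(X = dat, hcl = hcl, K = 3,
                                                k1 = 1, k2 = 2, ndraws = 5000)
c(res_small$stderr, res_large$stderr)  # the second is typically smaller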

Currently, this function supports squared Euclidean distance as a measure of dissimilarity between observations. (Note that complete linkage is invariant under monotone transformations of the measure of dissimilarity between observations, so unsquared Euclidean distance would produce the same hierarchical clustering.)
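This invariance is easy to verify directly (a standalone sketch using only stats functions): clustering the same data with squared and unsquared Euclidean distances produces the same merges, and therefore the same cluster assignments.

## Sketch: complete linkage is invariant to monotone transformations of the dissimilarity
set.seed(1)
Xtoy <- matrix(rnorm(60), 30, 2)
h_sq    <- hclust(dist(Xtoy)^2, method = "complete")
h_plain <- hclust(dist(Xtoy),   method = "complete")
identical(cutree(h_sq, k = 4), cutree(h_plain, k = 4))  # expect TRUE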

By default, this function assumes that the covariance matrix of the features is isotropic, i.e. Cov(X_i) = \sigma^2 I_p. Setting iso to FALSE instead assumes that Cov(X_i) = \Sigma. If known, \sigma can be passed in using the sig argument, or \Sigma^{-1} can be passed in using the SigInv argument; otherwise, an estimate of \sigma or \Sigma will be used.
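For example (an illustrative sketch; the values of sig and SigInv below are arbitrary, not recommendations), known covariance information can be supplied directly rather than estimated:

## Sketch: supplying known covariance information (values are arbitrary)
library(fastcluster)
set.seed(1)
dat <- matrix(rnorm(200), 100, 2)
hcl <- hclust(dist(dat, method = "euclidean")^2, method = "complete")
## Isotropic model with a known sigma
test_complete_hier_clusters_approx(X = dat, hcl = hcl, K = 3, k1 = 1, k2 = 2,
                                   iso = TRUE, sig = 1, ndraws = 500)
## General model with a known precision matrix Sigma^{-1}
test_complete_hier_clusters_approx(X = dat, hcl = hcl, K = 3, k1 = 1, k2 = 2,
                                   iso = FALSE, SigInv = diag(2), ndraws = 500)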

Value

stat

the test statistic: the Euclidean distance between the mean of cluster k1 and the mean of cluster k2

pval

the approximate p-value

stderr

estimated standard error of the p-value estimate

References

Lucy L. Gao, Jacob Bien, and Daniela Witten. "Selective inference for hierarchical clustering." Journal of the American Statistical Association.

See Also

rect_hier_clusters for visualizing clusters k1 and k2 in the dendrogram;

test_hier_clusters_exact for exact p-values for hierarchical clustering with other linkages;

test_clusters_approx for approximate p-values for a user-specified clustering function.

Examples

# Simulate a 100 x 2 data set with no clusters
set.seed(1)
dat <- matrix(rnorm(200), 100, 2)

# Complete linkage hierarchical clustering
library(fastcluster)
hcl <- hclust(dist(dat, method="euclidean")^2, method="complete")

# Plot the dendrogram with the 1st and 2nd clusters (when the tree is cut
# to yield K = 3 clusters) displayed in blue and orange
plot(hcl)
rect_hier_clusters(hcl, k=3, which=1:2, border=c("blue", "orange"))

# Monte Carlo test for a difference in means between the blue and orange clusters
test_complete_hier_clusters_approx(X=dat, hcl=hcl, K=3, k1=1, k2=2, ndraws=1000)
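
# Storing the result makes its components easy to inspect; this assumes, per the
# Value section, that the return value is a list with elements stat, pval, and stderr
res <- test_complete_hier_clusters_approx(X=dat, hcl=hcl, K=3, k1=1, k2=2, ndraws=1000)
res$stat    # test statistic: distance between the two cluster means
res$pval    # approximate p-value
res$stderr  # Monte Carlo standard error of the p-value estimate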

