test_clusters_approx: Monte Carlo significance test for any clustering method

View source: R/trunc_inf.R

test_clusters_approxR Documentation

Monte Carlo significance test for any clustering method

Description

This function performs a user-specified clustering method cl_fun on the rows of a data matrix to obtain K clusters, and tests the null hypothesis of no difference in means between clusters k1 and k2.

Usage

test_clusters_approx(
  X,
  k1,
  k2,
  iso = TRUE,
  sig = NULL,
  SigInv = NULL,
  ndraws = 2000,
  cl_fun,
  cl = NULL
)

Arguments

X

n by p matrix containing numeric data.

k1, k2

Integers selecting the clusters to test.

iso

Boolean. If TRUE, isotropic covariance matrix model, otherwise not.

sig

Optional scalar specifying \sigma, relevant if iso is TRUE.

SigInv

Optional matrix specifying \Sigma^{-1}, relevant if iso is FALSE.

ndraws

Integer selecting the number of importance samples, default of 2000.

cl_fun

Function returning assignments to clusters 1 through K.

cl

Optionally pass in the results of calling cl_fun on your data. This is for efficiency and reproducibility (when the clustering function is non-deterministic).

Details

In order to account for the fact that the clusters have been estimated from the data, the p-values are computed conditional on the fact that those clusters were estimated. This function approximates p-values via importance sampling.

This function assumes that cl_fun takes a n \times p numeric data matrix as input and outputs integer assignments to clusters 1 through K.

Thank you to August Guang for providing code to speed-up the function by parallelizing via the future package.

Value

stat

the test statistic: the Euclidean distance between the mean of cluster k1 and the mean of cluster k2

pval

the approximate p-value

stderr

standard error of the p-value estimate

clusters

the estimated cluster assignments

References

Lucy L. Gao et al. "Selective inference for hierarchical clustering".

Examples

# Simulates a 100 x 2 data set with three clusters
set.seed(123)
dat <- rbind(c(-1, 0), c(0, sqrt(3)), c(1, 0))[rep(1:3, length=100), ] + 
matrix(0.2*rnorm(200), 100, 2)

# Function to run k-means clustering w/ k = 3 and 50 random starts
km_cluster <- function(X) { 
 km <- kmeans(X, 3, nstart=50)
 return(km$cluster)
}

# Cluster data using k-means 
clusters <- km_cluster(dat)
table(rep(1:3, length=100), clusters)

# tests for a difference in means between clusters 1 and 2
# We pass in earlier k-means clustering results from earlier
results <- test_clusters_approx(dat, k1=1, k2=2, cl_fun=km_cluster, ndraws=500, cl=clusters)
results$stat
results$pval
results$stderr


lucylgao/clusterpval documentation built on July 4, 2023, 4:40 p.m.