VALIDICLUST
(VALID Inference for CLUsters Separation Testing) is a
package for ensuring valid inference after data clustering. This problem
occurs when clustering forces differences between groups of observations
to build clusters. This leads to an inflation of the Type I error rate,
especially because the data is used twice: i) to build clusters, i.e.,
to form hypotheses, and ii) to perform the inference step. The 3 main
functions of the package are:
test_selective_inference()
following the work of Gao et
al. [1] on selective inference for clustering
merge_selective_inference()
for the merging selective test. In
this method, all adjacent p-values are merged using the harmonic
mean
test_multimod()
uses the DipTest [2] to investigate the
presence of a continuum between two estimated clusters for a given
variable
Install the development version from GitHub
remotes::install_github("benhvt/VALIDICLUST")
To illustrate our proposed method, we use the
palmerpenguins dataset
from the palmerpenguins
package. To ensure that there are no truly
separate groups of observations, we selected only the female penguins of
the Adelie species, yielding 73 observations. On this dataset, we apply
hierarchical clustering (on Euclidean distances with Ward linkage) to
build 2 clusters. In this negative control dataset, we know the true,
i.e., the non-existing separated group of observations. We therefore
apply our 3 tests to each of the 4 numerical measurements to test for
separation of the 2 clusters, and compare the resulting p-values to the
p-values of the classical t-test.
data("penguins")
data <- subset(penguins, species == "Adelie" & sex == "female")
data <- data[,3:6]
# Clustering function
hcl2 <- function(x){
x <- scale(x, center = T, scale = T)
distance <- dist(x, method = "euclidean")
cah <- hclust(distance, method = "ward.D2")
return(as.factor(cutree(cah, k=2)))
}
data$Cluster <- hcl2(data)
ggpairs(data, aes(colour = Cluster, fill = Cluster)) +
scale_colour_manual(values = c("#F1786D","#53888f")) +
scale_fill_manual(values = c("#F1786D","#53888f")) +
theme_minimal()
pval.inference.selective <- pval.merge <- pval.multimod <- pval.t.test <- rep(NA, 4)
for (i in 1:4){
pval.inference.selective[i] <- test_selective_inference(as.matrix(data[,1:4]),
k1=1,
k2=2,
g=i,
cl_fun = hcl2,
cl=data$Cluster)$pval
pval.merge[i] <- merge_selective_inference(as.matrix(data[,1:4]),
k1=1,
k2=2,
g=i,
cl_fun = hcl2,
cl=data$Cluster)$pval
pval.multimod[i] <- test_multimod(as.matrix(data[,1:4]),
g=i,
k1=1,
k2=2,
cl = data$Cluster)$pval
pval.t.test[i] <- t.test(data[data$Cluster == 1,i], data[data$Cluster==2, i])$p.value
}
pval.res <- rbind(pval.inference.selective,
pval.merge,
pval.multimod,
pval.t.test)
colnames(pval.res) <- gsub("_", x=colnames(data)[1:4], replacement = " ")
rownames(pval.res) <- c("Selective Inference",
"Merging Selective Inference",
"Multimoality test",
"T-test")
round(pval.res,3) %>% htmlTable::htmlTable()
bill length mm
bill depth mm
flipper length mm
body mass g
Selective Inference
0.156
0.716
0.484
0.438
Merging Selective Inference
0.153
0.76
0.492
0.48
Multimoality test
0.806
0.193
0.044
0.599
T-test
0.4
0
0
0
[1] Gao, L. L., Bien, J., & Witten, D. (2022). Selective inference for hierarchical clustering. Journal of the American Statistical Association, (just-accepted), 1-27.
[2] HARTIGAN, John A. et HARTIGAN, Pamela M. The dip test of unimodality. The annals of Statistics, 1985, p. 70-84.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.