UNPaC_Copula: Unimodal Non-Parametric Cluster (UNPaC) Significance Test

View source: R/UNPaC_Copula.R

UNPaC_CopulaR Documentation

Unimodal Non-Parametric Cluster (UNPaC) Significance Test

Description

The UnPAC test assesses the significance of clusters by comparing the cluster index (CI) from the data to the CI from a ortho-unimodal reference data generated using a Gaussian copula. This method is described in Helgeson, Vock, and Bair (2021). The CI is defined to be the sum of the within-cluster sum of squares about the cluster means divided by the total sum of squares. Smaller values of the CI indicate a stronger clustering.

Usage

UNPaC_Copula(
  x,
  cluster,
  cluster.fun,
  nsim = 100,
  var_selection = FALSE,
  gamma = 0.1,
  p.adjust = "fdr",
  k = 2,
  rho = 0.02,
  cov = "glasso",
  center = TRUE,
  scale = FALSE
)

Arguments

x

a dataset with n observations (rows) and p features (columns)

cluster

labels generated by clustering method

cluster.fun

function used to cluster data. Function should return list containing a component "cluster." Examples include kmeans and pam.

nsim

a numeric value specifying the number of unimodal reference distributions used for testing (default=100)

var_selection

should dimension be reduced using feature filtering procedure? See description below. (default=FALSE)

gamma

threshold for feature filtering procedure. See description below. Not used if var_selection=FALSE (default=0.10)

p.adjust

p-value adjustment method for additional feature filtering. See p.adjust for options. (default="fdr"). Not used if p.adjust="none."

k

integer value specifying the number of clusters to test (default=2)

rho

a regularization parameter used in implementation of the graphical lasso. See documentation for lambda in huge. Not used if cov="est" or cov="banded"

cov

method used for approximating the covariance structure. options include: "glasso" (See huge), "banded" (See band.chol.cv) and "est" (default = "glasso")

center

should data be centered such that each feature has mean equal to zero prior to clustering (default=TRUE)

scale

should data be scaled such that each feature has variance equal to one prior to clustering (default=FALSE)

Details

There are three options for the covariance matrix used in generating the Gaussian copula: sample covariance estimation, cov="est", which should be used if n>p; the graphical lasso, cov="glasso", which should be used if n<p; and k-banded covariance, cov="banded", which can be used if n<p and it can be assumed that features farther away in the ordering have weaker covariance. The graphical lasso is implemented using the huge function. When cov="banded" is selected the k-banded covariance Cholesky factor of Rothman, Levina, and Zhu (2010) is used to estimate the covariance matrix. Cross-validation is used for selecting the banding parameter. See documentation in band.chol.cv.

In high dimensional (n<p) settings a dimension reduction step can be implemented which selects features based on an F-test for difference in means across clusters. Features having a p-value less than a threshold gamma are retained. For additional feature filtering a p-value adjustment procedure (such as p.adjust="fdr") can be used. If no features are retained the resulting p-value for the cluster significance test is given as 1.

Value

The function returns a list with the following components:

  • selected_features: A vector of integers indicating the features retained by the feature filtering process.

  • sim_CI: vector containing the cluster indices for each generated unimodal reference distribution

  • pvalue_emp: the empirical p-value: the proportion of times the cluster index from the reference data is smaller the cluster index from the observed data

  • pvalue_norm: the normalized p-value: the simulated p-value based on comparison to a standard normal distribution

Author(s)

Erika S. Helgeson, David Vock, Eric Bair

References

  • Helgeson, ES, Vock, DM, and Bair, E. (2021) “Nonparametric cluster significance testing with reference to a unimodal null distribution." Biometrics 77: 1215– 1226. < https://doi.org/10.1111/biom.13376 >

  • Rothman, A. J., Levina, E., and Zhu, J. (2010). “A new approach to Cholesky-based covariance regularization in high dimensions." Biometrika 97(3): 539-550.

Examples

# K-means example
test1 <- matrix(rnorm(100*50), nrow=100, ncol=50)
test1[1:30,1:50] <- rnorm(30*50, 2)
test.data<-scale(test1,scale=FALSE,center=TRUE)
cluster<-kmeans(test.data,2)$cluster
UNPaCResults <- UNPaC_Copula(test.data,cluster,kmeans, nsim=100,cov="est")

# Hierarchical clustering example
 
test <- matrix(nrow=1200, ncol=75)
theta <- rep(NA, 1200)
theta[1:500] <- runif(500, 0, pi)
theta[501:1200] <- runif(700, pi, 2*pi)
test[1:500,seq(from=2,to=50,by=2)] <- -2+5*sin(theta[1:500])
test[501:1200,seq(from=2,to=50,by=2)] <- 5*sin(theta[501:1200])
test[1:500,seq(from=1,to=49,by=2)] <- 5+5*cos(theta[1:500])
test[501:1200,seq(from=1,to=49,by=2)] <- 5*cos(theta[501:1200])
test[,1:50] <- test[,1:50] + rnorm(50*1200, 0, 0.2)
test[,51:75] <- rnorm(25*1200, 0, 1)
test.data<-scale(test,center=TRUE,scale=FALSE)
# Defining clustering function
hclustFunction<-function(x,k){
 D<-stats::dist(x)
 xn.hc <- hclust(D, method="single")
 list(cluster=cutree(xn.hc, k))}

cluster=hclustFunction(test.data,2)$cluster
UNPaCResults <- UNPaC_Copula(test.data,cluster,hclustFunction, nsim=100,cov="est")


UNPaC documentation built on June 10, 2022, 1:06 a.m.

Related to UNPaC_Copula in UNPaC...