shc: statistical Significance for Hierarchical Clustering (SHC)

Description Usage Arguments Details Value Author(s) References See Also Examples


Implements the Monte Carlo simulation based significance testing procedure for hierarchical clustering described in Kimes et al. (Biometrics, 2017). Statistical significance is evaluated at each node along the hierarchical tree (dendrogram) starting from the root using a Gaussian null hypothesis test. A corresponding family-wise error rate (FWER) controlling procedure is provided.


shc(x, metric = "euclidean", vecmet = NULL, matmet = NULL,
  linkage = "ward.D2", l = 2, alpha = 1, icovest = 1,
  bkgd_pca = FALSE, n_sim = 100, n_min = 10, rcpp = FALSE,
  ci = "2CI", null_alg = "hclust", ci_idx = 1, ci_emp = FALSE)



a dataset with n rows and p columns, with observations in rows and features in columns.


a string specifying the metric to be used in the hierarchical clustering procedure. This must be a metric accepted by dist or "cor" (to specify 1 - Pearson correlation). (default = "euclidean")


a function taking two vectors as input and returning a real-valued number which specifies how dissimilarities should be computed between rows of the data matrix (x). If non-NULL, will take precedence over metric. Only one of vecmet or matmet can be specified. (default = NULL)


a function taking a matrix as input and returning an object of class dist representing dissimilarities between rows of the data matrix (x). If non-NULL, will take precedence over metric. Only one of vecmet or matmet can be specified. (default = NULL)


a string specifying the linkage to be used in the hierarchical clustering procedure. This must be a linkage accepted by Rclusterpp.hclust if rcpp=TRUE, e.g. "ward", or stats::hclust if rcpp=FALSE, e.g. "ward.D2". (default = "ward.D2")


an integer value specifying the power of the Minkowski distance, if used. (default = 2)


a value between 0 and 1 specifying the desired level of the test. If no FWER control is desired, simply set alpha to 1. (default = 1)


an integer (1, 2 or 3) specifying the null covariance estimation method to be used. See null_eigval for more details. (default = 1)


a logical value specifying whether to use scaled PCA scores over raw data to estimate the background noise. When FALSE, raw estimate is used; when TRUE, minimum of PCA and raw estimates is used. (default = FALSE)


a numeric value specifying the number of simulations at each node for Monte Carlo testing. (default = 100)


an integer specifying the minimum number of observations needed to calculate a p-value. (default = 10)


a logical value whether to use the Rclusterpp package. Will only be used if the Rclusterpp package is available. (default = FALSE)


a string vector specifying the cluster indices to be used for testing along the dendrogram. Currently, options include: "2CI", "linkage". (default = "2CI")


a string vector specifying the clustering algorithms that should be used to cluster the simulated null ditributions. Currently, options include: "2means" and "hclust". While "hclust" typically provides greater power for rotation invariant combinations of metric and linkage function, if a non-rotation invariant metric, e.g. Pearson correlation, is used it is recommended that the "2means" option is specified. Note, null_alg and ci must be of equal length. (default = "hclust")


an integer between 1 and length(ci) specifiying which CI to use for the FWER stopping rule. This only has an effect if alpha is specified to a non-default value. (default = 1)


a logical value specifying whether to use the empirical p-value from the CI based on ci_idx for the FWER stopping rule. As with ci_idx, this only has an effect if alpha is specified to a non-default value. (default = FALSE)


When possible, the Rclusterpp should be used for hierarchical clustering by specifying rcpp = TRUE, except in the case of clustering by Pearson correlation (dist="cor"), for which we make use of WGCNA::cor with the usual stats::hclust.

The Rclusterpp package is no longer available on CRAN and must be installed from GitHub using devtools::install_github("nolanlab/Rclusterpp").

For standard minimum variance Ward's linkage clustering, if rcpp is TRUE, specify "ward" for linkage. However, if rcpp is FALSE, then "ward.D2" should be specified. See stats::hclust for details on changes to the function since R >= 3.0.4.

By default, p-values are computed for all nodes (alpha = 1). If the procedure should instead terminate when no further nodes meet a desired FWER control threshold, simply specify alpha < 1. Controlling the FWER may substantially speed up the procedure by reducing the number of tests considered.

Even if the procedure is run using alpha = 1, the FWER cutoffs may still be computed using fwer_cutoff. The FWER thresholding can also be visualized by specifying alpha when calling plot.

The input metric can either be a character string specifying a metric recognized by dist() or "cor" for Pearson correlation. Alternatively, a user-defined dissimilarity function can be specified suing either the matmet or vecmet parameter. If specified, matmet should be a function which takes a matrix as input and returns an object of class dist from the columns of the matrix. If specified, vecmet must be a function which computes a real-valued dissimilarity value between two input vectors. This function will be used to compute the dissimilarilty between columns of the data matrix. If either matmet or vecmet is specified, metric will be ignored. If both matmet and vecmet are specified, the function exit with an error. Examples using metric, matmet, and vecmet are provided below.


The function returns a shc S3-object containing the resulting p-values. The plot method can be used to generate a dendrogram with the corresponding p-values placed at each merge. The shc object has following attributes:


Patrick Kimes


See Also

plot-shc diagnostic


## using a string input to metric
data <- rbind(matrix(rnorm(100, mean = 2), ncol = 2),
              matrix(rnorm(100, mean = -1), ncol = 2))
shc_metric <- shc(data, metric = "cor", linkage = "average", alpha=0.05)
tail(shc_metric$p_norm, 10)

## using a function input to vecmet or any function not optimized for 
## computing dissimilarities for matrices will be incredibly slow
## should be avoided if possible
## Not run: 
vfun <- function(x, y) { 1 - cor(x, y) }
shc_vecmet <- shc(data, vecmet = vfun, linkage = "average", alpha=0.05)
tail(shc_vecmet$p_norm, 10)

## End(Not run)

## using a function input to matmet
mfun <- function(x) { as.dist(1-cor(x)) }
shc_mfun <- shc(data, matmet=mfun, linkage="average", alpha=0.05)
tail(shc_mfun$p_norm, 10)

pkimes/sigclust2 documentation built on May 25, 2019, 8:20 a.m.