shc: statistical Significance for Hierarchical Clustering (SHC)
In pkimes/sigclust2: sigclust2: Statistical Significance of Clustering

Description Usage Arguments Details Value Author(s) References See Also Examples

Implements the Monte Carlo simulation based significance testing procedure for hierarchical clustering described in Kimes et al. (Biometrics, 2017). Statistical significance is evaluated at each node along the hierarchical tree (dendrogram) starting from the root using a Gaussian null hypothesis test. A corresponding family-wise error rate (FWER) controlling procedure is provided.

shc(x, metric = "euclidean", vecmet = NULL, matmet = NULL,
  linkage = "ward.D2", l = 2, alpha = 1, icovest = 1,
  bkgd_pca = FALSE, n_sim = 100, n_min = 10, rcpp = FALSE,
  ci = "2CI", null_alg = "hclust", ci_idx = 1, ci_emp = FALSE)

`x`	a dataset with n rows and p columns, with observations in rows and features in columns.
`metric`	a string specifying the metric to be used in the hierarchical clustering procedure. This must be a metric accepted by `dist` or "cor" (to specify 1 - Pearson correlation). (default = "euclidean")
`vecmet`	a function taking two vectors as input and returning a real-valued number which specifies how dissimilarities should be computed between rows of the data matrix (`x`). If non-NULL, will take precedence over `metric`. Only one of `vecmet` or `matmet` can be specified. (default = NULL)
`matmet`	a function taking a matrix as input and returning an object of class `dist` representing dissimilarities between rows of the data matrix (`x`). If non-NULL, will take precedence over `metric`. Only one of `vecmet` or `matmet` can be specified. (default = NULL)
`linkage`	a string specifying the linkage to be used in the hierarchical clustering procedure. This must be a linkage accepted by `Rclusterpp.hclust` if `rcpp=TRUE`, e.g. "ward", or `stats::hclust` if `rcpp=FALSE`, e.g. "ward.D2". (default = "ward.D2")
`l`	an integer value specifying the power of the Minkowski distance, if used. (default = 2)
`alpha`	a value between 0 and 1 specifying the desired level of the test. If no FWER control is desired, simply set alpha to 1. (default = 1)
`icovest`	an integer (1, 2 or 3) specifying the null covariance estimation method to be used. See `null_eigval` for more details. (default = 1)
`bkgd_pca`	a logical value specifying whether to use scaled PCA scores over raw data to estimate the background noise. When FALSE, raw estimate is used; when TRUE, minimum of PCA and raw estimates is used. (default = FALSE)
`n_sim`	a numeric value specifying the number of simulations at each node for Monte Carlo testing. (default = 100)
`n_min`	an integer specifying the minimum number of observations needed to calculate a p-value. (default = 10)
`rcpp`	a logical value whether to use the `Rclusterpp` package. Will only be used if the `Rclusterpp` package is available. (default = FALSE)
`ci`	a string vector specifying the cluster indices to be used for testing along the dendrogram. Currently, options include: "2CI", "linkage". (default = "2CI")
`null_alg`	a string vector specifying the clustering algorithms that should be used to cluster the simulated null ditributions. Currently, options include: "2means" and "hclust". While "hclust" typically provides greater power for rotation invariant combinations of metric and linkage function, if a non-rotation invariant metric, e.g. Pearson correlation, is used it is recommended that the "2means" option is specified. Note, `null_alg` and `ci` must be of equal length. (default = "hclust")
`ci_idx`	an integer between 1 and `length(ci)` specifiying which CI to use for the FWER stopping rule. This only has an effect if `alpha` is specified to a non-default value. (default = 1)
`ci_emp`	a logical value specifying whether to use the empirical p-value from the CI based on `ci_idx` for the FWER stopping rule. As with `ci_idx`, this only has an effect if `alpha` is specified to a non-default value. (default = FALSE)

When possible, the Rclusterpp should be used for hierarchical clustering by specifying rcpp = TRUE, except in the case of clustering by Pearson correlation (dist="cor"), for which we make use of WGCNA::cor with the usual stats::hclust.

The Rclusterpp package is no longer available on CRAN and must be installed from GitHub using devtools::install_github("nolanlab/Rclusterpp").

For standard minimum variance Ward's linkage clustering, if rcpp is TRUE, specify "ward" for linkage. However, if rcpp is FALSE, then "ward.D2" should be specified. See stats::hclust for details on changes to the function since R >= 3.0.4.

By default, p-values are computed for all nodes (alpha = 1). If the procedure should instead terminate when no further nodes meet a desired FWER control threshold, simply specify alpha < 1. Controlling the FWER may substantially speed up the procedure by reducing the number of tests considered.

Even if the procedure is run using alpha = 1, the FWER cutoffs may still be computed using fwer_cutoff. The FWER thresholding can also be visualized by specifying alpha when calling plot.

The input metric can either be a character string specifying a metric recognized by dist() or "cor" for Pearson correlation. Alternatively, a user-defined dissimilarity function can be specified suing either the matmet or vecmet parameter. If specified, matmet should be a function which takes a matrix as input and returns an object of class dist from the columns of the matrix. If specified, vecmet must be a function which computes a real-valued dissimilarity value between two input vectors. This function will be used to compute the dissimilarilty between columns of the data matrix. If either matmet or vecmet is specified, metric will be ignored. If both matmet and vecmet are specified, the function exit with an error. Examples using metric, matmet, and vecmet are provided below.

The function returns a shc S3-object containing the resulting p-values. The plot method can be used to generate a dendrogram with the corresponding p-values placed at each merge. The shc object has following attributes:

in_mat: the original data matrix passed to the constructor
in_args: a list of the original parameters passed to the constructor
eigval_dat: a matrix with each row containing the sample eigenvalues for a subtree tested along the dendrogram
eigval_sim: a matrix with each row containing the estimated eigenvalues used to simulate null data at a subtree tested along the dendrogram
backvar: a vector containing the estimated background variances used for computing eigval_sim
nd_type: a vector of length n-1 taking values in "n_small", "no_test", "sig", "not_sig", or "NA" specifying how each node along the dendrogram was handled by the iterative testing procedure
ci_dat: a matrix containing the cluster indices for the original data matrix passed to the constructor
ci_sim: a 3-dimensional array containing the simulated cluster indices at each subtree tested along the dendrogram
p_emp: a matrix containing the emprical p-values computed at each subtree tested along the dendrogram
p_norm: a matrix containing the Gaussian approximate p-values computed at each subtree tested along the dendrogram
idx_hc: a list of tuples containing the indices of clusters joined at each step of the hierarchical clustering procedure
hc_dat: a hclust object constructed from the original data matrix and arguments passed to the constructor

Patrick Kimes

Kimes, P. K., Hayes, D. N., Liu Y., and Marron, J. S. (2016) Statistical significance for hierarchical clustering. pre-print available.

plot-shc diagnostic

## using a string input to metric
data <- rbind(matrix(rnorm(100, mean = 2), ncol = 2),
              matrix(rnorm(100, mean = -1), ncol = 2))
shc_metric <- shc(data, metric = "cor", linkage = "average", alpha=0.05)
tail(shc_metric$p_norm, 10)

## using a function input to vecmet or any function not optimized for 
## computing dissimilarities for matrices will be incredibly slow
## should be avoided if possible
## Not run: 
vfun <- function(x, y) { 1 - cor(x, y) }
shc_vecmet <- shc(data, vecmet = vfun, linkage = "average", alpha=0.05)
tail(shc_vecmet$p_norm, 10)

## End(Not run)

## using a function input to matmet
mfun <- function(x) { as.dist(1-cor(x)) }
shc_mfun <- shc(data, matmet=mfun, linkage="average", alpha=0.05)
tail(shc_mfun$p_norm, 10)