Description Usage Arguments Details Value Author(s) References See Also Examples
Implements the Monte Carlo simulation based significance testing procedure for hierarchical clustering described in Kimes et al. (Biometrics, 2017). Statistical significance is evaluated at each node along the hierarchical tree (dendrogram) starting from the root using a Gaussian null hypothesis test. A corresponding family-wise error rate (FWER) controlling procedure is provided.
1 2 3 4 |
x |
a dataset with n rows and p columns, with observations in rows and features in columns. |
metric |
a string specifying the metric to be used in the hierarchical
clustering procedure. This must be a metric accepted by |
vecmet |
a function taking two vectors as input and returning a real-valued
number which specifies how dissimilarities should be computed between rows
of the data matrix ( |
matmet |
a function taking a matrix as input and returning an object of class
|
linkage |
a string specifying the linkage to be used in the hierarchical
clustering procedure. This must be a linkage accepted by
|
l |
an integer value specifying the power of the Minkowski distance, if used. (default = 2) |
alpha |
a value between 0 and 1 specifying the desired level of the test. If no FWER control is desired, simply set alpha to 1. (default = 1) |
icovest |
an integer (1, 2 or 3) specifying the null covariance
estimation method to be used. See |
bkgd_pca |
a logical value specifying whether to use scaled PCA scores over raw data to estimate the background noise. When FALSE, raw estimate is used; when TRUE, minimum of PCA and raw estimates is used. (default = FALSE) |
n_sim |
a numeric value specifying the number of simulations at each node for Monte Carlo testing. (default = 100) |
n_min |
an integer specifying the minimum number of observations needed to calculate a p-value. (default = 10) |
rcpp |
a logical value whether to use the |
ci |
a string vector specifying the cluster indices to be used for testing along the dendrogram. Currently, options include: "2CI", "linkage". (default = "2CI") |
null_alg |
a string vector specifying the clustering algorithms that
should be used to cluster the simulated null ditributions. Currently, options
include: "2means" and "hclust". While "hclust" typically provides greater power
for rotation invariant combinations of metric and linkage function,
if a non-rotation invariant metric, e.g. Pearson correlation, is used it is
recommended that the "2means" option is specified. Note, |
ci_idx |
an integer between 1 and |
ci_emp |
a logical value specifying whether to use the empirical
p-value from the CI based on |
When possible, the Rclusterpp
should be used for
hierarchical clustering by specifying rcpp = TRUE
, except in the case of clustering by Pearson
correlation (dist="cor"
), for which we make use of
WGCNA::cor
with the usual stats::hclust
.
The Rclusterpp
package is no longer available on CRAN and must be
installed from GitHub using devtools::install_github("nolanlab/Rclusterpp")
.
For standard minimum variance Ward's linkage clustering, if rcpp
is
TRUE
, specify "ward" for linkage
. However, if rcpp
is
FALSE
, then "ward.D2" should be specified. See stats::hclust
for details on changes to the function since R >= 3.0.4.
By default, p-values are computed for all nodes (alpha = 1
).
If the procedure should instead terminate when no further nodes
meet a desired FWER control threshold, simply specify alpha < 1
.
Controlling the FWER may substantially speed up
the procedure by reducing the number of tests considered.
Even if the procedure is run using alpha = 1
, the FWER cutoffs may still be
computed using fwer_cutoff
. The FWER thresholding can also be visualized by
specifying alpha
when calling plot
.
The input metric
can either be a character string specifying a metric
recognized by dist()
or "cor"
for Pearson correlation.
Alternatively, a user-defined dissimilarity function can be specified suing either
the matmet
or vecmet
parameter. If specified, matmet
should be a
function which takes a matrix as input and returns an object of class dist
from the columns of the matrix. If specified, vecmet
must be a function
which computes a real-valued dissimilarity value between two input vectors. This
function will be used to compute the dissimilarilty between columns of the data matrix.
If either matmet
or vecmet
is specified, metric
will be ignored.
If both matmet
and vecmet
are specified, the function exit with an error.
Examples using metric
, matmet
, and vecmet
are provided below.
The function returns a shc
S3-object containing the
resulting p-values. The plot
method can be used to generate a dendrogram
with the corresponding p-values placed at each merge. The shc
object has following attributes:
in_mat
: the original data matrix passed to the constructor
in_args
: a list of the original parameters passed to the constructor
eigval_dat
: a matrix with each row containing the sample eigenvalues for a
subtree tested along the dendrogram
eigval_sim
: a matrix with each row containing the estimated eigenvalues used to
simulate null data at a subtree tested along the dendrogram
backvar
: a vector containing the estimated background variances used for
computing eigval_sim
nd_type
: a vector of length n-1 taking values in "n_small
",
"no_test
", "sig
", "not_sig
", or "NA
" specifying how each
node along the dendrogram was handled by the iterative testing procedure
ci_dat
: a matrix containing the cluster indices for the original
data matrix passed to the constructor
ci_sim
: a 3-dimensional array containing the simulated cluster
indices at each subtree tested along the dendrogram
p_emp
: a matrix containing the emprical p-values computed at each
subtree tested along the dendrogram
p_norm
: a matrix containing the Gaussian approximate p-values
computed at each subtree tested along the dendrogram
idx_hc
: a list of tuples containing the indices of clusters joined
at each step of the hierarchical clustering procedure
hc_dat
: a hclust
object constructed from the original data matrix and
arguments passed to the constructor
Patrick Kimes
Kimes, P. K., Hayes, D. N., Liu Y., and Marron, J. S. (2016) Statistical significance for hierarchical clustering. pre-print available.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 | ## using a string input to metric
data <- rbind(matrix(rnorm(100, mean = 2), ncol = 2),
matrix(rnorm(100, mean = -1), ncol = 2))
shc_metric <- shc(data, metric = "cor", linkage = "average", alpha=0.05)
tail(shc_metric$p_norm, 10)
## using a function input to vecmet or any function not optimized for
## computing dissimilarities for matrices will be incredibly slow
## should be avoided if possible
## Not run:
vfun <- function(x, y) { 1 - cor(x, y) }
shc_vecmet <- shc(data, vecmet = vfun, linkage = "average", alpha=0.05)
tail(shc_vecmet$p_norm, 10)
## End(Not run)
## using a function input to matmet
mfun <- function(x) { as.dist(1-cor(x)) }
shc_mfun <- shc(data, matmet=mfun, linkage="average", alpha=0.05)
tail(shc_mfun$p_norm, 10)
|
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.