| bicor | R Documentation |
Computes pairwise biweight mid-correlations for numeric data. Bicor is a robust, Pearson-like correlation that down-weights outliers and heavy-tailed observations. Optional large-sample confidence intervals are available as a derived feature.
bicor(
data,
na_method = c("error", "pairwise"),
ci = FALSE,
conf_level = 0.95,
n_threads = getOption("matrixCorr.threads", 1L),
output = c("matrix", "sparse", "edge_list"),
threshold = 0,
diag = TRUE,
c_const = 9,
max_p_outliers = 1,
pearson_fallback = c("hybrid", "none", "all"),
mad_consistent = FALSE,
w = NULL,
sparse_threshold = NULL
)
diag.bicor(x, ...)
## S3 method for class 'bicor'
print(
x,
digits = 4,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
ci_digits = 3,
show_ci = NULL,
na_print = "NA",
...
)
## S3 method for class 'bicor'
plot(
x,
title = "Biweight mid-correlation heatmap",
reorder = c("none", "hclust"),
triangle = c("full", "lower", "upper"),
low_color = "indianred1",
mid_color = "white",
high_color = "steelblue1",
value_text_size = 3,
ci_text_size = 2.5,
show_value = TRUE,
na_fill = "grey90",
...
)
## S3 method for class 'bicor'
summary(
object,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
ci_digits = 3,
p_digits = 4,
show_ci = NULL,
...
)
## S3 method for class 'summary.bicor'
print(
x,
digits = NULL,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
show_ci = NULL,
...
)
data |
A numeric matrix or a data frame containing numeric columns.
Factors, logicals and common time classes are dropped in the data-frame
path. Missing values are not allowed unless |
na_method |
One of |
ci |
Logical (default |
conf_level |
Confidence level used when |
n_threads |
Integer |
output |
Output representation for the computed estimates.
|
threshold |
Non-negative absolute-value filter for non-matrix outputs:
keep entries with |
diag |
Logical; whether to include diagonal entries in
|
c_const |
Positive numeric. Tukey biweight tuning constant applied to the
raw MAD; default |
max_p_outliers |
Numeric in |
pearson_fallback |
Character scalar indicating the fallback policy. One of:
|
mad_consistent |
Logical; if |
w |
Optional non-negative numeric vector of length |
sparse_threshold |
Optional numeric |
x |
An object of class |
... |
Additional arguments passed to |
digits |
Integer; number of decimal places used for the matrix. |
n |
Optional row threshold for compact preview output. |
topn |
Optional number of leading/trailing rows to show when truncated. |
max_vars |
Optional maximum number of visible columns; |
width |
Optional display width; defaults to |
ci_digits |
Integer; digits for bicor confidence limits in the pairwise summary. |
show_ci |
One of |
na_print |
Character; how to display missing values. |
title |
Plot title. Default is |
reorder |
Character; one of |
triangle |
One of |
low_color, mid_color, high_color |
Colours for the gradient in
|
value_text_size |
Numeric; font size for cell labels. Set to |
ci_text_size |
Text size for confidence-interval labels in the heatmap. |
show_value |
Logical; if |
na_fill |
Fill colour for |
object |
An object of class |
p_digits |
Integer; digits for bicor p-values in the pairwise summary. |
For a column x = (x_a)_{a=1}^m, let \mathrm{med}(x) be the median and
\mathrm{MAD}(x) = \mathrm{med}(|x - \mathrm{med}(x)|) the (raw) median
absolute deviation. If mad_consistent = TRUE, the consistent scale
\mathrm{MAD}^\star(x) = 1.4826\,\mathrm{MAD}(x) is used. With tuning constant
c>0, define
u_a = \frac{x_a - \mathrm{med}(x)}{c\,\mathrm{MAD}^{(\star)}(x)}.
The Tukey biweight gives per-observation weights
w_a = (1 - u_a^2)^2\,\mathbf{1}\{|u_a| < 1\}.
Robust standardisation of a column is
\tilde x_a =
\frac{(x_a - \mathrm{med}(x))\,w_a}{
\sqrt{\sum_{b=1}^m \big[(x_b - \mathrm{med}(x))\,w_b\big]^2}}.
For two columns x,y, the biweight mid-correlation is
\mathrm{bicor}(x,y) = \sum_{a=1}^m \tilde x_a\,\tilde y_a \in [-1,1].
Capping the maximum proportion of outliers (max_p_outliers).
If max_p_outliers < 1, let q_L = Q_x(\text{max\_p\_outliers}) and
q_U = Q_x(1-\text{max\_p\_outliers}) be the lower/upper quantiles of x.
If the corresponding |u| at either quantile exceeds 1, u is rescaled
separately on the negative and positive sides so that those quantiles land at
|u|=1. This guarantees that all observations between the two quantiles receive
positive weight. Note the bound applies per side, so up to 2\,\text{max\_p\_outliers}
of observations can be treated as outliers overall.
Fallback when for zero MAD / degeneracy (pearson_fallback).
If a column has \mathrm{MAD}=0 or the robust denominator becomes zero,
the following rules apply:
"none" when correlations involving that column are NA (diagonal
remains 1).
"hybrid" when only the affected column switches to Pearson standardisation
\bar x_a = (x_a - \overline{x}) / \sqrt{\sum_b (x_b - \overline{x})^2},
yielding the hybrid correlation
\mathrm{bicor}_{\mathrm{hyb}}(x,y) = \sum_a \bar x_a\,\tilde y_a,
with the other column still robust-standardised.
"all" when all columns use ordinary Pearson standardisation; the result
equals stats::cor(..., method="pearson") when the NA policy matches.
Handling missing values (na_method).
"error" (default): inputs must be finite; this yields a symmetric,
positive semidefinite (PSD) matrix since R = \tilde X^\top \tilde X.
"pairwise": each R_{jk} is computed on the intersection of
rows where both columns are finite. Pairs with fewer than 5 overlapping
rows return NA (guarding against instability). Pairwise deletion can
break PSD, as in the Pearson case.
Row weights (w).
When w is supplied (non-negative, length m), the weighted median
\mathrm{med}_w(x) and weighted MAD
\mathrm{MAD}_w(x) = \mathrm{med}_w(|x - \mathrm{med}_w(x)|) are used to form
u. The Tukey weights are then multiplied by the observation weights prior
to normalisation:
\tilde x_a =
\frac{(x_a - \mathrm{med}_w(x))\,w_a\,w^{(\mathrm{obs})}_a}{
\sqrt{\sum_b \big[(x_b - \mathrm{med}_w(x))\,w_b\,w^{(\mathrm{obs})}_b\big]^2}},
where w^{(\mathrm{obs})}_a \ge 0 are the user-supplied row weights and w_a
are the Tukey biweights built from the weighted median/MAD. Weighted pairwise
behaves analogously on each column pair's overlap.
MAD choice (mad_consistent).
Setting mad_consistent = TRUE multiplies the raw MAD by 1.4826 inside
u. Equivalently, it uses an effective tuning constant
c^\star = c \times 1.4826. The default FALSE reproduces the convention
in Langfelder & Horvath (2012).
Optional sparsification (sparse_threshold).
If provided, entries with |r| < \text{sparse\_threshold} are set to 0 and the
result is returned as a "ddiMatrix" (diagonal is forced to 1). This is a
post-processing step that does not alter the per-pair estimates.
Computation and threads.
Columns are robust-standardised in parallel and the matrix is formed as
R = \tilde X^\top \tilde X. n_threads selects the number of OpenMP
threads; by default it uses getOption("matrixCorr.threads", 1L).
Large-sample inference.
For a pairwise estimate r computed from n_{jk} observed rows, the
standard large-sample summaries use
Z_{jk} = \frac{1}{2}\log\!\left(\frac{1 + r}{1 - r}\right)\sqrt{n_{jk} - 2}
and
T_{jk} = \sqrt{n_{jk} - 2}\;\frac{|r|}{\sqrt{1-r^2}}.
The reported p-value is the two-sided Student-t tail probability with
n_{jk}-2 degrees of freedom. When ci = TRUE, the package also
reports an approximate Fisher-z confidence interval obtained from
z_{jk} = \operatorname{atanh}(r), \qquad
\operatorname{SE}(z_{jk}) = \frac{1}{\sqrt{n_{jk}-3}},
followed by back-transformation with tanh(). Confidence intervals are currently available only for dense,
unweighted outputs.
Basic properties.
\mathrm{bicor}(a x + b,\; c y + d) = \mathrm{sign}(ac)\,\mathrm{bicor}(x,y).
With no missing data (and with per-column hybrid/robust standardisation), the
output is symmetric and PSD. As with Pearson, affine equivariance does not hold
for the associated biweight midcovariance.
A symmetric correlation matrix with class bicor
(or a dgCMatrix if sparse_threshold is used), with attributes:
method = "biweight_mid_correlation", description,
and package = "matrixCorr". Downstream code should be prepared to
handle either a dense numeric matrix or a sparse dgCMatrix. When
ci = TRUE, the object also carries a ci attribute with
elements est, lwr.ci, upr.ci, conf.level, and
ci.method, together with an inference attribute containing
the standard large-sample summary matrices estimate,
statistic, p_value, Z, and n_obs. Pairwise
complete-case counts are stored in attr(x, "diagnostics")$n_complete.
Internally, all medians/MADs, Tukey weights, optional pairwise-NA handling,
and OpenMP loops are implemented in the C++ helpers
(bicor_*_cpp()), so the R wrapper mostly validates arguments and
dispatches to the appropriate backend.
Invisibly returns x.
A ggplot object.
Thiago de Paula Oliveira
Langfelder, P. & Horvath, S. (2012). Fast R Functions for Robust Correlations and Hierarchical Clustering. Journal of Statistical Software, 46(11), 1-17. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.18637/jss.v046.i11")}
set.seed(1)
X <- matrix(rnorm(2000 * 40), 2000, 40)
R <- bicor(X, c_const = 9, max_p_outliers = 1,
pearson_fallback = "hybrid")
print(attr(R, "method"))
summary(R)
R_ci <- bicor(X[, 1:5], ci = TRUE)
summary(R_ci)
# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
view_corr_shiny(R)
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.