bicor: Biweight mid-correlation (bicor)

View source: R/bicor.R

bicorR Documentation

Biweight mid-correlation (bicor)

Description

Computes pairwise biweight mid-correlations for numeric data. Bicor is a robust, Pearson-like correlation that down-weights outliers and heavy-tailed observations. Optional large-sample confidence intervals are available as a derived feature.

Usage

bicor(
  data,
  na_method = c("error", "pairwise"),
  ci = FALSE,
  conf_level = 0.95,
  n_threads = getOption("matrixCorr.threads", 1L),
  output = c("matrix", "sparse", "edge_list"),
  threshold = 0,
  diag = TRUE,
  c_const = 9,
  max_p_outliers = 1,
  pearson_fallback = c("hybrid", "none", "all"),
  mad_consistent = FALSE,
  w = NULL,
  sparse_threshold = NULL
)

diag.bicor(x, ...)

## S3 method for class 'bicor'
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  show_ci = NULL,
  na_print = "NA",
  ...
)

## S3 method for class 'bicor'
plot(
  x,
  title = "Biweight mid-correlation heatmap",
  reorder = c("none", "hclust"),
  triangle = c("full", "lower", "upper"),
  low_color = "indianred1",
  mid_color = "white",
  high_color = "steelblue1",
  value_text_size = 3,
  ci_text_size = 2.5,
  show_value = TRUE,
  na_fill = "grey90",
  ...
)

## S3 method for class 'bicor'
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  p_digits = 4,
  show_ci = NULL,
  ...
)

## S3 method for class 'summary.bicor'
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Arguments

data

A numeric matrix or a data frame containing numeric columns. Factors, logicals and common time classes are dropped in the data-frame path. Missing values are not allowed unless na_method = "pairwise".

na_method

One of "error" (default, fastest) or "pairwise". With "pairwise", each (j,k) correlation is computed on the intersection of non-missing rows for the pair.

ci

Logical (default FALSE). If TRUE, attach approximate Fisher-z confidence intervals for the off-diagonal biweight mid-correlations. This confidence interval is provided as an additional large-sample approximation.

conf_level

Confidence level used when ci = TRUE. Default is 0.95.

n_threads

Integer \geq 1. Number of OpenMP threads. Defaults to getOption("matrixCorr.threads", 1L).

output

Output representation for the computed estimates.

  • "matrix" (default): full dense matrix; best when you need matrix algebra, dense heatmaps, or full compatibility with existing code.

  • "sparse": sparse matrix from Matrix containing only retained entries; best when many values are dropped by thresholding.

  • "edge_list": long-form data frame with columns row, col, value; convenient for filtering, joins, and network-style workflows.

threshold

Non-negative absolute-value filter for non-matrix outputs: keep entries with abs(value) >= threshold. Use threshold > 0 when you want only stronger associations (typically with output = "sparse" or "edge_list"). Keep threshold = 0 to retain all values. Must be 0 when output = "matrix".

diag

Logical; whether to include diagonal entries in "sparse" and "edge_list" outputs.

c_const

Positive numeric. Tukey biweight tuning constant applied to the raw MAD; default 9 (Langfelder & Horvath's convention).

max_p_outliers

Numeric in (0, 1]. Optional cap on the maximum proportion of outliers on each side; if < 1, side-specific rescaling maps those quantiles to |u|=1. Use 1 to disable.

pearson_fallback

Character scalar indicating the fallback policy. One of:

  • "hybrid" (default): if a column has MAD = 0, that column uses Pearson standardisation, yielding a hybrid correlation.

  • "none": return NA if a column has MAD = 0 or becomes degenerate after weighting.

  • "all": force ordinary Pearson for all columns.

mad_consistent

Logical; if TRUE, use the normal-consistent MAD (MAD_raw * 1.4826) in the bicor weights. Default FALSE to match Langfelder & Horvath (2012).

w

Optional non-negative numeric vector of length nrow(data) giving row weights. When supplied, weighted medians/MADs are used and Tukey weights are multiplied by w before normalisation.

sparse_threshold

Optional numeric \geq 0. If supplied, sets entries with |r| < sparse_threshold to 0 and returns a sparse "ddiMatrix" (requires Matrix).

x

An object of class summary.bicor.

...

Additional arguments passed to ggplot2::theme() or other layers.

digits

Integer; number of decimal places used for the matrix.

n

Optional row threshold for compact preview output.

topn

Optional number of leading/trailing rows to show when truncated.

max_vars

Optional maximum number of visible columns; NULL derives this from console width.

width

Optional display width; defaults to getOption("width").

ci_digits

Integer; digits for bicor confidence limits in the pairwise summary.

show_ci

One of "yes" or "no".

na_print

Character; how to display missing values.

title

Plot title. Default is "Biweight mid-correlation heatmap".

reorder

Character; one of "none" (default) or "hclust". If "hclust", variables are reordered by complete-linkage clustering on the distance d = 1 - r, after replacing NA by 0 for clustering purposes only.

triangle

One of "full" (default), "lower", or "upper" to display the full matrix or a single triangle.

low_color, mid_color, high_color

Colours for the gradient in scale_fill_gradient2. Defaults are "indianred1", "white", "steelblue1".

value_text_size

Numeric; font size for cell labels. Set to NULL to suppress labels (recommended for large matrices).

ci_text_size

Text size for confidence-interval labels in the heatmap.

show_value

Logical; if TRUE (default), overlay numeric values on the heatmap tiles.

na_fill

Fill colour for NA cells. Default "grey90".

object

An object of class bicor.

p_digits

Integer; digits for bicor p-values in the pairwise summary.

Details

For a column x = (x_a)_{a=1}^m, let \mathrm{med}(x) be the median and \mathrm{MAD}(x) = \mathrm{med}(|x - \mathrm{med}(x)|) the (raw) median absolute deviation. If mad_consistent = TRUE, the consistent scale \mathrm{MAD}^\star(x) = 1.4826\,\mathrm{MAD}(x) is used. With tuning constant c>0, define

u_a = \frac{x_a - \mathrm{med}(x)}{c\,\mathrm{MAD}^{(\star)}(x)}.

The Tukey biweight gives per-observation weights

w_a = (1 - u_a^2)^2\,\mathbf{1}\{|u_a| < 1\}.

Robust standardisation of a column is

\tilde x_a = \frac{(x_a - \mathrm{med}(x))\,w_a}{ \sqrt{\sum_{b=1}^m \big[(x_b - \mathrm{med}(x))\,w_b\big]^2}}.

For two columns x,y, the biweight mid-correlation is

\mathrm{bicor}(x,y) = \sum_{a=1}^m \tilde x_a\,\tilde y_a \in [-1,1].

Capping the maximum proportion of outliers (max_p_outliers). If max_p_outliers < 1, let q_L = Q_x(\text{max\_p\_outliers}) and q_U = Q_x(1-\text{max\_p\_outliers}) be the lower/upper quantiles of x. If the corresponding |u| at either quantile exceeds 1, u is rescaled separately on the negative and positive sides so that those quantiles land at |u|=1. This guarantees that all observations between the two quantiles receive positive weight. Note the bound applies per side, so up to 2\,\text{max\_p\_outliers} of observations can be treated as outliers overall.

Fallback when for zero MAD / degeneracy (pearson_fallback). If a column has \mathrm{MAD}=0 or the robust denominator becomes zero, the following rules apply:

  • "none" when correlations involving that column are NA (diagonal remains 1).

  • "hybrid" when only the affected column switches to Pearson standardisation \bar x_a = (x_a - \overline{x}) / \sqrt{\sum_b (x_b - \overline{x})^2}, yielding the hybrid correlation

    \mathrm{bicor}_{\mathrm{hyb}}(x,y) = \sum_a \bar x_a\,\tilde y_a,

    with the other column still robust-standardised.

  • "all" when all columns use ordinary Pearson standardisation; the result equals stats::cor(..., method="pearson") when the NA policy matches.

Handling missing values (na_method).

  • "error" (default): inputs must be finite; this yields a symmetric, positive semidefinite (PSD) matrix since R = \tilde X^\top \tilde X.

  • "pairwise": each R_{jk} is computed on the intersection of rows where both columns are finite. Pairs with fewer than 5 overlapping rows return NA (guarding against instability). Pairwise deletion can break PSD, as in the Pearson case.

Row weights (w). When w is supplied (non-negative, length m), the weighted median \mathrm{med}_w(x) and weighted MAD \mathrm{MAD}_w(x) = \mathrm{med}_w(|x - \mathrm{med}_w(x)|) are used to form u. The Tukey weights are then multiplied by the observation weights prior to normalisation:

\tilde x_a = \frac{(x_a - \mathrm{med}_w(x))\,w_a\,w^{(\mathrm{obs})}_a}{ \sqrt{\sum_b \big[(x_b - \mathrm{med}_w(x))\,w_b\,w^{(\mathrm{obs})}_b\big]^2}},

where w^{(\mathrm{obs})}_a \ge 0 are the user-supplied row weights and w_a are the Tukey biweights built from the weighted median/MAD. Weighted pairwise behaves analogously on each column pair's overlap.

MAD choice (mad_consistent). Setting mad_consistent = TRUE multiplies the raw MAD by 1.4826 inside u. Equivalently, it uses an effective tuning constant c^\star = c \times 1.4826. The default FALSE reproduces the convention in Langfelder & Horvath (2012).

Optional sparsification (sparse_threshold). If provided, entries with |r| < \text{sparse\_threshold} are set to 0 and the result is returned as a "ddiMatrix" (diagonal is forced to 1). This is a post-processing step that does not alter the per-pair estimates.

Computation and threads. Columns are robust-standardised in parallel and the matrix is formed as R = \tilde X^\top \tilde X. n_threads selects the number of OpenMP threads; by default it uses getOption("matrixCorr.threads", 1L).

Large-sample inference. For a pairwise estimate r computed from n_{jk} observed rows, the standard large-sample summaries use

Z_{jk} = \frac{1}{2}\log\!\left(\frac{1 + r}{1 - r}\right)\sqrt{n_{jk} - 2}

and

T_{jk} = \sqrt{n_{jk} - 2}\;\frac{|r|}{\sqrt{1-r^2}}.

The reported p-value is the two-sided Student-t tail probability with n_{jk}-2 degrees of freedom. When ci = TRUE, the package also reports an approximate Fisher-z confidence interval obtained from

z_{jk} = \operatorname{atanh}(r), \qquad \operatorname{SE}(z_{jk}) = \frac{1}{\sqrt{n_{jk}-3}},

followed by back-transformation with tanh(). Confidence intervals are currently available only for dense, unweighted outputs.

Basic properties. \mathrm{bicor}(a x + b,\; c y + d) = \mathrm{sign}(ac)\,\mathrm{bicor}(x,y). With no missing data (and with per-column hybrid/robust standardisation), the output is symmetric and PSD. As with Pearson, affine equivariance does not hold for the associated biweight midcovariance.

Value

A symmetric correlation matrix with class bicor (or a dgCMatrix if sparse_threshold is used), with attributes: method = "biweight_mid_correlation", description, and package = "matrixCorr". Downstream code should be prepared to handle either a dense numeric matrix or a sparse dgCMatrix. When ci = TRUE, the object also carries a ci attribute with elements est, lwr.ci, upr.ci, conf.level, and ci.method, together with an inference attribute containing the standard large-sample summary matrices estimate, statistic, p_value, Z, and n_obs. Pairwise complete-case counts are stored in attr(x, "diagnostics")$n_complete. Internally, all medians/MADs, Tukey weights, optional pairwise-NA handling, and OpenMP loops are implemented in the C++ helpers (bicor_*_cpp()), so the R wrapper mostly validates arguments and dispatches to the appropriate backend.

Invisibly returns x.

A ggplot object.

Author(s)

Thiago de Paula Oliveira

References

Langfelder, P. & Horvath, S. (2012). Fast R Functions for Robust Correlations and Hierarchical Clustering. Journal of Statistical Software, 46(11), 1-17. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.18637/jss.v046.i11")}

Examples

set.seed(1)
X <- matrix(rnorm(2000 * 40), 2000, 40)
R <- bicor(X, c_const = 9, max_p_outliers = 1,
                       pearson_fallback = "hybrid")
print(attr(R, "method"))
summary(R)
R_ci <- bicor(X[, 1:5], ci = TRUE)
summary(R_ci)

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(R)
}


matrixCorr documentation built on April 18, 2026, 5:06 p.m.