bicor: Biweight mid-correlation (bicor)
In matrixCorr: Collection of Correlation and Association Estimators

bicor

R Documentation

Biweight mid-correlation (bicor)

Description

Computes pairwise biweight mid-correlations for numeric data. Bicor is a robust, Pearson-like correlation that down-weights outliers and heavy-tailed observations. Optional large-sample confidence intervals are available as a derived feature.

Usage

bicor(
  data,
  na_method = c("error", "pairwise"),
  ci = FALSE,
  conf_level = 0.95,
  n_threads = getOption("matrixCorr.threads", 1L),
  output = c("matrix", "sparse", "edge_list"),
  threshold = 0,
  diag = TRUE,
  c_const = 9,
  max_p_outliers = 1,
  pearson_fallback = c("hybrid", "none", "all"),
  mad_consistent = FALSE,
  w = NULL,
  sparse_threshold = NULL
)

diag.bicor(x, ...)

## S3 method for class 'bicor'
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  show_ci = NULL,
  na_print = "NA",
  ...
)

## S3 method for class 'bicor'
plot(
  x,
  title = "Biweight mid-correlation heatmap",
  reorder = c("none", "hclust"),
  triangle = c("full", "lower", "upper"),
  low_color = "indianred1",
  mid_color = "white",
  high_color = "steelblue1",
  value_text_size = 3,
  ci_text_size = 2.5,
  show_value = TRUE,
  na_fill = "grey90",
  ...
)

## S3 method for class 'bicor'
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  p_digits = 4,
  show_ci = NULL,
  ...
)

## S3 method for class 'summary.bicor'
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Arguments

`data`	A numeric matrix or a data frame containing numeric columns. Factors, logicals and common time classes are dropped in the data-frame path. Missing values are not allowed unless `na_method = "pairwise"`.
`na_method`	One of `"error"` (default, fastest) or `"pairwise"`. With `"pairwise"`, each `(j,k)` correlation is computed on the intersection of non-missing rows for the pair.
`ci`	Logical (default `FALSE`). If `TRUE`, attach approximate Fisher-z confidence intervals for the off-diagonal biweight mid-correlations. This confidence interval is provided as an additional large-sample approximation.
`conf_level`	Confidence level used when `ci = TRUE`. Default is `0.95`.
`n_threads`	Integer `\geq 1`. Number of OpenMP threads. Defaults to `getOption("matrixCorr.threads", 1L)`.
`output`	Output representation for the computed estimates. `"matrix"` (default): full dense matrix; best when you need matrix algebra, dense heatmaps, or full compatibility with existing code. `"sparse"`: sparse matrix from Matrix containing only retained entries; best when many values are dropped by thresholding. `"edge_list"`: long-form data frame with columns `row`, `col`, `value`; convenient for filtering, joins, and network-style workflows.
`threshold`	Non-negative absolute-value filter for non-matrix outputs: keep entries with `abs(value) >= threshold`. Use `threshold > 0` when you want only stronger associations (typically with `output = "sparse"` or `"edge_list"`). Keep `threshold = 0` to retain all values. Must be `0` when `output = "matrix"`.
`diag`	Logical; whether to include diagonal entries in `"sparse"` and `"edge_list"` outputs.
`c_const`	Positive numeric. Tukey biweight tuning constant applied to the raw MAD; default `9` (Langfelder & Horvath's convention).
`max_p_outliers`	Numeric in `(0, 1]`. Optional cap on the maximum proportion of outliers on each side; if `< 1`, side-specific rescaling maps those quantiles to `\|u\|=1`. Use `1` to disable.
`pearson_fallback`	Character scalar indicating the fallback policy. One of: `"hybrid"` (default): if a column has MAD = 0, that column uses Pearson standardisation, yielding a hybrid correlation. `"none"`: return `NA` if a column has MAD = 0 or becomes degenerate after weighting. `"all"`: force ordinary Pearson for all columns.
`mad_consistent`	Logical; if `TRUE`, use the normal-consistent MAD (`MAD_raw * 1.4826`) in the bicor weights. Default `FALSE` to match Langfelder & Horvath (2012).
`w`	Optional non-negative numeric vector of length `nrow(data)` giving row weights. When supplied, weighted medians/MADs are used and Tukey weights are multiplied by `w` before normalisation.
`sparse_threshold`	Optional numeric `\geq 0`. If supplied, sets entries with `\|r\| < sparse_threshold` to 0 and returns a sparse `"ddiMatrix"` (requires Matrix).
`x`	An object of class `summary.bicor`.
`...`	Additional arguments passed to `ggplot2::theme()` or other layers.
`digits`	Integer; number of decimal places used for the matrix.
`n`	Optional row threshold for compact preview output.
`topn`	Optional number of leading/trailing rows to show when truncated.
`max_vars`	Optional maximum number of visible columns; `NULL` derives this from console width.
`width`	Optional display width; defaults to `getOption("width")`.
`ci_digits`	Integer; digits for bicor confidence limits in the pairwise summary.
`show_ci`	One of `"yes"` or `"no"`.
`na_print`	Character; how to display missing values.
`title`	Plot title. Default is `"Biweight mid-correlation heatmap"`.
`reorder`	Character; one of `"none"` (default) or `"hclust"`. If `"hclust"`, variables are reordered by complete-linkage clustering on the distance `d = 1 - r`, after replacing `NA` by 0 for clustering purposes only.
`triangle`	One of `"full"` (default), `"lower"`, or `"upper"` to display the full matrix or a single triangle.
`low_color`, `mid_color`, `high_color`	Colours for the gradient in `scale_fill_gradient2`. Defaults are `"indianred1"`, `"white"`, `"steelblue1"`.
`value_text_size`	Numeric; font size for cell labels. Set to `NULL` to suppress labels (recommended for large matrices).
`ci_text_size`	Text size for confidence-interval labels in the heatmap.
`show_value`	Logical; if `TRUE` (default), overlay numeric values on the heatmap tiles.
`na_fill`	Fill colour for `NA` cells. Default `"grey90"`.
`object`	An object of class `bicor`.
`p_digits`	Integer; digits for bicor p-values in the pairwise summary.

Details

For a column x = (x_a)_{a=1}^m, let \mathrm{med}(x) be the median and \mathrm{MAD}(x) = \mathrm{med}(|x - \mathrm{med}(x)|) the (raw) median absolute deviation. If mad_consistent = TRUE, the consistent scale \mathrm{MAD}^\star(x) = 1.4826\,\mathrm{MAD}(x) is used. With tuning constant c>0, define

u_a = \frac{x_a - \mathrm{med}(x)}{c\,\mathrm{MAD}^{(\star)}(x)}.

The Tukey biweight gives per-observation weights

w_a = (1 - u_a^2)^2\,\mathbf{1}\{|u_a| < 1\}.

Robust standardisation of a column is

\tilde x_a = \frac{(x_a - \mathrm{med}(x))\,w_a}{ \sqrt{\sum_{b=1}^m \big[(x_b - \mathrm{med}(x))\,w_b\big]^2}}.

For two columns x,y, the biweight mid-correlation is

\mathrm{bicor}(x,y) = \sum_{a=1}^m \tilde x_a\,\tilde y_a \in [-1,1].

Capping the maximum proportion of outliers (max_p_outliers). If max_p_outliers < 1, let q_L = Q_x(\text{max\_p\_outliers}) and q_U = Q_x(1-\text{max\_p\_outliers}) be the lower/upper quantiles of x. If the corresponding |u| at either quantile exceeds 1, u is rescaled separately on the negative and positive sides so that those quantiles land at |u|=1. This guarantees that all observations between the two quantiles receive positive weight. Note the bound applies per side, so up to 2\,\text{max\_p\_outliers} of observations can be treated as outliers overall.

Fallback when for zero MAD / degeneracy (pearson_fallback). If a column has \mathrm{MAD}=0 or the robust denominator becomes zero, the following rules apply:

"none" when correlations involving that column are NA (diagonal remains 1).
"hybrid" when only the affected column switches to Pearson standardisation \bar x_a = (x_a - \overline{x}) / \sqrt{\sum_b (x_b - \overline{x})^2}, yielding the hybrid correlation

\mathrm{bicor}_{\mathrm{hyb}}(x,y) = \sum_a \bar x_a\,\tilde y_a,

with the other column still robust-standardised.
"all" when all columns use ordinary Pearson standardisation; the result equals stats::cor(..., method="pearson") when the NA policy matches.

Handling missing values (na_method).

"error" (default): inputs must be finite; this yields a symmetric, positive semidefinite (PSD) matrix since R = \tilde X^\top \tilde X.
"pairwise": each R_{jk} is computed on the intersection of rows where both columns are finite. Pairs with fewer than 5 overlapping rows return NA (guarding against instability). Pairwise deletion can break PSD, as in the Pearson case.

Row weights (w). When w is supplied (non-negative, length m), the weighted median \mathrm{med}_w(x) and weighted MAD \mathrm{MAD}_w(x) = \mathrm{med}_w(|x - \mathrm{med}_w(x)|) are used to form u. The Tukey weights are then multiplied by the observation weights prior to normalisation:

\tilde x_a = \frac{(x_a - \mathrm{med}_w(x))\,w_a\,w^{(\mathrm{obs})}_a}{ \sqrt{\sum_b \big[(x_b - \mathrm{med}_w(x))\,w_b\,w^{(\mathrm{obs})}_b\big]^2}},

where w^{(\mathrm{obs})}_a \ge 0 are the user-supplied row weights and w_a are the Tukey biweights built from the weighted median/MAD. Weighted pairwise behaves analogously on each column pair's overlap.

MAD choice (mad_consistent). Setting mad_consistent = TRUE multiplies the raw MAD by 1.4826 inside u. Equivalently, it uses an effective tuning constant c^\star = c \times 1.4826. The default FALSE reproduces the convention in Langfelder & Horvath (2012).

Optional sparsification (sparse_threshold). If provided, entries with |r| < \text{sparse\_threshold} are set to 0 and the result is returned as a "ddiMatrix" (diagonal is forced to 1). This is a post-processing step that does not alter the per-pair estimates.

Computation and threads. Columns are robust-standardised in parallel and the matrix is formed as R = \tilde X^\top \tilde X. n_threads selects the number of OpenMP threads; by default it uses getOption("matrixCorr.threads", 1L).

Large-sample inference. For a pairwise estimate r computed from n_{jk} observed rows, the standard large-sample summaries use

Z_{jk} = \frac{1}{2}\log\!\left(\frac{1 + r}{1 - r}\right)\sqrt{n_{jk} - 2}

and

T_{jk} = \sqrt{n_{jk} - 2}\;\frac{|r|}{\sqrt{1-r^2}}.

The reported p-value is the two-sided Student-t tail probability with n_{jk}-2 degrees of freedom. When ci = TRUE, the package also reports an approximate Fisher-z confidence interval obtained from

z_{jk} = \operatorname{atanh}(r), \qquad \operatorname{SE}(z_{jk}) = \frac{1}{\sqrt{n_{jk}-3}},

followed by back-transformation with tanh(). Confidence intervals are currently available only for dense, unweighted outputs.

Basic properties. \mathrm{bicor}(a x + b,\; c y + d) = \mathrm{sign}(ac)\,\mathrm{bicor}(x,y). With no missing data (and with per-column hybrid/robust standardisation), the output is symmetric and PSD. As with Pearson, affine equivariance does not hold for the associated biweight midcovariance.

Value

A symmetric correlation matrix with class bicor (or a dgCMatrix if sparse_threshold is used), with attributes: method = "biweight_mid_correlation", description, and package = "matrixCorr". Downstream code should be prepared to handle either a dense numeric matrix or a sparse dgCMatrix. When ci = TRUE, the object also carries a ci attribute with elements est, lwr.ci, upr.ci, conf.level, and ci.method, together with an inference attribute containing the standard large-sample summary matrices estimate, statistic, p_value, Z, and n_obs. Pairwise complete-case counts are stored in attr(x, "diagnostics")$n_complete. Internally, all medians/MADs, Tukey weights, optional pairwise-NA handling, and OpenMP loops are implemented in the C++ helpers (bicor_*_cpp()), so the R wrapper mostly validates arguments and dispatches to the appropriate backend.

Invisibly returns x.

A ggplot object.

Author(s)

Thiago de Paula Oliveira

References

Langfelder, P. & Horvath, S. (2012). Fast R Functions for Robust Correlations and Hierarchical Clustering. Journal of Statistical Software, 46(11), 1-17. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.18637/jss.v046.i11")}

Examples

set.seed(1)
X <- matrix(rnorm(2000 * 40), 2000, 40)
R <- bicor(X, c_const = 9, max_p_outliers = 1,
                       pearson_fallback = "hybrid")
print(attr(R, "method"))
summary(R)
R_ci <- bicor(X[, 1:5], ci = TRUE)
summary(R_ci)

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(R)
}

matrixCorr documentation built on April 18, 2026, 5:06 p.m.