pearson_corr: Pairwise Pearson correlation

View source: R/pearson_corr.R

pearson_corrR Documentation

Pairwise Pearson correlation

Description

Computes pairwise Pearson correlations for the numeric columns of a matrix or data frame using a high-performance 'C++' backend. Optional Fisher-z confidence intervals are available.

Usage

pearson_corr(
  data,
  na_method = c("error", "pairwise"),
  ci = FALSE,
  conf_level = 0.95,
  n_threads = getOption("matrixCorr.threads", 1L),
  output = c("matrix", "sparse", "edge_list"),
  threshold = 0,
  diag = TRUE,
  ...
)

## S3 method for class 'pearson_corr'
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  show_ci = NULL,
  ...
)

## S3 method for class 'pearson_corr'
plot(
  x,
  title = "Pearson correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  ci_text_size = 3,
  show_value = TRUE,
  ...
)

## S3 method for class 'pearson_corr'
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  show_ci = NULL,
  ...
)

## S3 method for class 'summary.pearson_corr'
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Arguments

data

A numeric matrix or a data frame with at least two numeric columns. All non-numeric columns will be excluded. Each column must have at least two non-missing values.

na_method

Character scalar controlling missing-data handling. "error" rejects missing, NaN, and infinite values. "pairwise" recomputes each correlation on its own pairwise complete-case overlap.

ci

Logical (default FALSE). If TRUE, attach pairwise Fisher-z confidence intervals for the off-diagonal Pearson correlations.

conf_level

Confidence level used when ci = TRUE. Default is 0.95.

n_threads

Integer \geq 1. Number of OpenMP threads. Defaults to getOption("matrixCorr.threads", 1L).

output

Output representation for the computed estimates.

  • "matrix" (default): full dense matrix; best when you need matrix algebra, dense heatmaps, or full compatibility with existing code.

  • "sparse": sparse matrix from Matrix containing only retained entries; best when many values are dropped by thresholding.

  • "edge_list": long-form data frame with columns row, col, value; convenient for filtering, joins, and network-style workflows.

threshold

Non-negative absolute-value filter for non-matrix outputs: keep entries with abs(value) >= threshold. Use threshold > 0 when you want only stronger associations (typically with output = "sparse" or "edge_list"). Keep threshold = 0 to retain all values. Must be 0 when output = "matrix".

diag

Logical; whether to include diagonal entries in "sparse" and "edge_list" outputs.

...

Additional arguments passed to ggplot2::theme() or other ggplot2 layers.

x

An object of class summary.pearson_corr.

digits

Integer; number of decimal places to print in the concordance

n

Optional row threshold for compact preview output.

topn

Optional number of leading/trailing rows to show when truncated.

max_vars

Optional maximum number of visible columns; NULL derives this from console width.

width

Optional display width; defaults to getOption("width").

ci_digits

Integer; digits for Pearson confidence limits in the pairwise summary.

show_ci

One of "yes" or "no".

title

Plot title. Default is "Pearson correlation heatmap".

low_color

Color for the minimum correlation. Default is "indianred1".

high_color

Color for the maximum correlation. Default is "steelblue1".

mid_color

Color for zero correlation. Default is "white".

value_text_size

Font size for displaying correlation values. Default is 4.

ci_text_size

Text size for confidence intervals in the heatmap.

show_value

Logical; if TRUE (default), overlay numeric values on the heatmap tiles.

object

An object of class pearson_corr.

Details

Let X \in \mathbb{R}^{n \times p} be a numeric matrix with rows as observations and columns as variables, and let \mathbf{1} \in \mathbb{R}^n denote the all-ones vector. Define the column means \mu = (1/n)\,\mathbf{1}^\top X and the centred cross-product matrix

S \;=\; (X - \mathbf{1}\mu)^\top (X - \mathbf{1}\mu) \;=\; X^\top \!\Big(I_n - \tfrac{1}{n}\mathbf{1}\mathbf{1}^\top\Big) X \;=\; X^\top X \;-\; n\,\mu\,\mu^\top.

The (unbiased) sample covariance is

\widehat{\Sigma} \;=\; \tfrac{1}{n-1}\,S,

and the sample standard deviations are s_i = \sqrt{\widehat{\Sigma}_{ii}}. The Pearson correlation matrix is obtained by standardising \widehat{\Sigma}, and it is given by

R \;=\; D^{-1/2}\,\widehat{\Sigma}\,D^{-1/2}, \qquad D \;=\; \mathrm{diag}(\widehat{\Sigma}_{11},\ldots,\widehat{\Sigma}_{pp}),

equivalently, entrywise R_{ij} = \widehat{\Sigma}_{ij}/(s_i s_j) for i \neq j and R_{ii} = 1. With 1/(n-1) scaling, \widehat{\Sigma} is unbiased for the covariance; the induced correlations are biased in finite samples.

The implementation forms X^\top X via a BLAS symmetric rank-k update (SYRK) on the upper triangle, then applies the rank-1 correction -\,n\,\mu\,\mu^\top to obtain S without explicitly materialising X - \mathbf{1}\mu. After scaling by 1/(n-1), triangular normalisation by D^{-1/2} yields R, which is then symmetrised to remove round-off asymmetry. Tiny negative values on the covariance diagonal due to floating-point rounding are truncated to zero before taking square roots.

If a variable has zero variance (s_i = 0), the corresponding row and column of R are set to NA. When na_method = "pairwise", each (i,j) correlation is recomputed on the pairwise complete-case overlap of columns i and j.

When ci = TRUE, Fisher-z confidence intervals are computed from the observed pairwise Pearson correlation r_{ij} and the pairwise complete-case sample size n_{ij}:

z_{ij} = \operatorname{atanh}(r_{ij}), \qquad \operatorname{SE}(z_{ij}) = \frac{1}{\sqrt{n_{ij} - 3}}.

With z_{1-\alpha/2} = \Phi^{-1}(1 - \alpha/2), the confidence limits are

\tanh\!\bigl(z_{ij} - z_{1-\alpha/2}\operatorname{SE}(z_{ij})\bigr) \;\;\text{and}\;\; \tanh\!\bigl(z_{ij} + z_{1-\alpha/2}\operatorname{SE}(z_{ij})\bigr).

Confidence intervals are reported only when n_{ij} > 3.

Computational complexity. The dominant cost is O(n p^2) flops with O(p^2) memory.

Value

A symmetric numeric matrix where the (i, j)-th element is the Pearson correlation between the i-th and j-th numeric columns of the input. When ci = TRUE, the object also carries a ci attribute with elements est, lwr.ci, upr.ci, and conf.level. When pairwise-complete evaluation is used, pairwise sample sizes are stored in attr(x, "diagnostics")$n_complete.

Invisibly returns the pearson_corr object.

A ggplot object representing the heatmap.

Note

Missing values are rejected when na_method = "error". Columns with fewer than two usable observations are excluded.

Author(s)

Thiago de Paula Oliveira

References

Pearson, K. (1895). "Notes on regression and inheritance in the case of two parents". Proceedings of the Royal Society of London, 58, 240-242.

See Also

print.pearson_corr, plot.pearson_corr

Examples

## MVN with AR(1) correlation
set.seed(123)
p <- 6; n <- 300; rho <- 0.5
# true correlation
Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
L <- chol(Sigma)
# MVN(n, 0, Sigma)
X <- matrix(rnorm(n * p), n, p) %*% L
colnames(X) <- paste0("V", seq_len(p))

pr <- pearson_corr(X)
print(pr, digits = 2)
summary(pr)
plot(pr)

## Compare the sample estimate to the truth
Rhat <- cor(X)
# estimated
round(Rhat[1:4, 1:4], 2)
# true
round(Sigma[1:4, 1:4], 2)
off <- upper.tri(Sigma, diag = FALSE)
# MAE on off-diagonals
mean(abs(Rhat[off] - Sigma[off]))

## Larger n reduces sampling error
n2 <- 2000
X2 <- matrix(rnorm(n2 * p), n2, p) %*% L
Rhat2 <- cor(X2)
off <- upper.tri(Sigma, diag = FALSE)
## mean absolute error (MAE) of the off-diagonal correlations
mean(abs(Rhat2[off] - Sigma[off]))

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(pr)
}


matrixCorr documentation built on April 18, 2026, 5:06 p.m.