pearson_corr: Pairwise Pearson correlation

View source: R/pearson_corr.R

pearson_corrR Documentation

Pairwise Pearson correlation

Description

Computes all pairwise Pearson correlation coefficients for the numeric columns of a matrix or data frame using a high-performance 'C++' backend.

This function uses a direct Pearson formula implementation in 'C++' to achieve fast and scalable correlation computations, especially for large datasets.

Prints a summary of the Pearson correlation matrix, including description and method metadata.

Generates a ggplot2-based heatmap of the Pearson correlation matrix.

Usage

pearson_corr(data)

## S3 method for class 'pearson_corr'
print(x, digits = 4, max_rows = NULL, max_cols = NULL, ...)

## S3 method for class 'pearson_corr'
plot(
  x,
  title = "Pearson correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  ...
)

Arguments

data

A numeric matrix or a data frame with at least two numeric columns. All non-numeric columns will be excluded. Each column must have at least two non-missing values and contain no NAs.

x

An object of class pearson_corr.

digits

Integer; number of decimal places to print in the concordance

max_rows

Optional integer; maximum number of rows to display. If NULL, all rows are shown.

max_cols

Optional integer; maximum number of columns to display. If NULL, all columns are shown.

...

Additional arguments passed to ggplot2::theme() or other ggplot2 layers.

title

Plot title. Default is "Pearson correlation heatmap".

low_color

Color for the minimum correlation. Default is "indianred1".

high_color

Color for the maximum correlation. Default is "steelblue1".

mid_color

Color for zero correlation. Default is "white".

value_text_size

Font size for displaying correlation values. Default is 4.

Details

Statistical formulation. Let X \in \mathbb{R}^{n \times p} be a numeric matrix with rows as observations and columns as variables, and let \mu = \tfrac{1}{n}\mathbf{1}^\top X be the vector of column means. Define the centred cross-product matrix

S \;=\; (X - \mathbf{1}\mu)^\top (X - \mathbf{1}\mu) \;=\; X^\top X \;-\; n\,\mu\,\mu^\top.

The (unbiased) sample covariance is then

\widehat{\Sigma} \;=\; \tfrac{1}{n-1}\,S,

and the vector of sample standard deviations is

s_i \;=\; \sqrt{\widehat{\Sigma}_{ii}}, \qquad i=1,\ldots,p.

The Pearson correlation matrix R is obtained by standardising \widehat{\Sigma}, given by:

R \;=\; D^{-1/2}\,\widehat{\Sigma}\,D^{-1/2}, \qquad D \;=\; \mathrm{diag}(\widehat{\Sigma}_{11},\ldots,\widehat{\Sigma}_{pp}),

equivalently, entrywise

R_{ij} \;=\; \frac{\widehat{\Sigma}_{ij}}{s_i\,s_j}, \quad i \neq j, \qquad R_{ii} \;=\; 1.

If s_i = 0 (zero variance), the i-th row and column are set to NA. Tiny negative values on the covariance diagonal caused by floating-point rounding are reduced to zero before taking square roots. No missing values are permitted in X.

The implementation forms X^\top X via a symmetric rank-k update ('BLAS' 'SYRK') on the upper triangle, then applies the rank-1 correction -\,n\,\mu\,\mu^\top to obtain S without explicitly materialising X - \mathbf{1}\mu. After scaling by 1/(n-1), triangular normalisation by D^{-1/2} yields R, which is then symmetrised. The dominant cost is O(n p^2) flops with O(p^2) memory.

This implementation avoids calculating means explicitly and instead uses a numerically stable form.

Value

A symmetric numeric matrix where the (i, j)-th element is the Pearson correlation between the i-th and j-th numeric columns of the input.

Invisibly returns the pearson_corr object.

A ggplot object representing the heatmap.

Note

Missing values are not allowed. Columns with fewer than two observations are excluded.

Author(s)

Thiago de Paula Oliveira toliveira@abacusbio.com

Thiago de Paula Oliveira

References

Pearson, K. (1895). "Notes on regression and inheritance in the case of two parents". Proceedings of the Royal Society of London, 58, 240–242.

See Also

print.pearson_corr, plot.pearson_corr

Examples

## MVN with AR(1) correlation
set.seed(123)
p <- 6; n <- 300; rho <- 0.5
# true correlation
Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
L <- chol(Sigma)
# MVN(n, 0, Sigma)
X <- matrix(rnorm(n * p), n, p) %*% L
colnames(X) <- paste0("V", seq_len(p))

pr <- pearson_corr(X)
print(pr, digits = 2)
plot(pr)

## Compare the sample estimate to the truth
Rhat <- cor(X)
# estimated
round(Rhat[1:4, 1:4], 2)
# true
round(Sigma[1:4, 1:4], 2)
off <- upper.tri(Sigma, diag = FALSE)
# MAE on off-diagonals
mean(abs(Rhat[off] - Sigma[off]))

## Larger n reduces sampling error
n2 <- 2000
X2 <- matrix(rnorm(n2 * p), n2, p) %*% L
Rhat2 <- cor(X2)
off <- upper.tri(Sigma, diag = FALSE)
## mean absolute error (MAE) of the off-diagonal correlations
mean(abs(Rhat2[off] - Sigma[off]))


matrixCorr documentation built on Aug. 26, 2025, 5:07 p.m.