pearson_corr: Pairwise Pearson correlation
In matrixCorr: Collection of Correlation and Association Estimators

View source: R/pearson_corr.R

pearson_corr

R Documentation

Pairwise Pearson correlation

Description

Computes all pairwise Pearson correlation coefficients for the numeric columns of a matrix or data frame using a high-performance 'C++' backend.

This function uses a direct Pearson formula implementation in 'C++' to achieve fast and scalable correlation computations, especially for large datasets.

Prints a summary of the Pearson correlation matrix, including description and method metadata.

Generates a ggplot2-based heatmap of the Pearson correlation matrix.

Usage

pearson_corr(data)

## S3 method for class 'pearson_corr'
print(x, digits = 4, max_rows = NULL, max_cols = NULL, ...)

## S3 method for class 'pearson_corr'
plot(
  x,
  title = "Pearson correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  ...
)

Arguments

`data`	A numeric matrix or a data frame with at least two numeric columns. All non-numeric columns will be excluded. Each column must have at least two non-missing values and contain no NAs.
`x`	An object of class `pearson_corr`.
`digits`	Integer; number of decimal places to print in the concordance
`max_rows`	Optional integer; maximum number of rows to display. If `NULL`, all rows are shown.
`max_cols`	Optional integer; maximum number of columns to display. If `NULL`, all columns are shown.
`...`	Additional arguments passed to `ggplot2::theme()` or other `ggplot2` layers.
`title`	Plot title. Default is `"Pearson correlation heatmap"`.
`low_color`	Color for the minimum correlation. Default is `"indianred1"`.
`high_color`	Color for the maximum correlation. Default is `"steelblue1"`.
`mid_color`	Color for zero correlation. Default is `"white"`.
`value_text_size`	Font size for displaying correlation values. Default is `4`.

Details

Statistical formulation. Let X \in \mathbb{R}^{n \times p} be a numeric matrix with rows as observations and columns as variables, and let \mu = \tfrac{1}{n}\mathbf{1}^\top X be the vector of column means. Define the centred cross-product matrix

S \;=\; (X - \mathbf{1}\mu)^\top (X - \mathbf{1}\mu) \;=\; X^\top X \;-\; n\,\mu\,\mu^\top.

The (unbiased) sample covariance is then

\widehat{\Sigma} \;=\; \tfrac{1}{n-1}\,S,

and the vector of sample standard deviations is

s_i \;=\; \sqrt{\widehat{\Sigma}_{ii}}, \qquad i=1,\ldots,p.

The Pearson correlation matrix R is obtained by standardising \widehat{\Sigma}, given by:

R \;=\; D^{-1/2}\,\widehat{\Sigma}\,D^{-1/2}, \qquad D \;=\; \mathrm{diag}(\widehat{\Sigma}_{11},\ldots,\widehat{\Sigma}_{pp}),

equivalently, entrywise

R_{ij} \;=\; \frac{\widehat{\Sigma}_{ij}}{s_i\,s_j}, \quad i \neq j, \qquad R_{ii} \;=\; 1.

If s_i = 0 (zero variance), the i-th row and column are set to NA. Tiny negative values on the covariance diagonal caused by floating-point rounding are reduced to zero before taking square roots. No missing values are permitted in X.

The implementation forms X^\top X via a symmetric rank-k update ('BLAS' 'SYRK') on the upper triangle, then applies the rank-1 correction -\,n\,\mu\,\mu^\top to obtain S without explicitly materialising X - \mathbf{1}\mu. After scaling by 1/(n-1), triangular normalisation by D^{-1/2} yields R, which is then symmetrised. The dominant cost is O(n p^2) flops with O(p^2) memory.

This implementation avoids calculating means explicitly and instead uses a numerically stable form.

Value

A symmetric numeric matrix where the (i, j)-th element is the Pearson correlation between the i-th and j-th numeric columns of the input.

Invisibly returns the pearson_corr object.

A ggplot object representing the heatmap.

Note

Missing values are not allowed. Columns with fewer than two observations are excluded.

Author(s)

Thiago de Paula Oliveira toliveira@abacusbio.com

Thiago de Paula Oliveira

References

Pearson, K. (1895). "Notes on regression and inheritance in the case of two parents". Proceedings of the Royal Society of London, 58, 240–242.

Examples

## MVN with AR(1) correlation
set.seed(123)
p <- 6; n <- 300; rho <- 0.5
# true correlation
Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
L <- chol(Sigma)
# MVN(n, 0, Sigma)
X <- matrix(rnorm(n * p), n, p) %*% L
colnames(X) <- paste0("V", seq_len(p))

pr <- pearson_corr(X)
print(pr, digits = 2)
plot(pr)

## Compare the sample estimate to the truth
Rhat <- cor(X)
# estimated
round(Rhat[1:4, 1:4], 2)
# true
round(Sigma[1:4, 1:4], 2)
off <- upper.tri(Sigma, diag = FALSE)
# MAE on off-diagonals
mean(abs(Rhat[off] - Sigma[off]))

## Larger n reduces sampling error
n2 <- 2000
X2 <- matrix(rnorm(n2 * p), n2, p) %*% L
Rhat2 <- cor(X2)
off <- upper.tri(Sigma, diag = FALSE)
## mean absolute error (MAE) of the off-diagonal correlations
mean(abs(Rhat2[off] - Sigma[off]))

matrixCorr documentation built on Aug. 26, 2025, 5:07 p.m.