pearson_corr | R Documentation |
Computes all pairwise Pearson correlation coefficients for the numeric columns of a matrix or data frame using a high-performance 'C++' backend.
This function uses a direct Pearson formula implementation in 'C++' to achieve fast and scalable correlation computations, especially for large datasets.
Prints a summary of the Pearson correlation matrix, including description and method metadata.
Generates a ggplot2-based heatmap of the Pearson correlation matrix.
pearson_corr(data)
## S3 method for class 'pearson_corr'
print(x, digits = 4, max_rows = NULL, max_cols = NULL, ...)
## S3 method for class 'pearson_corr'
plot(
x,
title = "Pearson correlation heatmap",
low_color = "indianred1",
high_color = "steelblue1",
mid_color = "white",
value_text_size = 4,
...
)
data |
A numeric matrix or a data frame with at least two numeric columns. All non-numeric columns will be excluded. Each column must have at least two non-missing values and contain no NAs. |
x |
An object of class |
digits |
Integer; number of decimal places to print in the concordance |
max_rows |
Optional integer; maximum number of rows to display.
If |
max_cols |
Optional integer; maximum number of columns to display.
If |
... |
Additional arguments passed to |
title |
Plot title. Default is |
low_color |
Color for the minimum correlation. Default is
|
high_color |
Color for the maximum correlation. Default is
|
mid_color |
Color for zero correlation. Default is |
value_text_size |
Font size for displaying correlation values. Default
is |
Statistical formulation. Let X \in \mathbb{R}^{n \times p}
be a
numeric matrix with rows as observations and columns as variables, and let
\mu = \tfrac{1}{n}\mathbf{1}^\top X
be the vector of column means.
Define the centred cross-product matrix
S \;=\; (X - \mathbf{1}\mu)^\top (X - \mathbf{1}\mu)
\;=\; X^\top X \;-\; n\,\mu\,\mu^\top.
The (unbiased) sample covariance is then
\widehat{\Sigma} \;=\; \tfrac{1}{n-1}\,S,
and the vector of sample standard deviations is
s_i \;=\; \sqrt{\widehat{\Sigma}_{ii}}, \qquad i=1,\ldots,p.
The Pearson correlation matrix R
is obtained by standardising
\widehat{\Sigma}
, given by:
R \;=\; D^{-1/2}\,\widehat{\Sigma}\,D^{-1/2}, \qquad
D \;=\; \mathrm{diag}(\widehat{\Sigma}_{11},\ldots,\widehat{\Sigma}_{pp}),
equivalently, entrywise
R_{ij} \;=\; \frac{\widehat{\Sigma}_{ij}}{s_i\,s_j}, \quad i \neq j,
\qquad R_{ii} \;=\; 1.
If s_i = 0
(zero variance),
the i
-th row and column are set to NA
. Tiny negative values on
the covariance diagonal caused by floating-point rounding are reduced to
zero before taking square roots. No missing values are permitted in X
.
The implementation forms X^\top X
via a
symmetric rank-k
update ('BLAS' 'SYRK') on the upper triangle, then
applies the rank-1 correction -\,n\,\mu\,\mu^\top
to obtain S
without explicitly materialising X - \mathbf{1}\mu
. After scaling by
1/(n-1)
, triangular normalisation by D^{-1/2}
yields R
,
which is then symmetrised. The dominant cost is O(n p^2)
flops with
O(p^2)
memory.
This implementation avoids calculating means explicitly and instead uses a numerically stable form.
A symmetric numeric matrix where the (i, j)
-th element is
the Pearson correlation between the i
-th and j
-th
numeric columns of the input.
Invisibly returns the pearson_corr
object.
A ggplot
object representing the heatmap.
Missing values are not allowed. Columns with fewer than two observations are excluded.
Thiago de Paula Oliveira toliveira@abacusbio.com
Thiago de Paula Oliveira
Pearson, K. (1895). "Notes on regression and inheritance in the case of two parents". Proceedings of the Royal Society of London, 58, 240–242.
print.pearson_corr
, plot.pearson_corr
## MVN with AR(1) correlation
set.seed(123)
p <- 6; n <- 300; rho <- 0.5
# true correlation
Sigma <- rho^abs(outer(seq_len(p), seq_len(p), "-"))
L <- chol(Sigma)
# MVN(n, 0, Sigma)
X <- matrix(rnorm(n * p), n, p) %*% L
colnames(X) <- paste0("V", seq_len(p))
pr <- pearson_corr(X)
print(pr, digits = 2)
plot(pr)
## Compare the sample estimate to the truth
Rhat <- cor(X)
# estimated
round(Rhat[1:4, 1:4], 2)
# true
round(Sigma[1:4, 1:4], 2)
off <- upper.tri(Sigma, diag = FALSE)
# MAE on off-diagonals
mean(abs(Rhat[off] - Sigma[off]))
## Larger n reduces sampling error
n2 <- 2000
X2 <- matrix(rnorm(n2 * p), n2, p) %*% L
Rhat2 <- cor(X2)
off <- upper.tri(Sigma, diag = FALSE)
## mean absolute error (MAE) of the off-diagonal correlations
mean(abs(Rhat2[off] - Sigma[off]))
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.