3. Robust and High-Dimensional Correlation

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)

Scope

This vignette covers the estimators used when ordinary wide-data correlation is not enough. The package provides two broad extensions: robust estimators that limit the influence of outliers, and estimators for conditional or high-dimensional structure.

The relevant functions are:

- Robust correlation: bicor(), pbcor(), wincor(), and skipped_corr()
- Partial correlation: pcorr()
- High-dimensional shrinkage correlation: shrinkage_corr()

Robust correlation

Outliers can distort ordinary correlation severely, especially in moderate sample sizes. The robust estimators in matrixCorr are designed to limit that effect, but they do so in different ways.

library(matrixCorr)

set.seed(20)
Y <- data.frame(
  x1 = rnorm(60),
  x2 = rnorm(60),
  x3 = rnorm(60),
  x4 = rnorm(60)
)

idx <- sample.int(nrow(Y), 5)
Y$x1[idx] <- Y$x1[idx] + 8
Y$x2[idx] <- Y$x2[idx] - 6

R_pear <- pearson_corr(Y)
R_bicor <- bicor(Y)
R_pb <- pbcor(Y)
R_win <- wincor(Y)
R_skip <- skipped_corr(Y)

summary(R_pear)
summary(R_bicor)

In practical terms:

- bicor() computes the biweight midcorrelation, which downweights extreme observations smoothly rather than discarding them.
- pbcor() computes the percentage-bend correlation, bounding the influence of a fixed fraction of extreme values.
- wincor() Winsorizes each margin before correlating, so outliers are pulled toward the bulk of the data.
- skipped_corr() screens each pair of variables for outliers and removes the flagged points before computing the correlation.

That last property is useful, but it also makes skipped_corr() the most computationally expensive estimator in this group.

summary(R_skip)
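The screen-then-correlate idea behind skipped_corr() can be illustrated with a minimal base-R sketch. This is not the package's algorithm: the cutoff below is an ad hoc distance from the coordinate-wise medians in MAD units, used purely to show how removing flagged points restores the correlation of the clean bulk.

```r
set.seed(1)
x <- rnorm(60)
y <- 0.6 * x + rnorm(60, sd = 0.8)
x[1:5] <- x[1:5] + 8   # five gross outliers
y[1:5] <- y[1:5] - 6

r_all <- cor(x, y)     # ordinary Pearson, distorted by the outliers

# crude pairwise screen: distance from the medians in MAD units
d <- sqrt(((x - median(x)) / mad(x))^2 + ((y - median(y)) / mad(y))^2)
keep <- d < 5          # illustrative cutoff only
r_clean <- cor(x[keep], y[keep])

round(c(pearson = r_all, screened = r_clean), 2)
```

The package's screening rule is more principled than this cutoff, but the shape of the computation, screen first and correlate afterwards, is the same, which is also why it costs more than the other robust estimators.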

Inference for robust estimators

Inference is not uniform across the robust estimators, so it is worth stating the current scope explicitly.

fit_bicor_ci <- bicor(Y, ci = TRUE)
summary(fit_bicor_ci)

fit_pb_inf <- pbcor(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1)
summary(fit_pb_inf)

fit_win_inf <- wincor(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1)
summary(fit_win_inf)

fit_skip_inf <- skipped_corr(Y, ci = TRUE, p_value = TRUE, n_boot = 200, seed = 1)
summary(fit_skip_inf)

For pbcor(), wincor(), and skipped_corr(), the inferential layer is more computationally demanding than the estimate-only path because it is built from bootstrap resampling. The skipped-correlation route remains the most expensive of the three because each bootstrap sample also repeats the pairwise outlier screening step. The vignette uses n_boot = 200 only to keep runtime modest; substantive analyses usually need more resamples.
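The percentile-bootstrap idea behind these intervals can be sketched in a few lines of base R. This is a simplified stand-in, using Spearman correlation in place of the package's robust estimators: resample rows with replacement, recompute the statistic, and take percentile quantiles as the interval.

```r
set.seed(2)
x <- rnorm(80)
y <- 0.5 * x + rnorm(80, sd = 0.9)

n_boot <- 200
boot_r <- replicate(n_boot, {
  i <- sample.int(length(x), replace = TRUE)  # resample rows
  cor(x[i], y[i], method = "spearman")
})

# percentile 95% interval
quantile(boot_r, c(0.025, 0.975))
```

With only 200 resamples the interval endpoints are themselves noisy, which is exactly why larger n_boot values are recommended for substantive work.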

Partial correlation

Partial correlation addresses a different target. Instead of summarising raw association, it estimates conditional association after accounting for the remaining variables.

set.seed(21)
n <- 300
x2 <- rnorm(n)
x1 <- 0.8 * x2 + rnorm(n, sd = 0.4)
x3 <- 0.8 * x2 + rnorm(n, sd = 0.4)
x4 <- 0.7 * x3 + rnorm(n, sd = 0.5)
x5 <- rnorm(n)
x6 <- rnorm(n)
X <- data.frame(x1, x2, x3, x4, x5, x6)

fit_pcor_sample <- pcorr(X, method = "sample")
fit_pcor_oas <- pcorr(X, method = "oas")
R_raw <- pearson_corr(X)

round(c(
  raw_x1_x3 = R_raw["x1", "x3"],
  partial_x1_x3 = fit_pcor_sample$pcor["x1", "x3"]
), 2)

print(fit_pcor_sample, digits = 2)
summary(fit_pcor_oas)

Here x1 and x3 are strongly associated in the raw correlation matrix because both depend on x2. Partial correlation is useful because it can separate that indirect association from the remaining direct structure.
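For intuition, the same quantity can be computed by hand from the precision matrix: if P is the inverse of the covariance matrix, the partial correlation between variables i and j given all others is -P[i, j] / sqrt(P[i, i] * P[j, j]). A minimal sketch with an illustrative common-cause structure (not the package's implementation):

```r
set.seed(3)
z <- rnorm(200)
a <- 0.8 * z + rnorm(200, sd = 0.4)
b <- 0.8 * z + rnorm(200, sd = 0.4)
M <- cbind(a = a, z = z, b = b)

P  <- solve(cov(M))                          # precision matrix
pc <- -P["a", "b"] / sqrt(P["a", "a"] * P["b", "b"])

round(c(raw = cor(a, b), partial_given_z = pc), 2)
```

The raw correlation between a and b is large because both load on z; the partial correlation given z is close to zero, mirroring the x1-x3 pattern above.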

The choice of method is important:

- method = "sample" inverts the ordinary sample covariance, which is appropriate when n is comfortably larger than p.
- method = "oas" applies Oracle Approximating Shrinkage to the covariance before inversion, which stabilises the estimate when p is large relative to n.

When the classical sample estimator is appropriate, partial correlation also supports p-values and Fisher-z confidence intervals.

Unlike the other correlation wrappers, pcorr() uses return_p_value = TRUE rather than p_value = TRUE. That is intentional: the p-value matrix is an optional returned component of the fitted list object.

fit_pcor_inf <- pcorr(
  X,
  method = "sample",
  return_p_value = TRUE,
  ci = TRUE
)

summary(fit_pcor_inf)

That inference layer is available only for the ordinary sample estimator in the low-dimensional setting. The regularised estimators are primarily estimation tools rather than direct inferential procedures in the current interface.
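For reference, the Fisher-z construction behind those intervals is simple to write down. The helper below is hypothetical, not a package function; it assumes the usual standard error 1 / sqrt(n - k - 3) for a partial correlation with k conditioning variables.

```r
# CI for a partial correlation r, sample size n, k conditioning variables
fisher_ci <- function(r, n, k, level = 0.95) {
  z  <- atanh(r)                      # Fisher z-transform
  se <- 1 / sqrt(n - k - 3)
  q  <- qnorm(1 - (1 - level) / 2)
  tanh(z + c(-1, 1) * q * se)         # back-transform the endpoints
}

round(fisher_ci(r = 0.4, n = 300, k = 4), 3)
```

Note the interval is asymmetric around r after the back-transform, which is the expected behaviour near the boundaries of [-1, 1].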

High-dimensional shrinkage correlation

When the number of variables is large relative to the sample size, direct sample correlation can become unstable. shrinkage_corr() provides a shrinkage correlation route designed for exactly this setting.

set.seed(22)
n <- 40
p_block <- 30

make_block <- function(f) {
  sapply(seq_len(p_block), function(j) 0.8 * f + rnorm(n, sd = 0.8))
}

f1 <- rnorm(n)
f2 <- rnorm(n)
f3 <- rnorm(n)

Xd <- cbind(make_block(f1), make_block(f2), make_block(f3))
p <- ncol(Xd)
colnames(Xd) <- paste0("G", seq_len(p))

R_raw <- stats::cor(Xd)
fit_shr <- shrinkage_corr(Xd)
block <- rep(1:3, each = p_block)
within <- outer(block, block, "==") & upper.tri(R_raw)
between <- outer(block, block, "!=") & upper.tri(R_raw)

round(c(
  raw_within = mean(abs(R_raw[within])),
  raw_between = mean(abs(R_raw[between])),
  shrink_within = mean(abs(fit_shr[within])),
  shrink_between = mean(abs(fit_shr[between]))
), 3)

print(fit_shr, digits = 2, max_rows = 6, max_vars = 6)
summary(fit_shr)

The shrinkage estimator is not intended to reproduce the raw sample matrix. Its purpose is to provide a better-conditioned and more stable estimate in settings where the sample matrix is noisy or singular. In this block-factor simulation, the raw sample matrix retains the main within-block pattern but also carries substantial between-block noise; shrinkage pulls that background noise down without erasing the dominant structure.
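Mechanically, linear shrinkage replaces the sample matrix R with the convex combination (1 - lambda) * R + lambda * T for a target T, often the identity. The sketch below uses a fixed, arbitrary lambda purely to show the conditioning effect; shrinkage_corr() chooses its intensity from the data rather than by hand.

```r
set.seed(4)
n <- 40
p <- 90
Xs <- matrix(rnorm(n * p), n, p)   # p > n: sample correlation is singular

R      <- cor(Xs)
lambda <- 0.3                      # illustrative value only
Rs     <- (1 - lambda) * R + lambda * diag(p)

round(c(
  min_eig_raw    = min(eigen(R,  symmetric = TRUE, only.values = TRUE)$values),
  min_eig_shrunk = min(eigen(Rs, symmetric = TRUE, only.values = TRUE)$values)
), 3)
```

The raw matrix has (numerically) zero eigenvalues because its rank is at most n - 1, while the shrunk matrix is bounded away from singularity by lambda.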

Choosing among these methods

The choice is usually governed by the main source of difficulty in the data:

- Outliers or heavy tails: use a robust estimator such as bicor(), pbcor(), wincor(), or skipped_corr().
- Association confounded by other measured variables: use partial correlation via pcorr().
- Many variables relative to the sample size: use shrinkage_corr().

These are not interchangeable estimators. They solve different problems, even though they share the same matrix-oriented calling style.



matrixCorr documentation built on April 18, 2026, 5:06 p.m.