skipped_corr: Pairwise skipped correlation

View source: R/skipcor.R

skipped_corrR Documentation

Pairwise skipped correlation

Description

Computes all pairwise skipped correlation coefficients for the numeric columns of a matrix or data frame using a high-performance 'C++' backend.

Skipped correlation detects bivariate outliers using a projection rule and then computes Pearson or Spearman correlation on the retained observations. It is designed for situations where marginally robust methods can still be distorted by unusual points in the joint data cloud.

Usage

skipped_corr(
  data,
  method = c("pearson", "spearman"),
  na_method = c("error", "pairwise"),
  ci = FALSE,
  p_value = FALSE,
  conf_level = 0.95,
  n_threads = getOption("matrixCorr.threads", 1L),
  return_masks = FALSE,
  stand = TRUE,
  outlier_rule = c("idealf", "mad"),
  cutoff = sqrt(stats::qchisq(0.975, df = 2)),
  n_boot = 2000L,
  p_adjust = c("none", "hochberg", "ecp"),
  fwe_level = 0.05,
  n_mc = 1000L,
  seed = NULL,
  output = c("matrix", "sparse", "edge_list"),
  threshold = 0,
  diag = TRUE
)

skipped_corr_masks(x, var1 = NULL, var2 = NULL)

## S3 method for class 'skipped_corr'
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 4,
  show_ci = NULL,
  show_p = c("auto", "yes", "no"),
  ...
)

## S3 method for class 'skipped_corr'
plot(
  x,
  title = "Skipped correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  ci_text_size = 3,
  show_value = TRUE,
  ...
)

## S3 method for class 'skipped_corr'
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

## S3 method for class 'summary.skipped_corr'
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Arguments

data

A numeric matrix or a data frame with at least two numeric columns. All non-numeric columns will be excluded.

method

Correlation computed after removing projected outliers. One of "pearson" (default) or "spearman".

na_method

One of "error" (default) or "pairwise". With "error", the function requires all retained numeric columns to be free of missing or non-finite values and aborts otherwise. This is the recommended setting when you want a single common sample size across all pairs, reproducible skipped-row diagnostics on the same rows, or bootstrap inference via ci = TRUE / p_value = TRUE. With "pairwise", each variable pair is computed on its own overlap of finite rows. This is more permissive for incomplete data, but different pairs can be based on different effective samples and different skipped-row sets, so the resulting matrix is less directly comparable across entries.

ci

Logical; if TRUE, attach percentile-bootstrap confidence intervals for each skipped correlation using the Wilcox (2015) B2 resampling scheme. Default FALSE.

p_value

Logical; if TRUE, attach bootstrap p-values for testing whether each skipped correlation is zero. Default FALSE.

conf_level

Confidence level used when ci = TRUE. Default 0.95.

n_threads

Integer \geq 1. Number of OpenMP threads. Defaults to getOption("matrixCorr.threads", 1L).

return_masks

Logical; if TRUE, attach compact pairwise skipped-row indices as an attribute. Default FALSE.

stand

Logical; if TRUE (default), each variable in the pair is centred by its median and divided by a robust scale estimate before the projection outlier search. The scale estimate is the MAD when positive, with fallback to \mathrm{IQR}/1.34898 and then the usual sample standard deviation if needed. This standardisation affects only outlier detection, not the final correlation computed on the retained observations.

outlier_rule

One of "idealf" (default) or "mad". The default uses the ideal-fourths interquartile width of projected distances; "mad" uses the median absolute deviation of projected distances.

cutoff

Positive numeric constant multiplying the projected spread in the outlier rule \mathrm{med}(d_{i\cdot}) + cutoff \times s(d_{i\cdot}). Larger values flag fewer observations as outliers; smaller values flag more. Default sqrt(qchisq(0.975, df = 2)).

n_boot

Integer \geq 2. Number of bootstrap resamples used when ci = TRUE or p_value = TRUE. Default 2000.

p_adjust

One of "none" (default), "hochberg", or "ecp". Optional familywise-error procedure applied to the matrix of bootstrap p-values. "hochberg" corresponds to method H in Wilcox, Rousselet, and Pernet (2018); "ecp" corresponds to their simulated critical-p-value method ECP.

fwe_level

Familywise-error level used when p_adjust = "hochberg" or "ecp". Default 0.05.

n_mc

Integer \geq 10. Number of null Monte Carlo data sets used when p_adjust = "ecp" to estimate the critical p-value. Default 1000.

seed

Optional positive integer used to seed the bootstrap resampling when ci = TRUE or p_value = TRUE. If NULL, a fresh internal seed is generated.

output

Output representation for the computed estimates.

  • "matrix" (default): full dense matrix; best when you need matrix algebra, dense heatmaps, or full compatibility with existing code.

  • "sparse": sparse matrix from Matrix containing only retained entries; best when many values are dropped by thresholding.

  • "edge_list": long-form data frame with columns row, col, value; convenient for filtering, joins, and network-style workflows.

threshold

Non-negative absolute-value filter for non-matrix outputs: keep entries with abs(value) >= threshold. Use threshold > 0 when you want only stronger associations (typically with output = "sparse" or "edge_list"). Keep threshold = 0 to retain all values. Must be 0 when output = "matrix".

diag

Logical; whether to include diagonal entries in "sparse" and "edge_list" outputs.

x

An object of class summary.skipped_corr.

var1, var2

Optional column names or 1-based column indices used by skipped_corr_masks() to extract the skipped-row indices for one pair.

digits

Integer; number of digits to print.

n

Optional row threshold for compact preview output.

topn

Optional number of leading/trailing rows to show when truncated.

max_vars

Optional maximum number of visible columns; NULL derives this from console width.

width

Optional display width; defaults to getOption("width").

ci_digits

Integer; digits for skipped-correlation confidence limits.

show_ci

One of "yes" or "no".

show_p

One of "auto", "yes", "no". For print(), "auto" keeps the compact matrix-only display; use "yes" to also print pairwise p-values.

...

Additional arguments passed to the underlying print or plot helper.

title

Character; plot title.

low_color, high_color, mid_color

Colors used in the heatmap.

value_text_size

Numeric text size for overlaid cell values.

ci_text_size

Text size for confidence intervals in the heatmap.

show_value

Logical; if TRUE (default), overlay numeric values on the heatmap tiles.

object

An object of class skipped_corr.

Details

Let X \in \mathbb{R}^{n \times p} be a numeric matrix with rows as observations and columns as variables. For a given pair of columns (x, y), write the observed bivariate points as u_i = (x_i, y_i)^\top, i=1,\ldots,n. If stand = TRUE, each margin is first centred by its median and divided by a robust scale estimate before outlier detection; otherwise the original pair is used. The robust scale is the MAD when positive, with fallback to \mathrm{IQR}/1.34898 and then the usual sample standard deviation if needed. Let \tilde u_i denote the resulting points and let c be the componentwise median center of the detection cloud.

For each observation i, define the direction vector b_i = \tilde u_i - c. When \|b_i\| > 0, all observations are projected onto the line through c in direction b_i. The projected distances are

d_{ij} \;=\; \frac{|(\tilde u_j - c)^\top b_i|}{\|b_i\|}, \qquad j=1,\ldots,n.

For each direction i, observation j is flagged as an outlier if

d_{ij} \;>\; \mathrm{med}(d_{i\cdot}) + g\, s(d_{i\cdot}), \qquad g = \code{cutoff},

where s(\cdot) is either the ideal-fourths interquartile width (outlier_rule = "idealf") or the median absolute deviation (outlier_rule = "mad"). An observation is removed if it is flagged for at least one projection direction. The skipped correlation is then the ordinary Pearson or Spearman correlation computed from the retained observations:

r_{\mathrm{skip}}(x,y) \;=\; \mathrm{cor}\!\left(x_{\mathcal{K}}, y_{\mathcal{K}}\right),

where \mathcal{K} is the index set of observations not flagged as outliers.

Unlike marginally robust methods such as pbcor(), wincor(), or bicor(), skipped correlation is explicitly pairwise because outlier detection depends on the joint geometry of each variable pair. As a result, the reported matrix need not be positive semidefinite, even with complete data.

Computational notes. In the complete-data path, each column pair requires a full bivariate projection search, so the dominant cost is higher than for marginal robust methods. The implementation evaluates pairs in 'C++'; where available, pairs are processed with 'OpenMP' parallelism. With na_method = "pairwise", each pair is recomputed on its overlap of non-missing rows.

Bootstrap inference. When ci = TRUE or p_value = TRUE, the implementation uses the percentile-bootstrap strategy studied by Wilcox (2015). Each bootstrap replicate resamples whole observation pairs with replacement, reruns the skipped-correlation outlier detection on the resampled data, and recomputes the skipped correlation on the retained observations. This corresponds to Wilcox's B2 method and avoids the statistically unsatisfactory shortcut of removing outliers only once before bootstrapping. Bootstrap inference currently requires complete data (na_method = "error"). When p_adjust = "hochberg", the bootstrap p-values are processed with Hochberg's step-up procedure (method H in Wilcox, Rousselet, and Pernet, 2018). When p_adjust = "ecp", the package follows their ECP method and simulates n_mc null data sets from a p-variate normal distribution with identity covariance, recomputes the pairwise bootstrap p-values for each null data set, stores the minimum p-value from each run, and estimates the fwe_level quantile of that null distribution using the Harrell-Davis estimator. Hypotheses are then rejected when their observed bootstrap p-values are less than or equal to the estimated critical p-value. The calibrated H1 procedure from Wilcox, Rousselet, and Pernet (2018) is not currently implemented.

Value

A symmetric correlation matrix with class skipped_corr and attributes method = "skipped_correlation", description, and package = "matrixCorr". When return_masks = TRUE, the matrix also carries a skipped_masks attribute containing compact pairwise skipped-row indices. The diagnostics attribute stores per-pair complete-case counts and skipped-row counts/proportions. When ci = TRUE or p_value = TRUE, bootstrap inference matrices are attached via attributes.

Author(s)

Thiago de Paula Oliveira

References

Wilcox, R. R. (2004). Inferences based on a skipped correlation coefficient. Journal of Applied Statistics, 31(2), 131-143. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1080/0266476032000148821")}

Wilcox, R. R. (2015). Inferences about the skipped correlation coefficient: Dealing with heteroscedasticity and non-normality. Journal of Modern Applied Statistical Methods, 14(1), 172-188. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.22237/jmasm/1430453580")}

Wilcox, R. R., Rousselet, G. A., & Pernet, C. R. (2018). Improved methods for making inferences about multiple skipped correlations. Journal of Statistical Computation and Simulation, 88(16), 3116-3131. \Sexpr[results=rd]{tools:::Rd_expr_doi("10.1080/00949655.2018.1501051")}

See Also

pbcor(), wincor(), bicor()

Examples

set.seed(12)
X <- matrix(rnorm(160 * 4), ncol = 4)
X[1, 1] <- 9
X[1, 2] <- -8

R <- skipped_corr(X, method = "pearson")
print(R, digits = 2)
summary(R)
plot(R)

Rm <- skipped_corr(X, method = "pearson", return_masks = TRUE)
skipped_corr_masks(Rm, 1, 2)

# Example 1:
Xm <- as.matrix(datasets::mtcars[, c("mpg", "disp", "hp", "wt")])
Rm2 <- skipped_corr(Xm, method = "spearman")
print(Rm2, digits = 2)

# Example 2:
Ri <- skipped_corr(Xm, method = "pearson", ci = TRUE, n_boot = 40, seed = 1)
Ri$ci

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(R)
}


matrixCorr documentation built on April 18, 2026, 5:06 p.m.