wincor: Pairwise Winsorized correlation

wincor    R Documentation

Pairwise Winsorized correlation

Description

Computes all pairwise Winsorized correlation coefficients for the numeric columns of a matrix or data frame using a high-performance 'C++' backend.

This function Winsorizes each margin at proportion tr and then computes ordinary Pearson correlation on the Winsorized values. It is a simple robust alternative to Pearson correlation when the main concern is unusually large or small observations in the marginal distributions.

Usage

wincor(
  data,
  na_method = c("error", "pairwise"),
  ci = FALSE,
  p_value = FALSE,
  conf_level = 0.95,
  n_threads = getOption("matrixCorr.threads", 1L),
  tr = 0.2,
  n_boot = 500L,
  seed = NULL,
  output = c("matrix", "sparse", "edge_list"),
  threshold = 0,
  diag = TRUE
)

## S3 method for class 'wincor'
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

## S3 method for class 'wincor'
plot(
  x,
  title = "Winsorized correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  show_value = TRUE,
  ...
)

## S3 method for class 'wincor'
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  p_digits = 4,
  show_ci = NULL,
  ...
)

## S3 method for class 'summary.wincor'
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Arguments

data

A numeric matrix or a data frame with at least two numeric columns. All non-numeric columns will be excluded.

na_method

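One of "error" (default) or "pairwise". With "pairwise", each correlation is computed on the rows where both columns of the pair are non-missing (see Details).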

ci

Logical (default FALSE). If TRUE, attach percentile bootstrap confidence intervals for each pairwise estimate.

p_value

Logical (default FALSE). If TRUE, attach the method-specific large-sample test statistic and two-sided p-value for each pairwise estimate.

conf_level

Confidence level used when ci = TRUE. Default 0.95.

n_threads

Integer \geq 1. Number of OpenMP threads. Defaults to getOption("matrixCorr.threads", 1L).

tr

Winsorization proportion in [0, 0.5). For a sample of size n, let g = \lfloor tr \cdot n \rfloor; the g smallest observations are set to the (g+1)-st order statistic and the g largest observations are set to the (n-g)-th order statistic. Default 0.2.

n_boot

Integer \geq 1. Number of bootstrap resamples used when ci = TRUE. Default 500.

seed

Optional positive integer used to seed the bootstrap resampling when ci = TRUE. If NULL, the current random-number stream is used.

output

Output representation for the computed estimates.

  • "matrix" (default): full dense matrix; best when you need matrix algebra, dense heatmaps, or full compatibility with existing code.

  • "sparse": sparse matrix from Matrix containing only retained entries; best when many values are dropped by thresholding.

  • "edge_list": long-form data frame with columns row, col, value; convenient for filtering, joins, and network-style workflows.

threshold

Non-negative absolute-value filter for non-matrix outputs: keep entries with abs(value) >= threshold. Use threshold > 0 when you want only stronger associations (typically with output = "sparse" or "edge_list"). Keep threshold = 0 to retain all values. Must be 0 when output = "matrix".

diag

Logical; whether to include diagonal entries in "sparse" and "edge_list" outputs.

x

An object of class wincor (for the print and plot methods) or of class summary.wincor (for the summary print method).

digits

Integer; number of digits to print.

n

Optional row threshold for compact preview output.

topn

Optional number of leading/trailing rows to show when truncated.

max_vars

Optional maximum number of visible columns; NULL derives this from console width.

width

Optional display width; defaults to getOption("width").

show_ci

One of "yes" or "no".

...

Additional arguments passed to the underlying print or plot helper.

title

Character; plot title.

low_color, high_color, mid_color

Colors used in the heatmap.

value_text_size

Numeric text size for overlaid cell values.

show_value

Logical; if TRUE (default), overlay numeric values on the heatmap tiles.

object

An object of class wincor.

ci_digits

Integer; digits used for confidence limits in pairwise summaries.

p_digits

Integer; digits used for p-values in pairwise summaries.
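The output, threshold, and diag arguments can be pictured with a small base R sketch. It is not the package's implementation: the helper name to_edge_list is hypothetical, and keeping only the upper triangle with integer row and col indices is an assumption made for illustration.

```r
# Hypothetical helper (not part of matrixCorr): convert a correlation matrix
# to a long-form edge list, keeping entries with abs(value) >= threshold.
# Assumption: only the upper triangle is listed, with integer indices.
to_edge_list <- function(R, threshold = 0, diag = TRUE) {
  keep <- upper.tri(R, diag = diag) & !is.na(R) & abs(R) >= threshold
  idx <- which(keep, arr.ind = TRUE)   # matrix with columns "row" and "col"
  data.frame(
    row   = unname(idx[, "row"]),
    col   = unname(idx[, "col"]),
    value = R[keep]                    # same column-major order as idx
  )
}

R <- matrix(c( 1.0, 0.80, -0.10,
               0.8, 1.00,  0.35,
              -0.1, 0.35,  1.00), nrow = 3)

# Keep only off-diagonal entries with absolute value of at least 0.3
edges <- to_edge_list(R, threshold = 0.3, diag = FALSE)
edges
```

With threshold = 0.3 the weak pair (1, 3) is dropped, which is the kind of filtering that makes the "sparse" and "edge_list" representations compact.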

Details

Let X \in \mathbb{R}^{n \times p} be a numeric matrix with rows as observations and columns as variables. For a column x = (x_i)_{i=1}^n, write the order statistics as x_{(1)} \le \cdots \le x_{(n)} and let g = \lfloor tr \cdot n \rfloor. The Winsorized values can be written as

x_i^{(w)} \;=\; \max\!\bigl\{x_{(g+1)},\, \min(x_i, x_{(n-g)})\bigr\}.
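As a minimal base R sketch of this definition (the helper name winsorize is hypothetical and is not exported by the package), the clamping can be written with sort(), pmin(), and pmax():

```r
# Hypothetical helper (not part of matrixCorr): Winsorize a numeric vector
# at proportion tr using the order-statistic definition above.
winsorize <- function(x, tr = 0.2) {
  n  <- length(x)
  g  <- floor(tr * n)        # number of observations pulled in per tail
  xs <- sort(x)
  lo <- xs[g + 1]            # (g+1)-st order statistic
  hi <- xs[n - g]            # (n-g)-th order statistic
  pmax(lo, pmin(x, hi))      # clamp every value into [lo, hi]
}

x  <- c(-50, 1, 2, 3, 4, 5, 6, 7, 8, 100)
xw <- winsorize(x, tr = 0.2) # n = 10, so g = 2
xw
# 2 2 2 3 4 5 6 7 7 7
```

Here the two smallest values are raised to x_{(3)} = 2 and the two largest are lowered to x_{(8)} = 7, while the original ordering of the observations is preserved.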

For two columns x and y, the Winsorized correlation is the ordinary Pearson correlation computed from x^{(w)} and y^{(w)}:

r_w(x,y) \;=\; \frac{\sum_{i=1}^n (x_i^{(w)}-\bar x^{(w)})(y_i^{(w)}-\bar y^{(w)})} {\sqrt{\sum_{i=1}^n (x_i^{(w)}-\bar x^{(w)})^2}\; \sqrt{\sum_{i=1}^n (y_i^{(w)}-\bar y^{(w)})^2}}.
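For a single pair, the formula reduces to calling cor() on the Winsorized vectors. The sketch below uses hypothetical helpers (winsorize, wincor_pair) written in base R for illustration; it is not the package's 'C++' code path.

```r
# Hypothetical helpers (not part of matrixCorr), base R only.
winsorize <- function(x, tr = 0.2) {
  n  <- length(x)
  g  <- floor(tr * n)
  xs <- sort(x)
  pmax(xs[g + 1], pmin(x, xs[n - g]))
}

# Winsorized correlation: Pearson correlation of the Winsorized margins
wincor_pair <- function(x, y, tr = 0.2) {
  cor(winsorize(x, tr), winsorize(y, tr))
}

set.seed(1)
x <- rnorm(50)
y <- 0.6 * x + rnorm(50, sd = 0.8)
y[1] <- 40                       # one gross outlier in the y margin

c(pearson = cor(x, y), winsorized = wincor_pair(x, y, tr = 0.2))
```

Setting tr = 0 leaves both vectors unchanged, so wincor_pair(x, y, tr = 0) coincides with cor(x, y).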

In matrix form, let X^{(w)} contain the Winsorized columns and define the centred, unit-norm columns

z_{\cdot j} = \frac{x_{\cdot j}^{(w)} - \bar x_j^{(w)} \mathbf{1}} {\sqrt{\sum_{i=1}^n (x_{ij}^{(w)}-\bar x_j^{(w)})^2}}, \qquad j=1,\ldots,p.

If Z = [z_{\cdot 1}, \ldots, z_{\cdot p}], then the Winsorized correlation matrix is

R_w \;=\; Z^\top Z.
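The matrix form can be checked numerically in base R. This is a sketch under the definitions above, with a hypothetical winsorize helper; it only verifies that Z^\top Z agrees with cor() applied to the Winsorized columns.

```r
# Hypothetical helper (not part of matrixCorr), base R only.
winsorize <- function(x, tr = 0.2) {
  n  <- length(x)
  g  <- floor(tr * n)
  xs <- sort(x)
  pmax(xs[g + 1], pmin(x, xs[n - g]))
}

set.seed(2)
X  <- matrix(rnorm(40 * 3), ncol = 3)

Xw <- apply(X, 2, winsorize, tr = 0.2)         # Winsorize each column
Xc <- sweep(Xw, 2, colMeans(Xw))               # centre each column
Z  <- sweep(Xc, 2, sqrt(colSums(Xc^2)), "/")   # scale to unit Euclidean norm
Rw <- crossprod(Z)                             # Z'Z

all.equal(Rw, cor(Xw))
```

Because every column of Z has unit norm, the diagonal of Rw is 1, and the cross-product form makes symmetry and positive semidefiniteness immediate.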

Winsorization acts on each margin separately, so it guards against marginal outliers and heavy tails but does not target unusual points in the joint cloud. This implementation Winsorizes each column in 'C++', centres and normalises it, and forms the complete-data matrix from cross-products. With na_method = "pairwise", each pair is recomputed on its overlap of non-missing rows. As with Pearson correlation, the complete-data path yields a symmetric positive semidefinite matrix, whereas pairwise deletion can break positive semidefiniteness. If the Winsorized variance of a column is zero, correlations involving that column are returned as NA.
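The pairwise path and the zero-variance rule described above can be sketched in base R as follows. The helper names (winsorize, wincor_pairwise) are hypothetical, and setting the diagonal to 1 is an assumption of this sketch rather than documented behaviour.

```r
# Hypothetical helpers (not part of matrixCorr), base R only.
winsorize <- function(x, tr = 0.2) {
  n  <- length(x)
  g  <- floor(tr * n)
  xs <- sort(x)
  pmax(xs[g + 1], pmin(x, xs[n - g]))
}

wincor_pairwise <- function(X, tr = 0.2) {
  p <- ncol(X)
  R <- diag(1, p)                    # assumption: diagonal fixed at 1
  for (j in seq_len(p - 1)) {
    for (k in (j + 1):p) {
      ok <- !is.na(X[, j]) & !is.na(X[, k])   # overlap of non-missing rows
      r  <- NA_real_
      if (sum(ok) >= 2) {
        xw <- winsorize(X[ok, j], tr)         # Winsorize on the overlap only
        yw <- winsorize(X[ok, k], tr)
        # Zero Winsorized variance leaves the correlation as NA
        if (sd(xw) > 0 && sd(yw) > 0) r <- cor(xw, yw)
      }
      R[j, k] <- R[k, j] <- r
    }
  }
  R
}

set.seed(3)
X <- matrix(rnorm(30 * 3), ncol = 3)
X[c(2, 7), 1]      <- NA
X[c(7, 11, 19), 3] <- NA
R <- wincor_pairwise(X, tr = 0.2)
R
```

Each pair may use a different set of rows, which is why the resulting matrix is symmetric but not guaranteed to be positive semidefinite.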

When p_value = TRUE, inference follows the method-specific test based on

T_{ij} = r_{w,ij}\sqrt{\frac{n_{ij} - 2}{1 - r_{w,ij}^2}},

evaluated against a t-distribution with n_{ij} - 2g_{ij} - 2 degrees of freedom, where g_{ij} = \lfloor tr \cdot n_{ij} \rfloor and n_{ij} is the pairwise complete-case sample size for the corresponding column pair. The two-sided p-value is reported only when the two columns of the pair are not identical and the resulting degrees of freedom are positive.

When ci = TRUE, the interval is a percentile bootstrap interval based on n_{\mathrm{boot}} resamples drawn from the pairwise complete cases. If \tilde r_{w,(1)} \le \cdots \le \tilde r_{w,(B)} denotes the sorted finite bootstrap estimates, with B the number of retained resamples, the reported limits are

\tilde r_{w,(\ell)} \quad \text{and} \quad \tilde r_{w,(u)},

where \ell = \lfloor (\alpha/2) B + 0.5 \rfloor and u = \lfloor (1-\alpha/2) B + 0.5 \rfloor for \alpha = 1 - \mathrm{conf\_level}. Resamples that yield undefined estimates are discarded before the percentile limits are formed.
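Both inferential steps can be sketched for a single complete pair in base R. The helpers below (winsorize, wincor_test, wincor_boot_ci) are hypothetical and only mirror the formulas above; resampling whole rows and clamping the indices to 1..B are assumptions of this sketch. For example, B = 500 and conf_level = 0.95 give \ell = 13 and u = 488.

```r
# Hypothetical helpers (not part of matrixCorr), base R only.
winsorize <- function(x, tr = 0.2) {
  n  <- length(x)
  g  <- floor(tr * n)
  xs <- sort(x)
  pmax(xs[g + 1], pmin(x, xs[n - g]))
}

# Large-sample t-test with n - 2g - 2 degrees of freedom
wincor_test <- function(x, y, tr = 0.2) {
  n    <- length(x)
  g    <- floor(tr * n)
  r    <- cor(winsorize(x, tr), winsorize(y, tr))
  stat <- r * sqrt((n - 2) / (1 - r^2))
  df   <- n - 2 * g - 2
  p    <- if (df > 0) 2 * pt(-abs(stat), df) else NA_real_
  list(estimate = r, statistic = stat, parameter = df, p_value = p)
}

# Percentile bootstrap interval from the finite resampled estimates
wincor_boot_ci <- function(x, y, tr = 0.2, n_boot = 500L, conf_level = 0.95) {
  n <- length(x)
  boot <- replicate(n_boot, {
    idx <- sample.int(n, n, replace = TRUE)   # resample rows (pairs) jointly
    suppressWarnings(cor(winsorize(x[idx], tr), winsorize(y[idx], tr)))
  })
  boot <- sort(boot[is.finite(boot)])         # discard undefined estimates
  B <- length(boot)
  if (B == 0) return(c(lwr = NA_real_, upr = NA_real_))
  alpha <- 1 - conf_level
  l <- max(1, floor(alpha / 2 * B + 0.5))         # clamp is an assumption
  u <- min(B, floor((1 - alpha / 2) * B + 0.5))
  c(lwr = boot[l], upr = boot[u])
}

set.seed(4)
x <- rnorm(40)
y <- 0.5 * x + rnorm(40)
res <- wincor_test(x, y, tr = 0.2)   # n = 40, g = 8, so df = 22
ci  <- wincor_boot_ci(x, y, tr = 0.2, n_boot = 200L)
res$p_value
ci
```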

Computational complexity. In the complete-data path, Winsorizing the columns requires sorting within each column, and forming the cross-product matrix costs O(n p^2) with O(p^2) output storage. When ci = TRUE, the bootstrap cost is incurred separately for each column pair.

Value

A symmetric correlation matrix with class wincor and attributes method = "winsorized_correlation", description, and package = "matrixCorr". When ci = TRUE, the returned object also carries a ci attribute with elements est, lwr.ci, upr.ci, conf.level, and ci.method, plus attr(x, "conf.level"). When p_value = TRUE, it also carries an inference attribute with elements estimate, statistic, parameter, p_value, n_obs, and alternative. When either inferential option is requested, the object also carries diagnostics$n_complete.

Author(s)

Thiago de Paula Oliveira

References

Wilcox, R. R. (1993). Some results on a Winsorized correlation coefficient. British Journal of Mathematical and Statistical Psychology, 46(2), 339-349. doi:10.1111/j.2044-8317.1993.tb01020.x

Wilcox, R. R. (2012). Introduction to Robust Estimation and Hypothesis Testing (3rd ed.). Academic Press.

See Also

pbcor(), skipped_corr(), bicor()

Examples

set.seed(11)
X <- matrix(rnorm(180 * 4), ncol = 4)
# Contaminate six randomly chosen cells with a large negative shift
idx <- sample(length(X), 6)
X[idx] <- X[idx] - 12

R <- wincor(X, tr = 0.2)
print(R, digits = 2)
summary(R)
plot(R)

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(R)
}


matrixCorr documentation built on April 18, 2026, 5:06 p.m.