polychoric: Pairwise Polychoric Correlation

View source: R/latent_corr.R

polychoricR Documentation

Pairwise Polychoric Correlation

Description

Computes the polychoric correlation for either a pair of ordinal variables or all pairwise combinations of ordinal columns in a matrix/data frame.

Usage

polychoric(
  data,
  y = NULL,
  na_method = c("error", "pairwise"),
  ci = FALSE,
  p_value = FALSE,
  conf_level = 0.95,
  correct = 0.5,
  output = c("matrix", "sparse", "edge_list"),
  threshold = 0,
  diag = TRUE,
  ...
)

## S3 method for class 'polychoric_corr'
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

## S3 method for class 'polychoric_corr'
plot(
  x,
  title = "Polychoric correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  show_value = TRUE,
  ...
)

## S3 method for class 'polychoric_corr'
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  p_digits = 4,
  show_ci = NULL,
  ...
)

## S3 method for class 'summary.polychoric_corr'
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Arguments

data

An ordinal vector, matrix, or data frame. Supported columns are factors, ordered factors, logical values, or integer-like numerics. In matrix/data-frame mode, only supported ordinal columns are retained.

y

Optional second ordinal vector. When supplied, the function returns a single polychoric correlation estimate.

na_method

Character scalar controlling missing-data handling. "error" rejects missing values. "pairwise" uses pairwise complete cases.

ci

Logical (default FALSE). If TRUE, attach model-based large-sample Wald confidence intervals derived from the observed information matrix of the latent-variable likelihood.

p_value

Logical (default FALSE). If TRUE, attach model-based large-sample Wald p-values and test statistics for each estimated latent correlation.

conf_level

Confidence level used when ci = TRUE. Default is 0.95.

correct

Non-negative continuity correction added to zero-count cells. Default is 0.5.

output

Output representation for the computed estimates.

  • "matrix" (default): full dense matrix; best when you need matrix algebra, dense heatmaps, or full compatibility with existing code.

  • "sparse": sparse matrix from Matrix containing only retained entries; best when many values are dropped by thresholding.

  • "edge_list": long-form data frame with columns row, col, value; convenient for filtering, joins, and network-style workflows.

threshold

Non-negative absolute-value filter for non-matrix outputs: keep entries with abs(value) >= threshold. Use threshold > 0 when you want only stronger associations (typically with output = "sparse" or "edge_list"). Keep threshold = 0 to retain all values. Must be 0 when output = "matrix".

diag

Logical; whether to include diagonal entries in "sparse" and "edge_list" outputs.

...

Additional arguments passed to print().

x

An object of class summary.polychoric_corr.

digits

Integer; number of decimal places to print.

n

Optional row threshold for compact preview output.

topn

Optional number of leading/trailing rows to show when truncated.

max_vars

Optional maximum number of visible columns; NULL derives this from console width.

width

Optional display width; defaults to getOption("width").

show_ci

One of "yes" or "no".

title

Plot title. Default is "Polychoric correlation heatmap".

low_color

Color for the minimum correlation.

high_color

Color for the maximum correlation.

mid_color

Color for zero correlation.

value_text_size

Font size used in tile labels.

show_value

Logical; if TRUE (default), overlay numeric values on the heatmap tiles.

object

An object of class polychoric_corr.

ci_digits

Integer; digits for confidence limits in the pairwise summary.

p_digits

Integer; digits for p-values in the pairwise summary.

Details

The polychoric correlation generalises the tetrachoric model to ordered categorical variables with more than two levels. It assumes latent standard-normal variables Z_1, Z_2 with correlation \rho, and cut-points -\infty = \alpha_0 < \alpha_1 < \cdots < \alpha_R = \infty and -\infty = \beta_0 < \beta_1 < \cdots < \beta_C = \infty such that

X = r \iff \alpha_{r-1} < Z_1 \le \alpha_r, \qquad Y = c \iff \beta_{c-1} < Z_2 \le \beta_c.

For an observed R \times C contingency table with counts n_{rc}, the thresholds are estimated from the marginal cumulative proportions:

\alpha_r = \Phi^{-1}\!\Big(\sum_{k \le r} P(X = k)\Big), \qquad \beta_c = \Phi^{-1}\!\Big(\sum_{k \le c} P(Y = k)\Big).

Holding those thresholds fixed, the log-likelihood for the latent correlation is

\ell(\rho) = \sum_{r=1}^{R}\sum_{c=1}^{C} n_{rc} \log \Pr\!\big( \alpha_{r-1} < Z_1 \le \alpha_r,\; \beta_{c-1} < Z_2 \le \beta_c \mid \rho \big),

and the estimator returned is the maximiser over \rho \in (-1,1). The C++ implementation performs a dense one-dimensional search followed by Brent refinement.

The argument correct adds a non-negative continuity correction to empty cells before marginal threshold estimation and likelihood evaluation. This avoids numerical failures for sparse tables with structurally zero cells. When correct = 0 and zero cells are present, the corresponding fit can be boundary-driven rather than a regular interior maximum-likelihood problem. The returned object stores sparse-fit diagnostics and the thresholds used for estimation so those cases can be inspected explicitly.

Assumptions. The coefficient is appropriate when both observed ordinal variables are viewed as discretisations of jointly normal latent variables. The optional p-values and confidence intervals adopt this latent-normal interpretation and use the same likelihood that defines the polychoric estimate. These inferential quantities are therefore model-based and should not be interpreted as distribution-free summaries.

Inference. When ci = TRUE or p_value = TRUE, the function refits the pairwise polychoric model by maximum likelihood and obtains the observed information matrix numerically in C++. The reported confidence interval is a Wald interval \hat\rho \pm z_{1-\alpha/2}\operatorname{SE}(\hat\rho), and the reported p-value is from the large-sample Wald z-test for H_0:\rho = 0. These inferential quantities are only computed when explicitly requested.

In matrix/data-frame mode, all pairwise polychoric correlations are computed between supported ordinal columns. Diagonal entries are 1 for non-degenerate columns and NA when a column has fewer than two observed levels.

Computational complexity. For p ordinal variables, the matrix path evaluates p(p-1)/2 bivariate likelihoods. Each pair optimises a single scalar parameter \rho, so the main cost is repeated evaluation of bivariate normal rectangle probabilities.

Value

If y is supplied, a numeric scalar with attributes diagnostics and thresholds. Otherwise a symmetric matrix of class polychoric_corr with attributes method, description, package = "matrixCorr", diagnostics, thresholds, and correct. When p_value = TRUE, the returned object also carries an inference attribute with elements estimate, statistic, parameter, p_value, and n_obs. When ci = TRUE, it also carries a ci attribute with elements est, lwr.ci, upr.ci, conf.level, and ci.method, plus attr(x, "conf.level"). Scalar outputs keep the same point estimate and gain the same metadata only when inference is requested. In matrix mode, output = "edge_list" returns a data frame with columns row, col, value; output = "sparse" returns a symmetric sparse matrix.

Author(s)

Thiago de Paula Oliveira

References

Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44(4), 443-460.

Examples


set.seed(124)
n <- 1200
Sigma <- matrix(c(
  1.00, 0.60, 0.40,
  0.60, 1.00, 0.50,
  0.40, 0.50, 1.00
), 3, 3, byrow = TRUE)

Z <- mnormt::rmnorm(n = n, mean = rep(0, 3), varcov = Sigma)
Y <- data.frame(
  y1 = ordered(cut(
    Z[, 1],
    breaks = c(-Inf, -0.7, 0.4, Inf),
    labels = c("low", "mid", "high")
  )),
  y2 = ordered(cut(
    Z[, 2],
    breaks = c(-Inf, -1.0, -0.1, 0.8, Inf),
    labels = c("1", "2", "3", "4")
  )),
  y3 = ordered(cut(
    Z[, 3],
    breaks = c(-Inf, -0.4, 0.2, 1.1, Inf),
    labels = c("A", "B", "C", "D")
  ))
)

pc <- polychoric(Y)
print(pc, digits = 3)
summary(pc)
plot(pc)
polychoric(Y, output = "edge_list", threshold = 0.3, diag = FALSE)

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(pc)
}

# latent Pearson correlations used to generate the ordinal variables
round(stats::cor(Z), 2)


matrixCorr documentation built on April 18, 2026, 5:06 p.m.