polychoric: Pairwise Polychoric Correlation
In matrixCorr: Collection of Correlation and Association Estimators

polychoric

R Documentation

Pairwise Polychoric Correlation

Description

Computes the polychoric correlation for either a pair of ordinal variables or all pairwise combinations of ordinal columns in a matrix/data frame.

Usage

polychoric(
  data,
  y = NULL,
  na_method = c("error", "pairwise"),
  ci = FALSE,
  p_value = FALSE,
  conf_level = 0.95,
  correct = 0.5,
  output = c("matrix", "sparse", "edge_list"),
  threshold = 0,
  diag = TRUE,
  ...
)

## S3 method for class 'polychoric_corr'
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

## S3 method for class 'polychoric_corr'
plot(
  x,
  title = "Polychoric correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  show_value = TRUE,
  ...
)

## S3 method for class 'polychoric_corr'
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  p_digits = 4,
  show_ci = NULL,
  ...
)

## S3 method for class 'summary.polychoric_corr'
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Arguments

`data`	An ordinal vector, matrix, or data frame. Supported columns are factors, ordered factors, logical values, or integer-like numerics. In matrix/data-frame mode, only supported ordinal columns are retained.
`y`	Optional second ordinal vector. When supplied, the function returns a single polychoric correlation estimate.
`na_method`	Character scalar controlling missing-data handling. `"error"` rejects missing values. `"pairwise"` uses pairwise complete cases.
`ci`	Logical (default `FALSE`). If `TRUE`, attach model-based large-sample Wald confidence intervals derived from the observed information matrix of the latent-variable likelihood.
`p_value`	Logical (default `FALSE`). If `TRUE`, attach model-based large-sample Wald p-values and test statistics for each estimated latent correlation.
`conf_level`	Confidence level used when `ci = TRUE`. Default is `0.95`.
`correct`	Non-negative continuity correction added to zero-count cells. Default is `0.5`.
`output`	Output representation for the computed estimates. `"matrix"` (default): full dense matrix; best when you need matrix algebra, dense heatmaps, or full compatibility with existing code. `"sparse"`: sparse matrix from Matrix containing only retained entries; best when many values are dropped by thresholding. `"edge_list"`: long-form data frame with columns `row`, `col`, `value`; convenient for filtering, joins, and network-style workflows.
`threshold`	Non-negative absolute-value filter for non-matrix outputs: keep entries with `abs(value) >= threshold`. Use `threshold > 0` when you want only stronger associations (typically with `output = "sparse"` or `"edge_list"`). Keep `threshold = 0` to retain all values. Must be `0` when `output = "matrix"`.
`diag`	Logical; whether to include diagonal entries in `"sparse"` and `"edge_list"` outputs.
`...`	Additional arguments passed to `print()`.
`x`	An object of class `summary.polychoric_corr`.
`digits`	Integer; number of decimal places to print.
`n`	Optional row threshold for compact preview output.
`topn`	Optional number of leading/trailing rows to show when truncated.
`max_vars`	Optional maximum number of visible columns; `NULL` derives this from console width.
`width`	Optional display width; defaults to `getOption("width")`.
`show_ci`	One of `"yes"` or `"no"`.
`title`	Plot title. Default is `"Polychoric correlation heatmap"`.
`low_color`	Color for the minimum correlation.
`high_color`	Color for the maximum correlation.
`mid_color`	Color for zero correlation.
`value_text_size`	Font size used in tile labels.
`show_value`	Logical; if `TRUE` (default), overlay numeric values on the heatmap tiles.
`object`	An object of class `polychoric_corr`.
`ci_digits`	Integer; digits for confidence limits in the pairwise summary.
`p_digits`	Integer; digits for p-values in the pairwise summary.

Details

The polychoric correlation generalises the tetrachoric model to ordered categorical variables with more than two levels. It assumes latent standard-normal variables Z_1, Z_2 with correlation \rho, and cut-points -\infty = \alpha_0 < \alpha_1 < \cdots < \alpha_R = \infty and -\infty = \beta_0 < \beta_1 < \cdots < \beta_C = \infty such that

X = r \iff \alpha_{r-1} < Z_1 \le \alpha_r, \qquad Y = c \iff \beta_{c-1} < Z_2 \le \beta_c.

For an observed R \times C contingency table with counts n_{rc}, the thresholds are estimated from the marginal cumulative proportions:

\alpha_r = \Phi^{-1}\!\Big(\sum_{k \le r} P(X = k)\Big), \qquad \beta_c = \Phi^{-1}\!\Big(\sum_{k \le c} P(Y = k)\Big).

Holding those thresholds fixed, the log-likelihood for the latent correlation is

\ell(\rho) = \sum_{r=1}^{R}\sum_{c=1}^{C} n_{rc} \log \Pr\!\big( \alpha_{r-1} < Z_1 \le \alpha_r,\; \beta_{c-1} < Z_2 \le \beta_c \mid \rho \big),

and the estimator returned is the maximiser over \rho \in (-1,1). The C++ implementation performs a dense one-dimensional search followed by Brent refinement.

The argument correct adds a non-negative continuity correction to empty cells before marginal threshold estimation and likelihood evaluation. This avoids numerical failures for sparse tables with structurally zero cells. When correct = 0 and zero cells are present, the corresponding fit can be boundary-driven rather than a regular interior maximum-likelihood problem. The returned object stores sparse-fit diagnostics and the thresholds used for estimation so those cases can be inspected explicitly.

Assumptions. The coefficient is appropriate when both observed ordinal variables are viewed as discretisations of jointly normal latent variables. The optional p-values and confidence intervals adopt this latent-normal interpretation and use the same likelihood that defines the polychoric estimate. These inferential quantities are therefore model-based and should not be interpreted as distribution-free summaries.

Inference. When ci = TRUE or p_value = TRUE, the function refits the pairwise polychoric model by maximum likelihood and obtains the observed information matrix numerically in C++. The reported confidence interval is a Wald interval \hat\rho \pm z_{1-\alpha/2}\operatorname{SE}(\hat\rho), and the reported p-value is from the large-sample Wald z-test for H_0:\rho = 0. These inferential quantities are only computed when explicitly requested.

In matrix/data-frame mode, all pairwise polychoric correlations are computed between supported ordinal columns. Diagonal entries are 1 for non-degenerate columns and NA when a column has fewer than two observed levels.

Computational complexity. For p ordinal variables, the matrix path evaluates p(p-1)/2 bivariate likelihoods. Each pair optimises a single scalar parameter \rho, so the main cost is repeated evaluation of bivariate normal rectangle probabilities.

Value

If y is supplied, a numeric scalar with attributes diagnostics and thresholds. Otherwise a symmetric matrix of class polychoric_corr with attributes method, description, package = "matrixCorr", diagnostics, thresholds, and correct. When p_value = TRUE, the returned object also carries an inference attribute with elements estimate, statistic, parameter, p_value, and n_obs. When ci = TRUE, it also carries a ci attribute with elements est, lwr.ci, upr.ci, conf.level, and ci.method, plus attr(x, "conf.level"). Scalar outputs keep the same point estimate and gain the same metadata only when inference is requested. In matrix mode, output = "edge_list" returns a data frame with columns row, col, value; output = "sparse" returns a symmetric sparse matrix.

Author(s)

Thiago de Paula Oliveira

References

Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44(4), 443-460.

Examples


set.seed(124)
n <- 1200
Sigma <- matrix(c(
  1.00, 0.60, 0.40,
  0.60, 1.00, 0.50,
  0.40, 0.50, 1.00
), 3, 3, byrow = TRUE)

Z <- mnormt::rmnorm(n = n, mean = rep(0, 3), varcov = Sigma)
Y <- data.frame(
  y1 = ordered(cut(
    Z[, 1],
    breaks = c(-Inf, -0.7, 0.4, Inf),
    labels = c("low", "mid", "high")
  )),
  y2 = ordered(cut(
    Z[, 2],
    breaks = c(-Inf, -1.0, -0.1, 0.8, Inf),
    labels = c("1", "2", "3", "4")
  )),
  y3 = ordered(cut(
    Z[, 3],
    breaks = c(-Inf, -0.4, 0.2, 1.1, Inf),
    labels = c("A", "B", "C", "D")
  ))
)

pc <- polychoric(Y)
print(pc, digits = 3)
summary(pc)
plot(pc)
polychoric(Y, output = "edge_list", threshold = 0.3, diag = FALSE)

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(pc)
}

# latent Pearson correlations used to generate the ordinal variables
round(stats::cor(Z), 2)

matrixCorr documentation built on April 18, 2026, 5:06 p.m.