| polychoric | R Documentation |
Computes the polychoric correlation for either a pair of ordinal variables or all pairwise combinations of ordinal columns in a matrix/data frame.
polychoric(
data,
y = NULL,
na_method = c("error", "pairwise"),
ci = FALSE,
p_value = FALSE,
conf_level = 0.95,
correct = 0.5,
output = c("matrix", "sparse", "edge_list"),
threshold = 0,
diag = TRUE,
...
)
## S3 method for class 'polychoric_corr'
print(
x,
digits = 4,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
show_ci = NULL,
...
)
## S3 method for class 'polychoric_corr'
plot(
x,
title = "Polychoric correlation heatmap",
low_color = "indianred1",
high_color = "steelblue1",
mid_color = "white",
value_text_size = 4,
show_value = TRUE,
...
)
## S3 method for class 'polychoric_corr'
summary(
object,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
ci_digits = 3,
p_digits = 4,
show_ci = NULL,
...
)
## S3 method for class 'summary.polychoric_corr'
print(
x,
digits = NULL,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
show_ci = NULL,
...
)
data |
An ordinal vector, matrix, or data frame. Supported columns are factors, ordered factors, logical values, or integer-like numerics. In matrix/data-frame mode, only supported ordinal columns are retained. |
y |
Optional second ordinal vector. When supplied, the function returns a single polychoric correlation estimate. |
na_method |
Character scalar controlling missing-data handling.
|
ci |
Logical (default |
p_value |
Logical (default |
conf_level |
Confidence level used when |
correct |
Non-negative continuity correction added to zero-count cells.
Default is |
output |
Output representation for the computed estimates.
|
threshold |
Non-negative absolute-value filter for non-matrix outputs:
keep entries with |
diag |
Logical; whether to include diagonal entries in
|
... |
Additional arguments passed to |
x |
An object of class |
digits |
Integer; number of decimal places to print. |
n |
Optional row threshold for compact preview output. |
topn |
Optional number of leading/trailing rows to show when truncated. |
max_vars |
Optional maximum number of visible columns; |
width |
Optional display width; defaults to |
show_ci |
One of |
title |
Plot title. Default is |
low_color |
Color for the minimum correlation. |
high_color |
Color for the maximum correlation. |
mid_color |
Color for zero correlation. |
value_text_size |
Font size used in tile labels. |
show_value |
Logical; if |
object |
An object of class |
ci_digits |
Integer; digits for confidence limits in the pairwise summary. |
p_digits |
Integer; digits for p-values in the pairwise summary. |
The polychoric correlation generalises the tetrachoric model to ordered
categorical variables with more than two levels. It assumes latent
standard-normal variables Z_1, Z_2 with correlation \rho, and
cut-points
-\infty = \alpha_0 < \alpha_1 < \cdots < \alpha_R = \infty and
-\infty = \beta_0 < \beta_1 < \cdots < \beta_C = \infty such that
X = r \iff \alpha_{r-1} < Z_1 \le \alpha_r,
\qquad
Y = c \iff \beta_{c-1} < Z_2 \le \beta_c.
For an observed R \times C contingency table with counts n_{rc},
the thresholds are estimated from the marginal cumulative proportions:
\alpha_r = \Phi^{-1}\!\Big(\sum_{k \le r} P(X = k)\Big),
\qquad
\beta_c = \Phi^{-1}\!\Big(\sum_{k \le c} P(Y = k)\Big).
Holding those thresholds fixed, the log-likelihood for the latent correlation is
\ell(\rho) = \sum_{r=1}^{R}\sum_{c=1}^{C}
n_{rc} \log \Pr\!\big(
\alpha_{r-1} < Z_1 \le \alpha_r,\;
\beta_{c-1} < Z_2 \le \beta_c
\mid \rho \big),
and the estimator returned is the maximiser over \rho \in (-1,1).
The C++ implementation performs a dense one-dimensional search followed by
Brent refinement.
The argument correct adds a non-negative continuity correction to
empty cells before marginal threshold estimation and likelihood evaluation.
This avoids numerical failures for sparse tables with structurally zero cells.
When correct = 0 and zero cells are present, the corresponding fit can
be boundary-driven rather than a regular interior maximum-likelihood problem.
The returned object stores sparse-fit diagnostics and the thresholds used for
estimation so those cases can be inspected explicitly.
Assumptions. The coefficient is appropriate when both observed ordinal variables are viewed as discretisations of jointly normal latent variables. The optional p-values and confidence intervals adopt this latent-normal interpretation and use the same likelihood that defines the polychoric estimate. These inferential quantities are therefore model-based and should not be interpreted as distribution-free summaries.
Inference. When ci = TRUE or p_value = TRUE, the
function refits the pairwise polychoric model by maximum likelihood and
obtains the observed information matrix numerically in C++. The reported
confidence interval is a Wald interval
\hat\rho \pm z_{1-\alpha/2}\operatorname{SE}(\hat\rho), and the
reported p-value is from the large-sample Wald z-test for
H_0:\rho = 0. These inferential quantities are only computed when
explicitly requested.
In matrix/data-frame mode, all pairwise polychoric correlations are computed
between supported ordinal columns. Diagonal entries are 1 for
non-degenerate columns and NA when a column has fewer than two
observed levels.
Computational complexity. For p ordinal variables, the matrix
path evaluates p(p-1)/2 bivariate likelihoods. Each pair optimises a
single scalar parameter \rho, so the main cost is repeated evaluation
of bivariate normal rectangle probabilities.
If y is supplied, a numeric scalar with attributes
diagnostics and thresholds. Otherwise a symmetric matrix of
class polychoric_corr with attributes method,
description, package = "matrixCorr", diagnostics,
thresholds, and correct. When p_value = TRUE, the
returned object also carries an inference attribute with elements
estimate, statistic, parameter, p_value, and
n_obs. When ci = TRUE, it also carries a ci attribute
with elements est, lwr.ci, upr.ci, conf.level,
and ci.method, plus attr(x, "conf.level"). Scalar outputs keep
the same point estimate and gain the same metadata only when inference is
requested. In matrix mode, output = "edge_list" returns a data frame with columns
row, col, value; output = "sparse" returns a
symmetric sparse matrix.
Thiago de Paula Oliveira
Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44(4), 443-460.
set.seed(124)
n <- 1200
Sigma <- matrix(c(
1.00, 0.60, 0.40,
0.60, 1.00, 0.50,
0.40, 0.50, 1.00
), 3, 3, byrow = TRUE)
Z <- mnormt::rmnorm(n = n, mean = rep(0, 3), varcov = Sigma)
Y <- data.frame(
y1 = ordered(cut(
Z[, 1],
breaks = c(-Inf, -0.7, 0.4, Inf),
labels = c("low", "mid", "high")
)),
y2 = ordered(cut(
Z[, 2],
breaks = c(-Inf, -1.0, -0.1, 0.8, Inf),
labels = c("1", "2", "3", "4")
)),
y3 = ordered(cut(
Z[, 3],
breaks = c(-Inf, -0.4, 0.2, 1.1, Inf),
labels = c("A", "B", "C", "D")
))
)
pc <- polychoric(Y)
print(pc, digits = 3)
summary(pc)
plot(pc)
polychoric(Y, output = "edge_list", threshold = 0.3, diag = FALSE)
# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
view_corr_shiny(pc)
}
# latent Pearson correlations used to generate the ordinal variables
round(stats::cor(Z), 2)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.