4. Latent and Mixed-Scale Correlation
In matrixCorr: Collection of Correlation and Association Estimators

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  warning = FALSE,
  message = FALSE
)

Scope

This vignette covers the latent-correlation estimators used when the observed data are binary, ordinal, or mixed. These methods do not target the same quantity as an ordinary Pearson correlation on coded categories. They are designed for settings where the observed variables are treated as thresholded versions of latent continuous variables.

The relevant functions are:

tetrachoric()
polychoric()
polyserial()
biserial()

A small latent-data example

library(matrixCorr)

set.seed(30)
n <- 500
Sigma <- matrix(c(
  1.00, 0.55, 0.35, 0.20,
  0.55, 1.00, 0.40, 0.30,
  0.35, 0.40, 1.00, 0.45,
  0.20, 0.30, 0.45, 1.00
), 4, 4, byrow = TRUE)

Z <- matrix(rnorm(n * 4), n, 4) %*% chol(Sigma)

X_bin <- data.frame(
  b1 = Z[, 1] > qnorm(0.70),
  b2 = Z[, 2] > qnorm(0.55),
  b3 = Z[, 3] > qnorm(0.50)
)

X_ord <- data.frame(
  o1 = ordered(cut(Z[, 2], breaks = c(-Inf, -0.5, 0.4, Inf),
    labels = c("low", "mid", "high")
  )),
  o2 = ordered(cut(Z[, 3], breaks = c(-Inf, -1, 0, 1, Inf),
    labels = c("1", "2", "3", "4")
  ))
)

X_cont <- data.frame(x1 = Z[, 1], x2 = Z[, 4])

Binary-binary and ordinal-ordinal settings

tetrachoric() is used for binary variables. polychoric() is used for ordered categorical variables.

fit_tet <- tetrachoric(X_bin, ci = TRUE, p_value = TRUE)
fit_pol <- polychoric(X_ord, ci = TRUE, p_value = TRUE)

print(fit_tet, digits = 2)
summary(fit_pol)

These estimators assume a latent-normal threshold model. That assumption should be stated whenever the results are reported, because the interpretation is not simply "correlation between coded categories."

It is often useful to compare that latent estimate with a naive Pearson correlation computed on coded categories.

fit_bin_naive <- pearson_corr(data.frame(lapply(X_bin[, 1:2], as.numeric)))
fit_ord_naive <- pearson_corr(data.frame(lapply(X_ord, as.numeric)))

round(c(
  b1_b2_pearson = fit_bin_naive[1, 2],
  b1_b2_tetrachoric = fit_tet[1, 2],
  o1_o2_pearson = fit_ord_naive[1, 2],
  o1_o2_polychoric = fit_pol[1, 2]
), 2)

Those numbers need not agree. The latent estimators target the association between the underlying continuous variables, not the correlation between arbitrarily coded categories.

Mixed continuous-discrete settings

polyserial() is used when one variable is continuous and the other is ordinal. biserial() is used when one variable is continuous and the other is binary.

fit_ps <- polyserial(X_cont, X_ord, ci = TRUE, p_value = TRUE)
fit_bis <- biserial(X_cont, X_bin[, 1:2], ci = TRUE, p_value = TRUE)

summary(fit_ps)
summary(fit_bis)

Confidence intervals and p-values

These functions now follow the same user-facing pattern as the rest of the package:

estimates are returned by default;
confidence intervals are added only when ci = TRUE;
p-values are added only when p_value = TRUE where supported.

The important point is that inference is tied to the fitted latent model rather than to an ordinary Pearson-correlation formula applied to coded categories.