tetrachoric: Pairwise Tetrachoric Correlation

View source: R/latent_corr.R

tetrachoric    R Documentation

Pairwise Tetrachoric Correlation

Description

Computes the tetrachoric correlation for either a pair of binary variables or all pairwise combinations of binary columns in a matrix/data frame.

Usage

tetrachoric(
  data,
  y = NULL,
  na_method = c("error", "pairwise"),
  ci = FALSE,
  p_value = FALSE,
  conf_level = 0.95,
  correct = 0.5,
  output = c("matrix", "sparse", "edge_list"),
  threshold = 0,
  diag = TRUE,
  ...
)

## S3 method for class 'tetrachoric_corr'
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

## S3 method for class 'tetrachoric_corr'
plot(
  x,
  title = "Tetrachoric correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  show_value = TRUE,
  ...
)

## S3 method for class 'tetrachoric_corr'
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  p_digits = 4,
  show_ci = NULL,
  ...
)

## S3 method for class 'summary.tetrachoric_corr'
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Arguments

data

A binary vector, matrix, or data frame. In matrix/data-frame mode, only binary columns are retained.

y

Optional second binary vector. When supplied, the function returns a single tetrachoric correlation estimate.

na_method

Character scalar controlling missing-data handling. "error" rejects missing values. "pairwise" uses pairwise complete cases.

ci

Logical (default FALSE). If TRUE, attach model-based large-sample Wald confidence intervals derived from the observed information matrix of the latent-variable likelihood.

p_value

Logical (default FALSE). If TRUE, attach model-based large-sample Wald p-values and test statistics for each estimated latent correlation.

conf_level

Confidence level used when ci = TRUE. Default is 0.95.

correct

Non-negative continuity correction added to zero-count cells. Default is 0.5.

output

Output representation for the computed estimates.

  • "matrix" (default): full dense matrix; best when you need matrix algebra, dense heatmaps, or full compatibility with existing code.

  • "sparse": sparse matrix from Matrix containing only retained entries; best when many values are dropped by thresholding.

  • "edge_list": long-form data frame with columns row, col, value; convenient for filtering, joins, and network-style workflows.

threshold

Non-negative absolute-value filter for non-matrix outputs: keep entries with abs(value) >= threshold. Use threshold > 0 when you want only stronger associations (typically with output = "sparse" or "edge_list"). Keep threshold = 0 to retain all values. Must be 0 when output = "matrix".

diag

Logical; whether to include diagonal entries in "sparse" and "edge_list" outputs.

...

Additional arguments passed to print().

x

An object of class tetrachoric_corr (for the print and plot methods) or summary.tetrachoric_corr (for the summary print method).

digits

Integer; number of decimal places to print.

n

Optional row threshold for compact preview output.

topn

Optional number of leading/trailing rows to show when truncated.

max_vars

Optional maximum number of visible columns; NULL derives this from console width.

width

Optional display width; defaults to getOption("width").

show_ci

One of "yes" or "no", indicating whether confidence intervals are shown in the printed output.

title

Plot title. Default is "Tetrachoric correlation heatmap".

low_color

Color for the minimum correlation.

high_color

Color for the maximum correlation.

mid_color

Color for zero correlation.

value_text_size

Font size used in tile labels.

show_value

Logical; if TRUE (default), overlay numeric values on the heatmap tiles.

object

An object of class tetrachoric_corr.

ci_digits

Integer; digits for confidence limits in the pairwise summary.

p_digits

Integer; digits for p-values in the pairwise summary.

Details

The tetrachoric correlation assumes that the observed binary variables arise by dichotomising latent standard-normal variables. Let Z_1, Z_2 \sim N(0, 1) with latent correlation \rho, and define observed binary variables by thresholds \tau_1, \tau_2:

X = \mathbf{1}\{Z_1 > \tau_1\}, \qquad Y = \mathbf{1}\{Z_2 > \tau_2\}.

If the observed 2 \times 2 table has counts n_{ij} for i,j \in \{0,1\}, the marginal proportions determine the thresholds:

\tau_1 = \Phi^{-1}\!\big(P(X = 0)\big), \qquad \tau_2 = \Phi^{-1}\!\big(P(Y = 0)\big).
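
As a small numerical illustration of the threshold formula above, the following base-R sketch computes \tau_1 and \tau_2 from a 2 x 2 table. The table counts are hypothetical and chosen only for illustration; this is not the package's internal code.

```r
# Hypothetical 2 x 2 table of counts; rows index X (0, 1), columns index Y (0, 1).
tab <- matrix(c(40, 10,
                20, 30), nrow = 2, byrow = TRUE,
              dimnames = list(X = c("0", "1"), Y = c("0", "1")))
n <- sum(tab)

# Marginal proportions of the zero category.
p_x0 <- sum(tab["0", ]) / n   # P(X = 0)
p_y0 <- sum(tab[, "0"]) / n   # P(Y = 0)

# Latent thresholds: tau = Phi^{-1}(P(. = 0)).
tau1 <- stats::qnorm(p_x0)
tau2 <- stats::qnorm(p_y0)
c(tau1 = tau1, tau2 = tau2)
```

Because X = 1 exactly when Z_1 exceeds \tau_1, applying stats::pnorm() to each threshold recovers the observed proportion of zeros.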

The estimator returned here is the maximum-likelihood estimate of the latent correlation \rho, obtained by maximizing the multinomial log-likelihood built from the rectangle probabilities of the bivariate normal distribution:

\ell(\rho) = \sum_{i=0}^1 \sum_{j=0}^1 n_{ij}\log \pi_{ij}(\rho;\tau_1,\tau_2),

where \pi_{ij} are the four bivariate-normal cell probabilities implied by \rho and the fixed thresholds. The implementation maximises this likelihood over \rho \in (-1,1) using a coarse search followed by Brent refinement in C++.
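
The likelihood above can be sketched in base R. This is a minimal illustration, not the package's C++ routine: the helper names pbvn and cell_probs, the example table, the grid spacing, and the use of stats::integrate() for the bivariate-normal probability are all assumptions made here. stats::optimize() stands in for the Brent refinement step.

```r
# P(Z1 <= a, Z2 <= b) for a standard bivariate normal with correlation rho,
# obtained by integrating the conditional normal CDF over z (base R only).
pbvn <- function(a, b, rho) {
  integrand <- function(z) {
    stats::dnorm(z) * stats::pnorm((b - rho * z) / sqrt(1 - rho^2))
  }
  stats::integrate(integrand, lower = -Inf, upper = a)$value
}

# Cell probabilities pi_ij for X = 1{Z1 > tau1}, Y = 1{Z2 > tau2};
# rows index X = 0, 1 and columns index Y = 0, 1.
cell_probs <- function(rho, tau1, tau2) {
  p00 <- pbvn(tau1, tau2, rho)
  px0 <- stats::pnorm(tau1)
  py0 <- stats::pnorm(tau2)
  probs <- matrix(c(p00,       px0 - p00,
                    py0 - p00, 1 - px0 - py0 + p00),
                  nrow = 2, byrow = TRUE)
  pmax(probs, .Machine$double.eps)  # guard against log(0) from rounding
}

# Hypothetical table; thresholds are fixed from the margins.
tab <- matrix(c(40, 10,
                20, 30), nrow = 2, byrow = TRUE)
n <- sum(tab)
tau1 <- stats::qnorm(sum(tab[1, ]) / n)
tau2 <- stats::qnorm(sum(tab[, 1]) / n)

# Multinomial log-likelihood in rho.
loglik <- function(rho) sum(tab * log(cell_probs(rho, tau1, tau2)))

# Coarse search on a grid, then one-dimensional refinement near the best point.
grid <- seq(-0.95, 0.95, by = 0.05)
best <- grid[which.max(vapply(grid, loglik, numeric(1)))]
fit <- stats::optimize(loglik,
                       interval = c(max(best - 0.05, -0.999),
                                    min(best + 0.05, 0.999)),
                       maximum = TRUE)
rho_hat <- fit$maximum
rho_hat
```

With the thresholds held fixed, the problem is one-dimensional in \rho, which is why a bracketing optimiser is sufficient.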

The argument correct adds a continuity correction only to zero-count cells before threshold estimation and likelihood evaluation. This stabilises the estimator for sparse tables and mirrors the conventional correct = 0.5 continuity-correction behaviour used in several latent-correlation implementations. When correct = 0 and the observed contingency table contains zero cells, the fit is non-regular and may be boundary-driven. In those cases the returned object stores sparse-fit diagnostics, including whether the fit was classified as boundary or near_boundary.
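
One reading of the zero-cell correction described above is sketched below. The helper apply_correction is hypothetical and the table is made up; the package's internal handling may differ in detail.

```r
# Replace zero-count cells by `correct`; non-zero cells are left untouched.
apply_correction <- function(tab, correct = 0.5) {
  stopifnot(is.numeric(correct), length(correct) == 1, correct >= 0)
  tab <- as.matrix(tab)
  storage.mode(tab) <- "double"
  tab[tab == 0] <- correct
  tab
}

# Hypothetical sparse table with one empty cell.
tab <- matrix(c(25, 0,
                 5, 20), nrow = 2, byrow = TRUE)
adj <- apply_correction(tab)
adj
```

With correct = 0 the table is returned unchanged, which corresponds to the non-regular, possibly boundary-driven case described above.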

Assumptions. The coefficient is appropriate when both observed binary variables are viewed as thresholded versions of jointly normal latent variables. The optional p-values and confidence intervals adopt this latent-normal interpretation and use the same likelihood that defines the tetrachoric estimate. These inferential quantities are therefore model-based and should not be interpreted as distribution-free summaries.

Inference. When ci = TRUE or p_value = TRUE, the function refits the pairwise tetrachoric model by maximum likelihood and obtains the observed information matrix numerically in C++. The reported confidence interval is a Wald interval \hat\rho \pm z_{1-\alpha/2}\operatorname{SE}(\hat\rho), and the reported p-value is from the large-sample Wald z-test for H_0:\rho = 0. These inferential quantities are only computed when explicitly requested.
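
The Wald formulas above reduce to simple arithmetic once an estimate and its standard error are available. The sketch below uses hypothetical values for rho_hat and se purely to show the calculation; it does not reproduce how the package obtains the standard error.

```r
# Hypothetical estimate and standard error, for illustration only.
rho_hat <- 0.60
se <- 0.08
conf_level <- 0.95

# Wald interval: rho_hat +/- z_{1 - alpha/2} * SE.
alpha <- 1 - conf_level
z_crit <- stats::qnorm(1 - alpha / 2)
ci <- rho_hat + c(-1, 1) * z_crit * se

# Two-sided Wald z-test of H0: rho = 0.
z_stat <- rho_hat / se
p_val <- 2 * stats::pnorm(-abs(z_stat))

list(ci = ci, statistic = z_stat, p_value = p_val)
```

The raw Wald formula is not constrained to (-1, 1); whether the package truncates the reported limits is not stated here.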

In matrix/data-frame mode, all pairwise tetrachoric correlations are computed between binary columns. Diagonal entries are 1 for non-degenerate columns and NA for columns with fewer than two observed levels. Variable-specific latent thresholds are stored in the thresholds attribute, and pairwise sparse-fit diagnostics are stored in diagnostics.

Computational complexity. For p binary variables, the matrix path evaluates p(p-1)/2 pairwise likelihoods. Each pair uses a one-dimensional optimisation with negligible memory overhead beyond the output matrix.
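
The pair count above can be checked by enumerating the column pairs directly. This is only an illustration of the p(p-1)/2 count with an arbitrary p, not the package's loop.

```r
p <- 5                          # hypothetical number of binary columns
pairs <- utils::combn(p, 2)     # 2 x choose(p, 2) matrix of column-index pairs
n_pairs <- ncol(pairs)
n_pairs == p * (p - 1) / 2      # TRUE
```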

Value

If y is supplied, a numeric scalar with attributes diagnostics and thresholds. Otherwise, a symmetric matrix of class tetrachoric_corr with attributes method, description, package = "matrixCorr", diagnostics, thresholds, and correct.

When p_value = TRUE, the returned object also carries an inference attribute with elements estimate, statistic, parameter, p_value, and n_obs. When ci = TRUE, it also carries a ci attribute with elements est, lwr.ci, upr.ci, conf.level, and ci.method, plus attr(x, "conf.level"). Requesting inference does not change the point estimate; scalar outputs gain the same metadata, and only when inference is requested.

In matrix mode, output = "edge_list" returns a data frame with columns row, col, value; output = "sparse" returns a symmetric sparse matrix.
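
To make the edge-list layout concrete, the base-R sketch below converts a dense symmetric matrix into a data frame with columns row, col, value, applying an absolute-value threshold and an optional diagonal. The helper to_edge_list is hypothetical, and keeping only the upper triangle is an assumption made here; the package's own output may order or select entries differently.

```r
# Hypothetical dense correlation matrix with named rows and columns.
R <- matrix(c(1.00, 0.55, 0.35,
              0.55, 1.00, 0.45,
              0.35, 0.45, 1.00), 3, 3, byrow = TRUE,
            dimnames = list(paste0("item", 1:3), paste0("item", 1:3)))

# Keep upper-triangle entries with abs(value) >= threshold.
to_edge_list <- function(R, threshold = 0, diag = TRUE) {
  keep <- upper.tri(R, diag = diag) & abs(R) >= threshold
  idx <- which(keep, arr.ind = TRUE)
  data.frame(row = rownames(R)[idx[, "row"]],
             col = colnames(R)[idx[, "col"]],
             value = R[idx],
             stringsAsFactors = FALSE)
}

el <- to_edge_list(R, threshold = 0.4, diag = FALSE)
el
```

The long format is convenient for filtering and joins, at the cost of losing the matrix structure needed for matrix algebra.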

Author(s)

Thiago de Paula Oliveira

References

Pearson, K. (1900). Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society A, 195, 1-47.

Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44(4), 443-460.

Examples


set.seed(123)
n <- 1000
Sigma <- matrix(c(
  1.00, 0.55, 0.35,
  0.55, 1.00, 0.45,
  0.35, 0.45, 1.00
), 3, 3, byrow = TRUE)

Z <- mnormt::rmnorm(n = n, mean = rep(0, 3), varcov = Sigma)
X <- data.frame(
  item1 = Z[, 1] > stats::qnorm(0.70),
  item2 = Z[, 2] > stats::qnorm(0.60),
  item3 = Z[, 3] > stats::qnorm(0.50)
)

tc <- tetrachoric(X)
print(tc, digits = 3)
summary(tc)
plot(tc)
tetrachoric(X, output = "edge_list", diag = FALSE)
tetrachoric(X, output = "sparse", threshold = 0.4, diag = FALSE)

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(tc)
}

# latent Pearson correlations used to generate the binary items
round(stats::cor(Z), 2)


matrixCorr documentation built on April 18, 2026, 5:06 p.m.