tetrachoric: Pairwise Tetrachoric Correlation
In matrixCorr: Collection of Correlation and Association Estimators

Description

Computes the tetrachoric correlation for either a pair of binary variables or all pairwise combinations of binary columns in a matrix/data frame.

Usage

tetrachoric(
  data,
  y = NULL,
  na_method = c("error", "pairwise"),
  ci = FALSE,
  p_value = FALSE,
  conf_level = 0.95,
  correct = 0.5,
  output = c("matrix", "sparse", "edge_list"),
  threshold = 0,
  diag = TRUE,
  ...
)

## S3 method for class 'tetrachoric_corr'
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

## S3 method for class 'tetrachoric_corr'
plot(
  x,
  title = "Tetrachoric correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  show_value = TRUE,
  ...
)

## S3 method for class 'tetrachoric_corr'
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  p_digits = 4,
  show_ci = NULL,
  ...
)

## S3 method for class 'summary.tetrachoric_corr'
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Arguments

`data`	A binary vector, matrix, or data frame. In matrix/data-frame mode, only binary columns are retained.
`y`	Optional second binary vector. When supplied, the function returns a single tetrachoric correlation estimate.
`na_method`	Character scalar controlling missing-data handling. `"error"` rejects missing values. `"pairwise"` uses pairwise complete cases.
`ci`	Logical (default `FALSE`). If `TRUE`, attach model-based large-sample Wald confidence intervals derived from the observed information matrix of the latent-variable likelihood.
`p_value`	Logical (default `FALSE`). If `TRUE`, attach model-based large-sample Wald p-values and test statistics for each estimated latent correlation.
`conf_level`	Confidence level used when `ci = TRUE`. Default is `0.95`.
`correct`	Non-negative continuity correction added to zero-count cells. Default is `0.5`.
`output`	Output representation for the computed estimates. `"matrix"` (default): full dense matrix; best when you need matrix algebra, dense heatmaps, or full compatibility with existing code. `"sparse"`: sparse matrix from Matrix containing only retained entries; best when many values are dropped by thresholding. `"edge_list"`: long-form data frame with columns `row`, `col`, `value`; convenient for filtering, joins, and network-style workflows.
`threshold`	Non-negative absolute-value filter for non-matrix outputs: keep entries with `abs(value) >= threshold`. Use `threshold > 0` when you want only stronger associations (typically with `output = "sparse"` or `"edge_list"`). Keep `threshold = 0` to retain all values. Must be `0` when `output = "matrix"`.
`diag`	Logical; whether to include diagonal entries in `"sparse"` and `"edge_list"` outputs.
`...`	Additional arguments passed to `print()`.
`x`	An object of class `tetrachoric_corr` (for the `print()` and `plot()` methods) or `summary.tetrachoric_corr` (for the summary print method).
`digits`	Integer; number of decimal places to print.
`n`	Optional row threshold for compact preview output.
`topn`	Optional number of leading/trailing rows to show when truncated.
`max_vars`	Optional maximum number of visible columns; `NULL` derives this from console width.
`width`	Optional display width; defaults to `getOption("width")`.
`show_ci`	One of `"yes"` or `"no"`; the default is `NULL`.
`title`	Plot title. Default is `"Tetrachoric correlation heatmap"`.
`low_color`	Color for the minimum correlation.
`high_color`	Color for the maximum correlation.
`mid_color`	Color for zero correlation.
`value_text_size`	Font size used in tile labels.
`show_value`	Logical; if `TRUE` (default), overlay numeric values on the heatmap tiles.
`object`	An object of class `tetrachoric_corr`.
`ci_digits`	Integer; digits for confidence limits in the pairwise summary.
`p_digits`	Integer; digits for p-values in the pairwise summary.

Details

The tetrachoric correlation assumes that the observed binary variables arise by dichotomising latent standard-normal variables. Let Z_1, Z_2 \sim N(0, 1) with latent correlation \rho, and define observed binary variables by thresholds \tau_1, \tau_2:

X = \mathbf{1}\{Z_1 > \tau_1\}, \qquad Y = \mathbf{1}\{Z_2 > \tau_2\}.

If the observed 2 \times 2 table has counts n_{ij} for i,j \in \{0,1\}, the marginal proportions determine the thresholds:

\tau_1 = \Phi^{-1}\!\big(P(X = 0)\big), \qquad \tau_2 = \Phi^{-1}\!\big(P(Y = 0)\big).
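As a minimal sketch of the threshold formula above, the following base-R snippet derives \tau_1 and \tau_2 from a 2 \times 2 table. The counts and the names `tab`, `p_x0`, and `p_y0` are hypothetical and serve only as illustration; this is not the package's internal code.

```r
# Hypothetical 2 x 2 table of counts: rows index X in {0, 1}, columns index Y in {0, 1}.
tab <- matrix(c(40, 10,
                20, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(X = c("0", "1"), Y = c("0", "1")))
n_total <- sum(tab)

# Marginal proportions of the zero category.
p_x0 <- sum(tab["0", ]) / n_total  # P(X = 0)
p_y0 <- sum(tab[, "0"]) / n_total  # P(Y = 0)

# Latent thresholds on the standard-normal scale.
tau1 <- stats::qnorm(p_x0)
tau2 <- stats::qnorm(p_y0)
```

Because X = 1 exactly when Z_1 exceeds \tau_1, a variable with more zeros than ones yields a positive threshold, and a balanced variable yields a threshold of zero.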

The estimator returned here is the maximum-likelihood estimate of the latent correlation \rho, obtained by maximizing the multinomial log-likelihood built from the rectangle probabilities of the bivariate normal distribution:

\ell(\rho) = \sum_{i=0}^1 \sum_{j=0}^1 n_{ij}\log \pi_{ij}(\rho;\tau_1,\tau_2),

where \pi_{ij} are the four bivariate-normal cell probabilities implied by \rho and the fixed thresholds. The implementation evaluates the likelihood over \rho \in (-1,1) by a coarse search followed by Brent refinement in C++.
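The likelihood maximisation can be sketched in base R as follows. This is an illustrative re-implementation under stated assumptions, not the package's C++ routine: the helper names `pbvn`, `cell_probs`, and `tetra_mle` are hypothetical, the bivariate-normal rectangle probabilities are obtained by one-dimensional numerical integration, the grid spacing is arbitrary, and `stats::optimize()` (which is based on Brent's method) stands in for the refinement step.

```r
# P(Z1 <= h, Z2 <= k) for standard-normal Z1, Z2 with correlation rho,
# computed by integrating the conditional CDF of Z2 given Z1 = z.
pbvn <- function(h, k, rho) {
  integrand <- function(z) {
    stats::dnorm(z) * stats::pnorm((k - rho * z) / sqrt(1 - rho^2))
  }
  stats::integrate(integrand, lower = -Inf, upper = h, rel.tol = 1e-8)$value
}

# Four cell probabilities; rows index X in {0, 1}, columns index Y in {0, 1}.
cell_probs <- function(rho, tau1, tau2) {
  p00 <- pbvn(tau1, tau2, rho)
  p01 <- stats::pnorm(tau1) - p00  # X = 0, Y = 1
  p10 <- stats::pnorm(tau2) - p00  # X = 1, Y = 0
  p11 <- 1 - p00 - p01 - p10
  matrix(c(p00, p01, p10, p11), nrow = 2, byrow = TRUE)
}

tetra_mle <- function(tab) {
  n_total <- sum(tab)
  tau1 <- stats::qnorm(sum(tab[1, ]) / n_total)  # threshold from P(X = 0)
  tau2 <- stats::qnorm(sum(tab[, 1]) / n_total)  # threshold from P(Y = 0)
  negloglik <- function(rho) {
    probs <- pmax(cell_probs(rho, tau1, tau2), 1e-300)  # guard against log(0)
    -sum(tab * log(probs))
  }
  # Coarse grid search, then Brent-type refinement around the best grid point.
  grid <- seq(-0.95, 0.95, by = 0.05)
  best <- grid[which.min(vapply(grid, negloglik, numeric(1)))]
  stats::optimize(negloglik, interval = c(best - 0.05, best + 0.05))$minimum
}

# Hypothetical counts, used only to exercise the sketch.
tab <- matrix(c(40, 10,
                20, 30), nrow = 2, byrow = TRUE)
rho_hat <- tetra_mle(tab)
```

The thresholds are held fixed at their marginal estimates, so the optimisation is one-dimensional in \rho, as in the description above.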

The argument `correct` adds a continuity correction only to zero-count cells before threshold estimation and likelihood evaluation. This stabilises the estimator for sparse tables and mirrors the conventional `correct = 0.5` continuity-correction behaviour used in several latent-correlation implementations. When `correct = 0` and the observed contingency table contains zero cells, the fit is non-regular and may be boundary-driven. In those cases the returned object stores sparse-fit diagnostics, including whether the fit was classified as `boundary` or `near_boundary`.
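The effect of the correction on a sparse table can be illustrated with a short base-R sketch. The helper `apply_zero_cell_correction` is hypothetical and assumes the simplest reading of the description: cells with a count of zero receive `correct`, and all other cells are left untouched. The package may handle further details differently.

```r
# Hypothetical helper: add `correct` to zero-count cells only.
apply_zero_cell_correction <- function(tab, correct = 0.5) {
  stopifnot(is.numeric(correct), length(correct) == 1L, correct >= 0)
  tab <- as.matrix(tab)
  storage.mode(tab) <- "double"   # allow fractional counts
  tab[tab == 0] <- correct        # zero cells become `correct`; others unchanged
  tab
}

# Hypothetical sparse table with one empty cell.
tab <- matrix(c(25, 0,
                 5, 20), nrow = 2, byrow = TRUE)
tab_corrected <- apply_zero_cell_correction(tab, correct = 0.5)
```

With `correct = 0` the table is returned unchanged, which corresponds to the non-regular, possibly boundary-driven case described above.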

Assumptions. The coefficient is appropriate when both observed binary variables are viewed as thresholded versions of jointly normal latent variables. The optional p-values and confidence intervals adopt this latent-normal interpretation and use the same likelihood that defines the tetrachoric estimate. These inferential quantities are therefore model-based and should not be interpreted as distribution-free summaries.

Inference. When `ci = TRUE` or `p_value = TRUE`, the function refits the pairwise tetrachoric model by maximum likelihood and obtains the observed information matrix numerically in C++. The reported confidence interval is a Wald interval \hat\rho \pm z_{1-\alpha/2}\operatorname{SE}(\hat\rho), and the reported p-value is from the large-sample Wald z-test for H_0:\rho = 0. These inferential quantities are only computed when explicitly requested.
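The final Wald step can be sketched in base R once an estimate and its standard error are available. The helper `wald_inference` and the input values below are hypothetical; the sketch does not reproduce how the package derives the standard error from the observed information matrix, and it makes no assumption about whether the package truncates the interval to [-1, 1].

```r
# Hypothetical helper: Wald interval and two-sided z-test for H0: rho = 0.
wald_inference <- function(rho_hat, se, conf_level = 0.95) {
  z_crit <- stats::qnorm(1 - (1 - conf_level) / 2)  # z_{1 - alpha/2}
  z_stat <- rho_hat / se                            # Wald z statistic
  list(
    statistic = z_stat,
    p_value   = 2 * stats::pnorm(-abs(z_stat)),
    lwr       = rho_hat - z_crit * se,
    upr       = rho_hat + z_crit * se
  )
}

# Illustrative inputs only; these are not results from any data set.
res <- wald_inference(rho_hat = 0.50, se = 0.10, conf_level = 0.95)
```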

In matrix/data-frame mode, all pairwise tetrachoric correlations are computed between binary columns. Diagonal entries are 1 for non-degenerate columns and `NA` for columns with fewer than two observed levels. Variable-specific latent thresholds are stored in the `thresholds` attribute, and pairwise sparse-fit diagnostics are stored in `diagnostics`.

Computational complexity. For p binary variables, the matrix path evaluates p(p-1)/2 pairwise likelihoods. Each pair uses a one-dimensional optimisation with negligible memory overhead beyond the output matrix.
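To give a sense of how the pair count grows, the snippet below evaluates p(p-1)/2 for a few values of p and enumerates the pairs for three variables. The function name `n_pairs` and the variable names are illustrative only.

```r
# Number of distinct variable pairs for p binary variables.
n_pairs <- function(p) p * (p - 1) / 2

pair_counts <- n_pairs(c(3, 10, 100))  # 3, 45, 4950

# Enumerate the pairs explicitly for three hypothetical variables.
pairs <- utils::combn(c("item1", "item2", "item3"), 2)
```

The work therefore grows quadratically in the number of retained binary columns.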

Value

If `y` is supplied, a numeric scalar with attributes `diagnostics` and `thresholds`. Otherwise a symmetric matrix of class `tetrachoric_corr` with attributes `method`, `description`, `package = "matrixCorr"`, `diagnostics`, `thresholds`, and `correct`. When `p_value = TRUE`, the returned object also carries an `inference` attribute with elements `estimate`, `statistic`, `parameter`, `p_value`, and `n_obs`. When `ci = TRUE`, it also carries a `ci` attribute with elements `est`, `lwr.ci`, `upr.ci`, `conf.level`, and `ci.method`, plus `attr(x, "conf.level")`. Scalar outputs keep the same point estimate and gain the same metadata only when inference is requested. In matrix mode, `output = "edge_list"` returns a data frame with columns `row`, `col`, `value`; `output = "sparse"` returns a symmetric sparse matrix.

Author(s)

Thiago de Paula Oliveira

References

Pearson, K. (1900). Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society A, 195, 1-47.

Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44(4), 443-460.

Examples


set.seed(123)
n <- 1000
Sigma <- matrix(c(
  1.00, 0.55, 0.35,
  0.55, 1.00, 0.45,
  0.35, 0.45, 1.00
), 3, 3, byrow = TRUE)

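# Latent trivariate normal scores (requires the mnormt package)
Z <- mnormt::rmnorm(n = n, mean = rep(0, 3), varcov = Sigma)

# Dichotomise each latent score at a different quantile
X <- data.frame(
  item1 = Z[, 1] > stats::qnorm(0.70),
  item2 = Z[, 2] > stats::qnorm(0.60),
  item3 = Z[, 3] > stats::qnorm(0.50)
)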

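# Full dense matrix of pairwise tetrachoric correlations
tc <- tetrachoric(X)
print(tc, digits = 3)
summary(tc)
plot(tc)

# Long-form and thresholded sparse outputs, excluding the diagonal
tetrachoric(X, output = "edge_list", diag = FALSE)
tetrachoric(X, output = "sparse", threshold = 0.4, diag = FALSE)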

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(tc)
}

# latent Pearson correlations used to generate the binary items
round(stats::cor(Z), 2)

matrixCorr documentation built on April 18, 2026, 5:06 p.m.