| tetrachoric | R Documentation |
Computes the tetrachoric correlation for either a pair of binary variables or all pairwise combinations of binary columns in a matrix/data frame.
tetrachoric(
  data,
  y = NULL,
  na_method = c("error", "pairwise"),
  ci = FALSE,
  p_value = FALSE,
  conf_level = 0.95,
  correct = 0.5,
  output = c("matrix", "sparse", "edge_list"),
  threshold = 0,
  diag = TRUE,
  ...
)
## S3 method for class 'tetrachoric_corr'
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)
## S3 method for class 'tetrachoric_corr'
plot(
  x,
  title = "Tetrachoric correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  show_value = TRUE,
  ...
)
## S3 method for class 'tetrachoric_corr'
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  p_digits = 4,
  show_ci = NULL,
  ...
)
## S3 method for class 'summary.tetrachoric_corr'
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)
data |
A binary vector, matrix, or data frame. In matrix/data-frame mode, only binary columns are retained. |
y |
Optional second binary vector. When supplied, the function returns a single tetrachoric correlation estimate. |
na_method |
Character scalar controlling missing-data handling.
|
ci |
Logical (default |
p_value |
Logical (default |
conf_level |
Confidence level used when |
correct |
Non-negative continuity correction added to zero-count cells.
Default is |
output |
Output representation for the computed estimates.
|
threshold |
Non-negative absolute-value filter for non-matrix outputs:
keep entries with |
diag |
Logical; whether to include diagonal entries in
|
... |
Additional arguments passed to |
x |
An object of class |
digits |
Integer; number of decimal places to print. |
n |
Optional row threshold for compact preview output. |
topn |
Optional number of leading/trailing rows to show when truncated. |
max_vars |
Optional maximum number of visible columns; |
width |
Optional display width; defaults to |
show_ci |
One of |
title |
Plot title. Default is |
low_color |
Color for the minimum correlation. |
high_color |
Color for the maximum correlation. |
mid_color |
Color for zero correlation. |
value_text_size |
Font size used in tile labels. |
show_value |
Logical; if |
object |
An object of class |
ci_digits |
Integer; digits for confidence limits in the pairwise summary. |
p_digits |
Integer; digits for p-values in the pairwise summary. |
The tetrachoric correlation assumes that the observed binary variables arise
by dichotomising latent standard-normal variables. Let
Z_1, Z_2 \sim N(0, 1) with latent correlation \rho, and define
observed binary variables by thresholds \tau_1, \tau_2:
X = \mathbf{1}\{Z_1 > \tau_1\},
\qquad
Y = \mathbf{1}\{Z_2 > \tau_2\}.
If the observed 2 \times 2 table has counts
n_{ij} for i,j \in \{0,1\}, the marginal proportions determine
the thresholds:
\tau_1 = \Phi^{-1}\!\big(P(X = 0)\big),
\qquad
\tau_2 = \Phi^{-1}\!\big(P(Y = 0)\big).
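As a minimal base-R sketch of the threshold formulas above, the following computes \tau_1 and \tau_2 from a 2 \times 2 table. The table `tab` and its counts are hypothetical illustration values, and the code is not taken from the package's C++ implementation.

```r
# Hypothetical 2 x 2 table of counts: rows index X (0, 1), columns index Y (0, 1).
tab <- matrix(c(40, 10,
                20, 30),
              nrow = 2, byrow = TRUE,
              dimnames = list(X = c("0", "1"), Y = c("0", "1")))
n <- sum(tab)

# Marginal proportions of the "0" category.
p_x0 <- sum(tab["0", ]) / n   # P(X = 0)
p_y0 <- sum(tab[, "0"]) / n   # P(Y = 0)

# Thresholds on the latent standard-normal scale.
tau1 <- stats::qnorm(p_x0)
tau2 <- stats::qnorm(p_y0)
c(tau1 = tau1, tau2 = tau2)
```

Because X = 1 exactly when Z_1 exceeds \tau_1, a variable with a rarer "1" category gets a larger threshold.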
The estimator returned here is the maximum-likelihood estimate of the latent
correlation \rho, obtained by maximising the multinomial log-likelihood
built from the rectangle probabilities of the bivariate normal distribution:
\ell(\rho) = \sum_{i=0}^1 \sum_{j=0}^1 n_{ij}\log \pi_{ij}(\rho;\tau_1,\tau_2),
where \pi_{ij} are the four bivariate-normal cell probabilities implied
by \rho and the fixed thresholds. The implementation evaluates the
likelihood over \rho \in (-1,1) by a coarse search followed by Brent
refinement in C++.
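The likelihood and the two-stage search can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's C++ code: the helper names (`bvn_cdf`, `cell_probs`, `loglik`, `fit_tetrachoric_sketch`), the grid spacing, and the use of `stats::integrate()` for the bivariate-normal rectangle probabilities are all choices made for this example. `stats::optimize()` stands in for the Brent refinement step.

```r
# Bivariate standard-normal CDF P(Z1 <= a, Z2 <= b), obtained by integrating
# phi(z) * Phi((b - rho * z) / sqrt(1 - rho^2)) over z in (-Inf, a].
bvn_cdf <- function(a, b, rho) {
  integrand <- function(z) {
    stats::dnorm(z) * stats::pnorm((b - rho * z) / sqrt(1 - rho^2))
  }
  stats::integrate(integrand, lower = -Inf, upper = a, rel.tol = 1e-8)$value
}

# Cell probabilities pi_ij for X = 1{Z1 > tau1}, Y = 1{Z2 > tau2};
# rows index X (0, 1), columns index Y (0, 1).
cell_probs <- function(rho, tau1, tau2) {
  p00 <- bvn_cdf(tau1, tau2, rho)
  px0 <- stats::pnorm(tau1)
  py0 <- stats::pnorm(tau2)
  matrix(c(p00,       px0 - p00,
           py0 - p00, 1 - px0 - py0 + p00),
         nrow = 2, byrow = TRUE)
}

# Multinomial log-likelihood in rho with the thresholds held fixed.
loglik <- function(rho, tab, tau1, tau2) {
  p <- pmax(cell_probs(rho, tau1, tau2), .Machine$double.xmin)
  sum(tab * log(p))
}

# Coarse grid search followed by one-dimensional refinement.
fit_tetrachoric_sketch <- function(tab) {
  n <- sum(tab)
  tau1 <- stats::qnorm(sum(tab[1, ]) / n)   # from P(X = 0)
  tau2 <- stats::qnorm(sum(tab[, 1]) / n)   # from P(Y = 0)
  grid <- seq(-0.95, 0.95, by = 0.05)       # assumed grid, for illustration
  ll <- vapply(grid, loglik, numeric(1), tab = tab, tau1 = tau1, tau2 = tau2)
  best <- grid[which.max(ll)]
  opt <- stats::optimize(
    loglik,
    interval = c(max(best - 0.05, -0.999), min(best + 0.05, 0.999)),
    tab = tab, tau1 = tau1, tau2 = tau2, maximum = TRUE
  )
  list(rho = opt$maximum, loglik = opt$objective, tau1 = tau1, tau2 = tau2)
}

# Hypothetical table of counts.
tab <- matrix(c(40, 10,
                20, 30), nrow = 2, byrow = TRUE)
fit <- fit_tetrachoric_sketch(tab)
fit$rho
```

With the thresholds fixed at their marginal estimates, the fitted \rho is the value at which the model probability of the (0, 0) cell matches the observed proportion, which gives a quick check on the result.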
The argument correct adds a continuity correction only to zero-count
cells before threshold estimation and likelihood evaluation. This stabilises
the estimator for sparse tables and mirrors the conventional
correct = 0.5 continuity-correction behaviour used in several
latent-correlation implementations.
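A minimal sketch of that correction, assuming it simply replaces each zero count with `correct` and leaves every other cell untouched. The helper name `apply_zero_correction` is hypothetical, and the package may handle details differently.

```r
# Add the continuity correction to zero-count cells only.
apply_zero_correction <- function(tab, correct = 0.5) {
  tab <- as.matrix(tab)
  storage.mode(tab) <- "double"   # allow fractional counts
  tab[tab == 0] <- correct        # non-zero cells are left as observed
  tab
}

# Hypothetical sparse table with one empty cell.
tab <- matrix(c(25, 0,
                 5, 20), nrow = 2, byrow = TRUE)
apply_zero_correction(tab)
```

After correction every cell is positive, so the marginal proportions stay strictly inside (0, 1) and the thresholds remain finite.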
When correct = 0 and the observed contingency table contains zero
cells, the fit is non-regular and may be boundary-driven. In those cases the
returned object stores sparse-fit diagnostics, including whether the fit was
classified as boundary or near_boundary.
Assumptions. The coefficient is appropriate when both observed binary variables are viewed as thresholded versions of jointly normal latent variables. The optional p-values and confidence intervals adopt this latent-normal interpretation and use the same likelihood that defines the tetrachoric estimate. These inferential quantities are therefore model-based and should not be interpreted as distribution-free summaries.
Inference. When ci = TRUE or p_value = TRUE, the
function refits the pairwise tetrachoric model by maximum likelihood and
obtains the observed information matrix numerically in C++. The reported
confidence interval is a Wald interval
\hat\rho \pm z_{1-\alpha/2}\operatorname{SE}(\hat\rho), and the
reported p-value is from the large-sample Wald z-test for
H_0:\rho = 0. These inferential quantities are only computed when
explicitly requested.
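The Wald quantities described above reduce to a few lines once an estimate and its standard error are available. The sketch below assumes both are supplied by the caller; the function name `wald_inference` and the example values are hypothetical, and whether the package truncates the interval to [-1, 1] is not stated here, so no truncation is applied.

```r
# Wald confidence interval and two-sided z-test for H0: rho = 0.
wald_inference <- function(estimate, se, conf_level = 0.95) {
  alpha <- 1 - conf_level
  z_crit <- stats::qnorm(1 - alpha / 2)
  z <- estimate / se
  list(
    estimate  = estimate,
    statistic = z,
    p_value   = 2 * stats::pnorm(-abs(z)),
    lwr       = estimate - z_crit * se,
    upr       = estimate + z_crit * se
  )
}

# Hypothetical estimate and standard error.
res <- wald_inference(estimate = 0.5, se = 0.1)
res
```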
In matrix/data-frame mode, all pairwise tetrachoric correlations are computed
between binary columns. Diagonal entries are 1 for non-degenerate
columns and NA for columns with fewer than two observed levels.
Variable-specific latent thresholds are stored in the thresholds
attribute, and pairwise sparse-fit diagnostics are stored in
diagnostics.
Computational complexity. For p binary variables, the matrix
path evaluates p(p-1)/2 pairwise likelihoods. Each pair uses a
one-dimensional optimisation with negligible memory overhead beyond the
output matrix.
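To make the pair count concrete, the following base-R sketch enumerates the unordered pairs for three hypothetical variable names and confirms that their number equals p(p-1)/2.

```r
# Hypothetical variable names.
vars <- c("item1", "item2", "item3")
p <- length(vars)

# One row per unordered pair of variables.
pairs <- t(utils::combn(vars, 2))
colnames(pairs) <- c("row", "col")
pairs

# Number of pairwise likelihood fits in matrix mode.
n_pairs <- p * (p - 1) / 2
```

The count grows quadratically: 10 variables give 45 pairs and 100 variables give 4950.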
If y is supplied, a numeric scalar with attributes
diagnostics and thresholds. Otherwise a symmetric matrix of
class tetrachoric_corr with attributes method,
description, package = "matrixCorr", diagnostics,
thresholds, and correct. When p_value = TRUE, the
returned object also carries an inference attribute with elements
estimate, statistic, parameter, p_value, and
n_obs. When ci = TRUE, it also carries a ci attribute
with elements est, lwr.ci, upr.ci, conf.level,
and ci.method, plus attr(x, "conf.level"). Scalar outputs keep
the same point estimate and gain the same metadata only when inference is
requested. In matrix mode, output = "edge_list" returns a data frame with columns
row, col, value; output = "sparse" returns a
symmetric sparse matrix.
Thiago de Paula Oliveira
Pearson, K. (1900). Mathematical contributions to the theory of evolution. VII. On the correlation of characters not quantitatively measurable. Philosophical Transactions of the Royal Society A, 195, 1-47.
Olsson, U. (1979). Maximum likelihood estimation of the polychoric correlation coefficient. Psychometrika, 44(4), 443-460.
# Simulate three correlated latent normal variables
set.seed(123)
n <- 1000
Sigma <- matrix(c(
  1.00, 0.55, 0.35,
  0.55, 1.00, 0.45,
  0.35, 0.45, 1.00
), 3, 3, byrow = TRUE)
Z <- mnormt::rmnorm(n = n, mean = rep(0, 3), varcov = Sigma)

# Dichotomise each latent variable at a different quantile
X <- data.frame(
  item1 = Z[, 1] > stats::qnorm(0.70),
  item2 = Z[, 2] > stats::qnorm(0.60),
  item3 = Z[, 3] > stats::qnorm(0.50)
)

# Full tetrachoric correlation matrix
tc <- tetrachoric(X)
print(tc, digits = 3)
summary(tc)
plot(tc)

# Alternative output representations
tetrachoric(X, output = "edge_list", diag = FALSE)
tetrachoric(X, output = "sparse", threshold = 0.4, diag = FALSE)

# Interactive viewing (requires shiny)
if (interactive() && requireNamespace("shiny", quietly = TRUE)) {
  view_corr_shiny(tc)
}

# latent Pearson correlations used to generate the binary items
round(stats::cor(Z), 2)