knitr::opts_chunk$set( collapse = TRUE, comment = "#>", warning = FALSE, message = FALSE )
This vignette covers the latent-correlation estimators used when the observed data are binary, ordinal, or mixed. These methods do not target the same quantity as an ordinary Pearson correlation on coded categories. They are designed for settings where the observed variables are treated as thresholded versions of latent continuous variables.
The relevant functions are:
tetrachoric()
polychoric()
polyserial()
biserial()

library(matrixCorr)

set.seed(30)
n <- 500
Sigma <- matrix(c(
  1.00, 0.55, 0.35, 0.20,
  0.55, 1.00, 0.40, 0.30,
  0.35, 0.40, 1.00, 0.45,
  0.20, 0.30, 0.45, 1.00
), 4, 4, byrow = TRUE)

Z <- matrix(rnorm(n * 4), n, 4) %*% chol(Sigma)

X_bin <- data.frame(
  b1 = Z[, 1] > qnorm(0.70),
  b2 = Z[, 2] > qnorm(0.55),
  b3 = Z[, 3] > qnorm(0.50)
)

X_ord <- data.frame(
  o1 = ordered(cut(Z[, 2],
    breaks = c(-Inf, -0.5, 0.4, Inf),
    labels = c("low", "mid", "high")
  )),
  o2 = ordered(cut(Z[, 3],
    breaks = c(-Inf, -1, 0, 1, Inf),
    labels = c("1", "2", "3", "4")
  ))
)

X_cont <- data.frame(x1 = Z[, 1], x2 = Z[, 4])
tetrachoric() is used for binary variables. polychoric() is used for
ordered categorical variables.
fit_tet <- tetrachoric(X_bin, ci = TRUE, p_value = TRUE)
fit_pol <- polychoric(X_ord, ci = TRUE, p_value = TRUE)

print(fit_tet, digits = 2)
summary(fit_pol)
These estimators assume a latent-normal threshold model. That assumption should be stated whenever the results are reported, because the interpretation is not simply "correlation between coded categories."
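To make the latent-normal threshold model concrete, the following base-R sketch estimates a tetrachoric correlation for one pair of binary variables in two steps: thresholds are taken from the marginal proportions, and the latent correlation is then chosen to maximise the multinomial likelihood of the 2 x 2 table. This is an illustration under stated assumptions, not the matrixCorr implementation; `upper_prob()` and `neg_loglik()` are hypothetical helper names, and the simulated data, thresholds, and `rho` are chosen only for the example.

```r
# Illustrative two-step tetrachoric estimate in base R.
# Not the matrixCorr algorithm; helper names are hypothetical.
set.seed(1)
n   <- 5000
rho <- 0.6                                  # latent correlation used to simulate
z1  <- rnorm(n)
z2  <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)
x   <- z1 > 0.3                             # observed binary variables
y   <- z2 > -0.2

# Step 1: thresholds implied by the marginal proportions of FALSE.
tau_x <- qnorm(mean(!x))
tau_y <- qnorm(mean(!y))

# P(Z1 > a, Z2 > b) for a standard bivariate normal with correlation r,
# computed by integrating the conditional tail probability over Z1.
upper_prob <- function(a, b, r) {
  integrate(
    function(u) dnorm(u) * pnorm((b - r * u) / sqrt(1 - r^2), lower.tail = FALSE),
    lower = a, upper = Inf
  )$value
}

# Step 2: negative multinomial log-likelihood of the 2 x 2 table given r.
tab <- table(x, y)                          # rows: x FALSE/TRUE, cols: y FALSE/TRUE
neg_loglik <- function(r) {
  p11 <- upper_prob(tau_x, tau_y, r)        # x TRUE,  y TRUE
  p10 <- (1 - pnorm(tau_x)) - p11           # x TRUE,  y FALSE
  p01 <- (1 - pnorm(tau_y)) - p11           # x FALSE, y TRUE
  p00 <- 1 - p11 - p10 - p01                # x FALSE, y FALSE
  probs <- matrix(c(p00, p10, p01, p11), 2, 2)  # column-major, matches `tab`
  -sum(tab * log(pmax(probs, 1e-12)))
}

rho_hat <- optimize(neg_loglik, interval = c(-0.99, 0.99))$minimum
rho_hat
```

The estimate is only meaningful to the extent that the bivariate-normal assumption behind `upper_prob()` holds, which is why the assumption belongs in any report of the result.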
It is often useful to compare that latent estimate with a naive Pearson correlation computed on coded categories.
fit_bin_naive <- pearson_corr(data.frame(lapply(X_bin[, 1:2], as.numeric)))
fit_ord_naive <- pearson_corr(data.frame(lapply(X_ord, as.numeric)))

round(c(
  b1_b2_pearson = fit_bin_naive[1, 2],
  b1_b2_tetrachoric = fit_tet[1, 2],
  o1_o2_pearson = fit_ord_naive[1, 2],
  o1_o2_polychoric = fit_pol[1, 2]
), 2)
Those numbers need not agree. The latent estimators target the association between the underlying continuous variables, not the correlation between arbitrarily coded categories.
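The size of the gap can be seen in a small base-R simulation. For two standard normal variables with correlation rho, both dichotomised at their medians, the Pearson correlation of the 0/1 codes has the known population value (2 / pi) * asin(rho), which is smaller in magnitude than rho. The sketch below uses only base R; the sample size, seed, and `rho` are assumptions made for illustration.

```r
# Base-R illustration only; no matrixCorr functions are used here.
set.seed(123)
n   <- 20000
rho <- 0.6                                   # latent correlation for this example

z1 <- rnorm(n)
z2 <- rho * z1 + sqrt(1 - rho^2) * rnorm(n)

# Dichotomise both latent variables at the population median (zero).
b1 <- as.numeric(z1 > 0)
b2 <- as.numeric(z2 > 0)

r_latent   <- cor(z1, z2)                    # correlation of the latent variables
r_coded    <- cor(b1, b2)                    # Pearson correlation of the 0/1 codes
r_expected <- (2 / pi) * asin(rho)           # population value for a median split

round(c(latent = r_latent, coded = r_coded, expected_coded = r_expected), 3)
```

With thresholds away from the median, the attenuation is generally stronger, so the coded-category correlation depends on where the cuts fall as well as on the latent association.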
polyserial() is used when one variable is continuous and the other is
ordinal. biserial() is used when one variable is continuous and the other is
binary.
fit_ps <- polyserial(X_cont, X_ord, ci = TRUE, p_value = TRUE)
fit_bis <- biserial(X_cont, X_bin[, 1:2], ci = TRUE, p_value = TRUE)

summary(fit_ps)
summary(fit_bis)
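For intuition about the continuous-binary case, the classical moment-based biserial correlation can be written as the point-biserial (ordinary Pearson) correlation rescaled by sqrt(p * (1 - p)) / dnorm(qnorm(p)), where p is the proportion of one binary category. The base-R sketch below applies that textbook formula to simulated data. Whether biserial() in the package uses this estimator or a likelihood-based one is not stated here, so treat the code as an illustration of the idea rather than a description of the package internals.

```r
# Classical biserial formula in base R (illustrative; simulated data).
set.seed(42)
n   <- 10000
rho <- 0.5                                    # latent correlation for this example

z <- rnorm(n)                                 # latent variable behind the binary item
y <- rho * z + sqrt(1 - rho^2) * rnorm(n)     # observed continuous variable
d <- z > qnorm(0.70)                          # observed binary item, about 30% TRUE

p     <- mean(d)                              # observed proportion of TRUE
r_pb  <- cor(y, as.numeric(d))                # point-biserial correlation
r_bis <- r_pb * sqrt(p * (1 - p)) / dnorm(qnorm(p))  # rescaled to the latent scale

round(c(point_biserial = r_pb, biserial = r_bis), 3)
```

The rescaling factor is always greater than one, so the biserial value is larger in magnitude than the point-biserial value computed from the same data.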
These functions now follow the same user-facing pattern as the rest of the package:
ci = TRUE;
p_value = TRUE where supported.

The important point is that inference is tied to the fitted latent model rather than to an ordinary Pearson-correlation formula applied to coded categories.
These estimators are appropriate when the scientific question is explicitly about latent association under a threshold model.
tetrachoric() for binary-binary pairs.
polychoric() for ordinal-ordinal pairs.
polyserial() for continuous-ordinal pairs.
biserial() for continuous-binary pairs.

If the variables are nominal rather than ordered, these latent-correlation functions are not the right tools.
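The pairing rules above can be sketched as a small lookup in base R. Both `classify_var()` and `choose_latent_estimator()` are hypothetical helpers written for this illustration and are not part of matrixCorr. The classification rules are assumptions: logical vectors and two-level factors count as binary, ordered factors with more than two levels as ordinal, unordered factors with more than two levels as nominal, and any numeric vector as continuous (so a numeric 0/1 column would need converting first).

```r
# Hypothetical helpers; not part of matrixCorr.
classify_var <- function(v) {
  if (is.logical(v)) return("binary")
  if (is.ordered(v)) return(if (nlevels(v) == 2) "binary" else "ordinal")
  if (is.factor(v))  return(if (nlevels(v) == 2) "binary" else "nominal")
  if (is.numeric(v)) return("continuous")
  "unsupported"
}

# Returns the name of the estimator for a pair, or NA when the pairing
# is not one of the four listed cases (for example, nominal variables).
choose_latent_estimator <- function(x, y) {
  key <- paste(sort(c(classify_var(x), classify_var(y))), collapse = "-")
  switch(key,
    "binary-binary"      = "tetrachoric",
    "ordinal-ordinal"    = "polychoric",
    "continuous-ordinal" = "polyserial",
    "binary-continuous"  = "biserial",
    NA_character_
  )
}

# Small usage example with toy vectors.
b   <- c(TRUE, FALSE, TRUE, FALSE)
o   <- ordered(c("low", "mid", "high", "mid"), levels = c("low", "mid", "high"))
num <- c(0.1, -1.2, 0.8, 2.3)
nom <- factor(c("red", "green", "blue", "red"))

choose_latent_estimator(b, b)
choose_latent_estimator(o, num)
choose_latent_estimator(nom, num)
```

Returning NA for unlisted pairings keeps the helper from silently applying a latent-correlation estimator to nominal data.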