biserial: Biserial Correlation Between Continuous and Binary Variables

View source: R/latent_corr.R

biserial R Documentation

Description

Computes biserial correlations between continuous variables in data and binary variables in y. Both pairwise vector mode and rectangular matrix/data-frame mode are supported.

Usage

biserial(data, y, na_method = c("error", "pairwise"), ci = FALSE, p_value = FALSE,
  conf_level = 0.95, ...)

## S3 method for class 'biserial_corr'
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

## S3 method for class 'biserial_corr'
plot(
  x,
  title = "Biserial correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  ci_text_size = 3,
  show_value = TRUE,
  ...
)

## S3 method for class 'biserial_corr'
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  p_digits = 4,
  show_ci = NULL,
  ...
)

## S3 method for class 'summary.biserial_corr'
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Arguments

data

A numeric vector, matrix, or data frame containing continuous variables.

y

A binary vector, matrix, or data frame. In data-frame mode, only two-level columns are retained.

na_method

Character scalar controlling missing-data handling. "error" rejects missing values. "pairwise" uses pairwise complete cases.

ci

Logical (default FALSE). If TRUE, attach approximate large-sample confidence intervals derived from a Fisher z-transformation of the biserial estimate.

p_value

Logical (default FALSE). If TRUE, attach model-based large-sample p-values, test statistics, and degrees of freedom for each biserial estimate.

conf_level

Confidence level used when ci = TRUE. Default is 0.95.

...

Additional arguments passed to print().

x

An object of class biserial_corr (for the print and plot methods) or summary.biserial_corr (for the summary print method).

digits

Integer; number of decimal places to print.

n

Optional row threshold for compact preview output.

topn

Optional number of leading/trailing rows to show when truncated.

max_vars

Optional maximum number of visible columns; NULL derives this from console width.

width

Optional display width; defaults to getOption("width").

show_ci

Whether confidence intervals are displayed: one of "yes" or "no". The default is NULL.

title

Plot title. Default is "Biserial correlation heatmap".

low_color

Color for the minimum correlation.

high_color

Color for the maximum correlation.

mid_color

Color for zero correlation.

value_text_size

Font size used in tile labels.

ci_text_size

Text size for confidence intervals in the heatmap.

show_value

Logical; if TRUE (default), overlay numeric values on the heatmap tiles.

object

An object of class biserial_corr.

ci_digits

Integer; digits for biserial confidence limits in the pairwise summary.

p_digits

Integer; digits for biserial p-values in the pairwise summary.

Details

The biserial correlation is the special two-category case of the polyserial model. It assumes that a binary variable Y arises by thresholding an unobserved standard-normal variable Z that is jointly normal with a continuous variable X. Writing p = P(Y = 1) and q = 1-p, let z_p = \Phi^{-1}(p) and \phi(z_p) be the standard-normal density evaluated at z_p. If \bar x_1 and \bar x_0 denote the sample means of X in the two observed groups and s_x is the sample standard deviation of X, the usual biserial estimator is

r_b = \frac{\bar x_1 - \bar x_0}{s_x} \frac{pq}{\phi(z_p)}.

This is exactly the estimator implemented in the underlying C++ kernel.
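
As a rough illustration of the formula above, the following base-R sketch computes r_b by hand on simulated data. The helper name biserial_manual is hypothetical and not part of the package. The sketch assumes complete data, a two-level y, and the n - 1 divisor of stats::sd() for s_x, which may differ from the convention used in the C++ kernel.

```r
# Hypothetical helper (not part of matrixCorr): closed-form biserial estimate.
biserial_manual <- function(x, y) {
  y <- as.logical(y)                 # assumes y is coded FALSE/TRUE or 0/1
  p <- mean(y)                       # observed proportion in the "1" group
  q <- 1 - p
  z_p <- stats::qnorm(p)             # z_p = Phi^{-1}(p)
  mean_diff <- mean(x[y]) - mean(x[!y])
  # (x1_bar - x0_bar) / s_x * p * q / phi(z_p); sd() uses the n - 1 divisor
  (mean_diff / stats::sd(x)) * (p * q) / stats::dnorm(z_p)
}

set.seed(1)
n <- 5000
z <- stats::rnorm(n)                                # latent standard-normal variable
x <- 0.5 * z + sqrt(1 - 0.5^2) * stats::rnorm(n)    # corr(x, z) = 0.5 by construction
y <- z > stats::qnorm(0.65)                         # threshold the latent variable

r_b <- biserial_manual(x, y)
r_b
```

Because y is obtained by thresholding a latent normal variable whose correlation with x is 0.5, the estimate should land close to 0.5 in a sample of this size.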

Assumptions. The biserial coefficient is appropriate when the observed binary variable is viewed as a thresholded version of an unobserved continuous latent variable that is jointly normal with the observed continuous variable. The optional p-values and confidence intervals adopt this latent-normal interpretation together with the usual large-sample approximations used for correlation coefficients. These inferential quantities are therefore model-based and should not be interpreted as distribution-free summaries.

Inference. When p_value = TRUE, the package reports the large-sample t-statistic

t = r_b \sqrt{\frac{n - 2}{1 - r_b^2}},

referenced to a Student t-distribution with n - 2 degrees of freedom. When ci = TRUE, the package forms an approximate Fisher z-interval by transforming r_b with z = \operatorname{atanh}(r_b), using standard error 1 / \sqrt{n - 3}, and mapping the limits back with \tanh(\cdot). The CI is therefore an internal large-sample extension and is only computed when explicitly requested.
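
The two inferential formulas above can be sketched in base R as follows. The input values r_b, n, and conf_level are illustrative placeholders rather than package output, and the two-sided p-value is an assumption, since the text does not state which alternative the package uses.

```r
# Illustrative inputs (placeholders, not results from the package)
r_b <- 0.42
n <- 200
conf_level <- 0.95

# Large-sample t-statistic with n - 2 degrees of freedom
t_stat <- r_b * sqrt((n - 2) / (1 - r_b^2))
df <- n - 2
# Two-sided p-value (assumed alternative)
p_val <- 2 * stats::pt(abs(t_stat), df = df, lower.tail = FALSE)

# Fisher z-interval: transform, build a normal interval, map back with tanh
z <- atanh(r_b)
se <- 1 / sqrt(n - 3)
crit <- stats::qnorm(1 - (1 - conf_level) / 2)
ci <- tanh(z + c(-1, 1) * crit * se)

list(statistic = t_stat, parameter = df, p_value = p_val, ci = ci)
```

Mapping the limits back with tanh() keeps them inside (-1, 1), which is why the interval is built on the transformed scale and is generally not symmetric around r_b.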

In vector mode a single biserial correlation is returned. In matrix/data-frame mode, every numeric column of data is paired with every binary column of y, producing a rectangular matrix of continuous-by-binary biserial correlations.

Unlike the point-biserial correlation, which is just Pearson correlation on a 0/1 coding of the binary variable, the biserial coefficient explicitly assumes an underlying latent normal threshold model and rescales the mean difference accordingly.
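
To make the contrast concrete, the sketch below computes both coefficients on the same simulated data using base R only. It is an illustration under the latent-normal setup described above, not the package's implementation, and all variable names are chosen for this example.

```r
set.seed(2)
n <- 5000
z <- stats::rnorm(n)                                # latent standard-normal variable
x <- 0.5 * z + sqrt(1 - 0.5^2) * stats::rnorm(n)    # corr(x, z) = 0.5 by construction
y <- as.numeric(z > stats::qnorm(0.65))             # observed 0/1 coding

p <- mean(y)
q <- 1 - p

# Point-biserial: plain Pearson correlation on the 0/1 coding
r_pb <- stats::cor(x, y)

# Biserial: mean difference rescaled by p * q / phi(z_p)
r_b <- (mean(x[y == 1]) - mean(x[y == 0])) / stats::sd(x) *
  (p * q) / stats::dnorm(stats::qnorm(p))

# The two differ by sqrt(p * q) / phi(z_p), up to a sqrt((n - 1) / n) factor
# that comes from the n - 1 divisor used by sd() and cor()
c(point_biserial = r_pb, biserial = r_b, ratio = r_b / r_pb)
```

The rescaling factor sqrt(pq) / \phi(z_p) is at least about 1.25, so the biserial coefficient is always larger in magnitude than the point-biserial coefficient computed from the same data.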

Computational complexity. If data has p_x continuous columns and y has p_y binary columns, the matrix path computes p_x p_y closed-form estimates with negligible extra memory beyond the output matrix.
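
The matrix path can be pictured as a double loop over column pairs that fills a preallocated output matrix. The following base-R sketch is only a conceptual stand-in for the C++ kernel. The helper biserial_pair and the simulated data are hypothetical, and complete data with two-level binary columns are assumed.

```r
# Hypothetical helper (not part of matrixCorr): one closed-form estimate
biserial_pair <- function(x, y) {
  y <- as.logical(y)
  p <- mean(y)
  (mean(x[y]) - mean(x[!y])) / stats::sd(x) *
    (p * (1 - p)) / stats::dnorm(stats::qnorm(p))
}

set.seed(3)
n <- 500
X <- data.frame(x1 = stats::rnorm(n), x2 = stats::rnorm(n), x3 = stats::rnorm(n))
Y <- data.frame(g1 = stats::rnorm(n) > 0, g2 = stats::rnorm(n) > 0.5)

# Preallocate the p_x-by-p_y result; no other large objects are created
R <- matrix(NA_real_, nrow = ncol(X), ncol = ncol(Y),
            dimnames = list(names(X), names(Y)))
for (i in seq_along(X)) {
  for (j in seq_along(Y)) {
    R[i, j] <- biserial_pair(X[[i]], Y[[j]])
  }
}
R
```

Rows correspond to the continuous columns and columns to the binary ones, matching the layout described for the returned matrix.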

Value

If both data and y are vectors, a numeric scalar. Otherwise a numeric matrix of class biserial_corr with rows corresponding to the continuous variables in data and columns to the binary variables in y. Matrix outputs carry attributes method, description, and package = "matrixCorr". When p_value = TRUE, the object also carries an inference attribute with matrices estimate, statistic, parameter, p_value, and n_obs. When ci = TRUE, it additionally carries a ci attribute with matrices lwr.ci and upr.ci, plus attr(x, "conf.level"). Scalar outputs keep the same point estimate and gain the same metadata only when inference is requested.

Author(s)

Thiago de Paula Oliveira

References

Olsson, U., Drasgow, F., & Dorans, N. J. (1982). The polyserial correlation coefficient. Psychometrika, 47(3), 337-347.

Fisher, R. A. (1921). On the probable error of a coefficient of correlation deduced from a small sample. Metron, 1, 3-32.

Examples


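library(matrixCorr)

# Simulate four jointly normal variables with a known correlation matrix
set.seed(126)
n <- 1000
Sigma <- matrix(c(
  1.00, 0.35, 0.50, 0.25,
  0.35, 1.00, 0.30, 0.55,
  0.50, 0.30, 1.00, 0.40,
  0.25, 0.55, 0.40, 1.00
), 4, 4, byrow = TRUE)

Z <- mnormt::rmnorm(n = n, mean = rep(0, 4), varcov = Sigma)

# Columns 1-2 stay continuous; columns 3-4 are thresholded into binary variables
X <- data.frame(x1 = Z[, 1], x2 = Z[, 2])
Y <- data.frame(
  g1 = Z[, 3] > stats::qnorm(0.65),
  g2 = Z[, 4] > stats::qnorm(0.55)
)

# Continuous-by-binary biserial matrix with confidence intervals and p-values
bs <- biserial(X, Y, ci = TRUE, p_value = TRUE)
print(bs, digits = 3)
summary(bs)
plot(bs)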


matrixCorr documentation built on April 18, 2026, 5:06 p.m.