biserial: Biserial Correlation Between Continuous and Binary Variables
In matrixCorr: Collection of Correlation and Association Estimators

Description

Computes biserial correlations between continuous variables in data and binary variables in y. Both pairwise vector mode and rectangular matrix/data-frame mode are supported.

Usage

biserial(data, y, na_method = c("error", "pairwise"), ci = FALSE, p_value = FALSE,
  conf_level = 0.95, ...)

## S3 method for class 'biserial_corr'
print(
  x,
  digits = 4,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

## S3 method for class 'biserial_corr'
plot(
  x,
  title = "Biserial correlation heatmap",
  low_color = "indianred1",
  high_color = "steelblue1",
  mid_color = "white",
  value_text_size = 4,
  ci_text_size = 3,
  show_value = TRUE,
  ...
)

## S3 method for class 'biserial_corr'
summary(
  object,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  ci_digits = 3,
  p_digits = 4,
  show_ci = NULL,
  ...
)

## S3 method for class 'summary.biserial_corr'
print(
  x,
  digits = NULL,
  n = NULL,
  topn = NULL,
  max_vars = NULL,
  width = NULL,
  show_ci = NULL,
  ...
)

Arguments

`data`	A numeric vector, matrix, or data frame containing continuous variables.
`y`	A binary vector, matrix, or data frame. In data-frame mode, only two-level columns are retained.
`na_method`	Character scalar controlling missing-data handling. `"error"` rejects missing values. `"pairwise"` uses pairwise complete cases.
`ci`	Logical (default `FALSE`). If `TRUE`, attach approximate large-sample confidence intervals derived from a Fisher `z`-transformation of the biserial estimate.
`p_value`	Logical (default `FALSE`). If `TRUE`, attach model-based large-sample p-values, test statistics, and degrees of freedom for each biserial estimate.
`conf_level`	Confidence level used when `ci = TRUE`. Default is `0.95`.
`...`	Additional arguments passed to `print()`.
`x`	An object of class `biserial_corr` (for the `print()` and `plot()` methods) or of class `summary.biserial_corr` (for printing a summary).
`digits`	Integer; number of decimal places to print.
`n`	Optional row threshold for compact preview output.
`topn`	Optional number of leading/trailing rows to show when truncated.
`max_vars`	Optional maximum number of visible columns; `NULL` derives this from console width.
`width`	Optional display width; defaults to `getOption("width")`.
`show_ci`	One of `"yes"` or `"no"`; the default is `NULL`.
`title`	Plot title. Default is `"Biserial correlation heatmap"`.
`low_color`	Color for the minimum correlation.
`high_color`	Color for the maximum correlation.
`mid_color`	Color for zero correlation.
`value_text_size`	Font size used in tile labels.
`ci_text_size`	Text size for confidence intervals in the heatmap.
`show_value`	Logical; if `TRUE` (default), overlay numeric values on the heatmap tiles.
`object`	An object of class `biserial_corr`.
`ci_digits`	Integer; digits for biserial confidence limits in the pairwise summary.
`p_digits`	Integer; digits for biserial p-values in the pairwise summary.

Details

The biserial correlation is the special two-category case of the polyserial model. It assumes that a binary variable Y arises by thresholding an unobserved standard-normal variable Z that is jointly normal with a continuous variable X. Writing p = P(Y = 1) and q = 1-p, let z_p = \Phi^{-1}(p) and \phi(z_p) be the standard-normal density evaluated at z_p. If \bar x_1 and \bar x_0 denote the sample means of X in the two observed groups and s_x is the sample standard deviation of X, the usual biserial estimator is

r_b = \frac{\bar x_1 - \bar x_0}{s_x} \frac{pq}{\phi(z_p)}.

This is exactly the estimator implemented in the underlying C++ kernel.
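The closed-form estimator can be reproduced in a few lines of base R. The following is a minimal sketch and not the package's C++ kernel: the helper name `biserial_by_hand` is hypothetical, and the simulated data (latent correlation 0.5, threshold at the 65th percentile) are assumptions chosen only for illustration.

```r
# Hypothetical helper, not part of matrixCorr: closed-form biserial estimate.
biserial_by_hand <- function(x, y) {
  y <- as.logical(y)
  p <- mean(y)                          # p = P(Y = 1)
  q <- 1 - p
  z_p <- stats::qnorm(p)                # z_p = Phi^{-1}(p)
  mean_diff <- mean(x[y]) - mean(x[!y]) # xbar_1 - xbar_0
  # s_x is taken as the usual sample standard deviation (n - 1 denominator).
  (mean_diff / stats::sd(x)) * (p * q / stats::dnorm(z_p))
}

# Simulate the latent-normal threshold model (assumed values for illustration).
set.seed(1)
n <- 5000
rho <- 0.5
x <- stats::rnorm(n)
z <- rho * x + sqrt(1 - rho^2) * stats::rnorm(n)  # latent Z with cor(X, Z) = rho
y <- z > stats::qnorm(0.65)                        # observed binary indicator

r_b <- biserial_by_hand(x, y)
r_b
```

Under this simulation the estimate should land near the latent correlation of 0.5, up to sampling error.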

Assumptions. The biserial coefficient is appropriate when the observed binary variable is viewed as a thresholded version of an unobserved continuous latent variable that is jointly normal with the observed continuous variable. The optional p-values and confidence intervals adopt this latent-normal interpretation together with the usual large-sample approximations used for correlation coefficients. These inferential quantities are therefore model-based and should not be interpreted as distribution-free summaries.

Inference. When p_value = TRUE, the package reports the large-sample t-statistic

t = r_b \sqrt{\frac{n - 2}{1 - r_b^2}},

referenced to a Student t-distribution with n - 2 degrees of freedom. When ci = TRUE, the package forms an approximate Fisher z-interval: it transforms r_b with z = \operatorname{atanh}(r_b), uses the standard error 1 / \sqrt{n - 3}, and maps the limits back with \tanh(\cdot). The interval is a large-sample extension supplied by the package itself and is computed only when explicitly requested.
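The two inferential formulas can be sketched in base R as follows. This is an illustration under stated assumptions, not the package's implementation: the helper `biserial_inference` is hypothetical, the p-value is assumed to be two-sided, and the inputs `r_b = 0.4` and `n = 200` are made up. The sketch also assumes |r_b| < 1, since both the t-statistic and atanh are undefined otherwise.

```r
# Hypothetical helper, not part of matrixCorr: large-sample inference for a
# biserial estimate r_b based on n complete observations. Assumes |r_b| < 1.
biserial_inference <- function(r_b, n, conf_level = 0.95) {
  # t-statistic referenced to a Student t with n - 2 degrees of freedom
  t_stat <- r_b * sqrt((n - 2) / (1 - r_b^2))
  df <- n - 2
  # Two-sided p-value (an assumption; the source does not state sidedness)
  p_value <- 2 * stats::pt(abs(t_stat), df = df, lower.tail = FALSE)

  # Fisher z-interval: transform, add normal quantiles, map back with tanh
  z <- atanh(r_b)
  se <- 1 / sqrt(n - 3)
  crit <- stats::qnorm(1 - (1 - conf_level) / 2)

  list(
    statistic = t_stat,
    parameter = df,
    p_value   = p_value,
    lwr.ci    = tanh(z - crit * se),
    upr.ci    = tanh(z + crit * se)
  )
}

res <- biserial_inference(r_b = 0.4, n = 200)
str(res)
```

Because the interval is built on the atanh scale, it is not symmetric around r_b after back-transformation.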

In vector mode a single biserial correlation is returned. In matrix/data-frame mode, every numeric column of data is paired with every binary column of y, producing a rectangular matrix of continuous-by-binary biserial correlations.

Unlike the point-biserial correlation, which is just Pearson correlation on a 0/1 coding of the binary variable, the biserial coefficient explicitly assumes an underlying latent normal threshold model and rescales the mean difference accordingly.
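The contrast between the two coefficients can be checked numerically. The sketch below uses simulated data with assumed parameters and computes both quantities by hand in base R; it does not call the package. With s_x as the sample standard deviation (n - 1 denominator), the two are linked by the factor \sqrt{pq} / \phi(z_p) times \sqrt{(n - 1)/n}, which the last lines verify.

```r
# Simulated data under a latent-normal threshold model (assumed values).
set.seed(2)
n <- 2000
x <- stats::rnorm(n)
z <- 0.6 * x + 0.8 * stats::rnorm(n)       # latent variable, cor(x, z) = 0.6
y <- as.integer(z > stats::qnorm(0.7))     # 0/1 coding of the binary variable

# Point-biserial: plain Pearson correlation on the 0/1 coding.
r_pb <- stats::cor(x, y)

# Biserial: mean difference rescaled by p * q / phi(z_p).
p <- mean(y)
q <- 1 - p
phi_zp <- stats::dnorm(stats::qnorm(p))
r_b <- (mean(x[y == 1]) - mean(x[y == 0])) / stats::sd(x) * p * q / phi_zp

# Algebraic link between the two coefficients.
rescaled <- r_pb * sqrt(p * q) / phi_zp * sqrt((n - 1) / n)

c(point_biserial = r_pb, biserial = r_b, rescaled_point_biserial = rescaled)
```

The rescaling factor exceeds 1, so the biserial estimate is larger in magnitude than the point-biserial correlation computed from the same data.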

Computational complexity. If data has p_x continuous columns and y has p_y binary columns, the matrix path computes p_x p_y closed-form estimates with negligible extra memory beyond the output matrix.
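The rectangular pairing described above can be sketched as a double loop over columns, with one closed-form estimate per cell. This is only an illustration in base R: the helpers `biserial_pair` and `biserial_grid` are hypothetical, the data are simulated, and dropping incomplete rows pair by pair is one plausible reading of `na_method = "pairwise"`, not a description of the package's internals.

```r
# Hypothetical helpers for illustration; these are not matrixCorr functions.
biserial_pair <- function(x, y) {
  keep <- !is.na(x) & !is.na(y)           # pairwise complete cases
  x <- x[keep]
  y <- as.logical(y[keep])
  p <- mean(y)
  mean_diff <- mean(x[y]) - mean(x[!y])
  (mean_diff / stats::sd(x)) * (p * (1 - p) / stats::dnorm(stats::qnorm(p)))
}

biserial_grid <- function(X, Y) {
  # Rows: continuous columns of X; columns: binary columns of Y.
  out <- matrix(NA_real_, nrow = ncol(X), ncol = ncol(Y),
                dimnames = list(names(X), names(Y)))
  for (i in seq_along(X)) {
    for (j in seq_along(Y)) {
      out[i, j] <- biserial_pair(X[[i]], Y[[j]])  # p_x * p_y estimates in total
    }
  }
  out
}

set.seed(3)
n <- 500
latent <- stats::rnorm(n)
X <- data.frame(x1 = latent + stats::rnorm(n), x2 = stats::rnorm(n))
Y <- data.frame(g1 = latent > 0, g2 = stats::rnorm(n) > 0.5)
X$x1[c(5, 17)] <- NA                      # missing values are dropped per pair
Y$g2[c(17, 40)] <- NA

grid <- biserial_grid(X, Y)
grid
```

Each cell depends only on its own pair of columns, so the only storage that grows with the problem size is the output matrix.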

Value

If both data and y are vectors, a numeric scalar. Otherwise a numeric matrix of class biserial_corr with rows corresponding to the continuous variables in data and columns to the binary variables in y. Matrix outputs carry attributes method, description, and package = "matrixCorr". When p_value = TRUE, the object also carries an inference attribute with matrices estimate, statistic, parameter, p_value, and n_obs. When ci = TRUE, it additionally carries a ci attribute with matrices lwr.ci and upr.ci, plus attr(x, "conf.level"). Scalar outputs keep the same point estimate and gain the same metadata only when inference is requested.

Author(s)

Thiago de Paula Oliveira

References

Olsson, U., Drasgow, F., & Dorans, N. J. (1982). The polyserial correlation coefficient. Psychometrika, 47(3), 337-347.

Fisher, R. A. (1921). On the probable error of a coefficient of correlation deduced from a small sample. Metron, 1, 3-32.

Examples


set.seed(126)
n <- 1000
Sigma <- matrix(c(
  1.00, 0.35, 0.50, 0.25,
  0.35, 1.00, 0.30, 0.55,
  0.50, 0.30, 1.00, 0.40,
  0.25, 0.55, 0.40, 1.00
), 4, 4, byrow = TRUE)

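# Draw four correlated standard normals (requires the mnormt package).
Z <- mnormt::rmnorm(n = n, mean = rep(0, 4), varcov = Sigma)
# Columns 1-2 stay continuous; columns 3-4 are thresholded into binary groups.
X <- data.frame(x1 = Z[, 1], x2 = Z[, 2])
Y <- data.frame(
  g1 = Z[, 3] > stats::qnorm(0.65),  # TRUE for about 35% of rows
  g2 = Z[, 4] > stats::qnorm(0.55)   # TRUE for about 45% of rows
)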

bs <- biserial(X, Y, ci = TRUE, p_value = TRUE)
print(bs, digits = 3)
summary(bs)
plot(bs)

matrixCorr documentation built on April 18, 2026, 5:06 p.m.