| biserial | R Documentation |
Computes biserial correlations between continuous variables in data
and binary variables in y. Both pairwise vector mode and rectangular
matrix/data-frame mode are supported.
biserial(data, y, na_method = c("error", "pairwise"), ci = FALSE, p_value = FALSE,
conf_level = 0.95, ...)
## S3 method for class 'biserial_corr'
print(
x,
digits = 4,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
show_ci = NULL,
...
)
## S3 method for class 'biserial_corr'
plot(
x,
title = "Biserial correlation heatmap",
low_color = "indianred1",
high_color = "steelblue1",
mid_color = "white",
value_text_size = 4,
ci_text_size = 3,
show_value = TRUE,
...
)
## S3 method for class 'biserial_corr'
summary(
object,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
ci_digits = 3,
p_digits = 4,
show_ci = NULL,
...
)
## S3 method for class 'summary.biserial_corr'
print(
x,
digits = NULL,
n = NULL,
topn = NULL,
max_vars = NULL,
width = NULL,
show_ci = NULL,
...
)
data |
A numeric vector, matrix, or data frame containing continuous variables. |
y |
A binary vector, matrix, or data frame. In data-frame mode, only two-level columns are retained. |
na_method |
Character scalar controlling missing-data handling.
|
ci |
Logical (default |
p_value |
Logical (default |
conf_level |
Confidence level used when |
... |
Additional arguments passed to |
x |
An object of class |
digits |
Integer; number of decimal places to print. |
n |
Optional row threshold for compact preview output. |
topn |
Optional number of leading/trailing rows to show when truncated. |
max_vars |
Optional maximum number of visible columns; |
width |
Optional display width; defaults to |
show_ci |
One of |
title |
Plot title. Default is |
low_color |
Color for the minimum correlation. |
high_color |
Color for the maximum correlation. |
mid_color |
Color for zero correlation. |
value_text_size |
Font size used in tile labels. |
ci_text_size |
Text size for confidence intervals in the heatmap. |
show_value |
Logical; if |
object |
An object of class |
ci_digits |
Integer; digits for biserial confidence limits in the pairwise summary. |
p_digits |
Integer; digits for biserial p-values in the pairwise summary. |
The biserial correlation is the special two-category case of the polyserial
model. It assumes that a binary variable Y arises by thresholding an
unobserved standard-normal variable Z that is jointly normal with a
continuous variable X. Writing p = P(Y = 1) and
q = 1-p, let z_p = \Phi^{-1}(p) and \phi(z_p) be the
standard-normal density evaluated at z_p. If \bar x_1 and
\bar x_0 denote the sample means of X in the two observed groups
and s_x is the sample standard deviation of X, the usual
biserial estimator is
r_b = \frac{\bar x_1 - \bar x_0}{s_x} \cdot \frac{pq}{\phi(z_p)}.
This is exactly the estimator implemented in the underlying C++ kernel.
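The closed form above can be checked by hand in base R. The following is a minimal sketch, not the package's C++ kernel: `biserial_by_hand` is a hypothetical helper, and it assumes that `TRUE` (or 1) marks the group whose mean is \bar x_1 and that incomplete pairs are simply dropped.

```r
# Hypothetical helper for illustration; not part of the package.
biserial_by_hand <- function(x, y) {
  ok <- stats::complete.cases(x, y)  # assumption: drop incomplete pairs
  x <- x[ok]
  y <- as.logical(y[ok])             # assumption: TRUE / 1 is the "Y = 1" group
  p <- mean(y)                       # p = P(Y = 1)
  q <- 1 - p
  z_p <- stats::qnorm(p)             # z_p = Phi^{-1}(p)
  d <- mean(x[y]) - mean(x[!y])      # xbar_1 - xbar_0
  (d / stats::sd(x)) * (p * q) / stats::dnorm(z_p)
}

# Simulate a latent normal z correlated 0.5 with x, then threshold z.
set.seed(1)
n <- 5000
rho <- 0.5
z <- stats::rnorm(n)
x <- rho * z + sqrt(1 - rho^2) * stats::rnorm(n)
y <- z > stats::qnorm(0.65)

r_b <- biserial_by_hand(x, y)
r_b
```

Under this simulation the estimate should land near the latent correlation of 0.5, which is the quantity the biserial coefficient targets.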
Assumptions. The biserial coefficient is appropriate when the observed binary variable is viewed as a thresholded version of an unobserved continuous latent variable that is jointly normal with the observed continuous variable. The optional p-values and confidence intervals adopt this latent-normal interpretation together with the usual large-sample approximations used for correlation coefficients. These inferential quantities are therefore model-based and should not be interpreted as distribution-free summaries.
Inference. When p_value = TRUE, the package reports the
large-sample t-statistic
t = r_b \sqrt{\frac{n - 2}{1 - r_b^2}},
referenced to a Student t-distribution with n - 2 degrees of
freedom. When ci = TRUE, the package forms an approximate Fisher
z-interval by transforming r_b with
z = \operatorname{atanh}(r_b), using standard error
1 / \sqrt{n - 3}, and mapping the limits back with
\tanh(\cdot). The CI is therefore an internal large-sample
extension and is only computed when explicitly requested.
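The two inferential formulas can be sketched in a few lines of base R. This is an illustration under stated assumptions rather than the package's code: `biserial_inference` is a hypothetical helper, the p-value is assumed to be two-sided, and the Fisher interval is assumed to use a standard-normal quantile.

```r
# Hypothetical helper for illustration; not part of the package.
biserial_inference <- function(r_b, n, conf_level = 0.95) {
  # t-statistic referenced to a Student t with n - 2 degrees of freedom
  t_stat <- r_b * sqrt((n - 2) / (1 - r_b^2))
  p_val <- 2 * stats::pt(abs(t_stat), df = n - 2, lower.tail = FALSE)  # assumption: two-sided

  # Fisher z-interval: atanh, standard error 1 / sqrt(n - 3), then tanh back
  z <- atanh(r_b)
  half <- stats::qnorm(1 - (1 - conf_level) / 2) / sqrt(n - 3)

  list(
    statistic = t_stat,
    parameter = n - 2,
    p_value   = p_val,
    lwr.ci    = tanh(z - half),
    upr.ci    = tanh(z + half)
  )
}

res <- biserial_inference(r_b = 0.4, n = 200)
str(res)
```

Both formulas require |r_b| < 1, so a sketch like this returns `NaN` for an estimate outside that range.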
In vector mode a single biserial correlation is returned. In
matrix/data-frame mode, every numeric column of data is paired with every
binary column of y, producing a rectangular matrix of
continuous-by-binary biserial correlations.
Unlike the point-biserial correlation, which is just Pearson correlation on a 0/1 coding of the binary variable, the biserial coefficient explicitly assumes an underlying latent normal threshold model and rescales the mean difference accordingly.
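The rescaling can be made concrete with a small base-R check. The sketch below assumes a 0/1 coding of the binary variable and the sample (n - 1) standard deviation for s_x; under those assumptions the biserial estimate equals the point-biserial correlation multiplied by \sqrt{pq} / \phi(z_p), up to a finite-sample factor \sqrt{(n - 1)/n}.

```r
set.seed(3)
n <- 400
z <- stats::rnorm(n)                          # latent normal variable
x <- 0.6 * z + 0.8 * stats::rnorm(n)          # observed continuous variable
y <- as.numeric(z > stats::qnorm(0.7))        # observed 0/1 variable

p <- mean(y)
q <- 1 - p
phi <- stats::dnorm(stats::qnorm(p))          # phi(z_p)

# Point-biserial: plain Pearson correlation on the 0/1 coding
r_pb <- stats::cor(x, y)

# Biserial: closed-form estimator from the formula in this section
r_b <- (mean(x[y == 1]) - mean(x[y == 0])) / stats::sd(x) * p * q / phi

# Rescaling factor linking the two (sqrt((n - 1) / n) comes from using sd())
scale <- sqrt(p * q) / phi * sqrt((n - 1) / n)
all.equal(r_b, r_pb * scale)
```

Because the factor exceeds 1 here, the biserial estimate is larger in magnitude than the point-biserial one computed from the same data.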
Computational complexity. If data has p_x continuous
columns and y has p_y binary columns, the matrix path computes
p_x p_y closed-form estimates with negligible extra memory beyond the
output matrix.
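As a rough picture of the matrix path, the sketch below pairs every continuous column with every binary column and fills a p_x by p_y matrix with one closed-form estimate per cell. `biserial_pair` and `biserial_grid` are hypothetical helpers written in plain R for illustration; they assume complete data and logical binary columns, and they do not reproduce the package's C++ implementation or its attributes.

```r
# Hypothetical helpers for illustration; not the package's internals.
biserial_pair <- function(x, y) {
  p <- mean(y)                                   # y is assumed logical
  d <- mean(x[y]) - mean(x[!y])
  (d / stats::sd(x)) * p * (1 - p) / stats::dnorm(stats::qnorm(p))
}

biserial_grid <- function(X, Y) {
  # Only the output matrix is allocated: ncol(X) * ncol(Y) cells
  out <- matrix(NA_real_, nrow = ncol(X), ncol = ncol(Y),
                dimnames = list(names(X), names(Y)))
  for (i in seq_along(X)) {
    for (j in seq_along(Y)) {
      out[i, j] <- biserial_pair(X[[i]], Y[[j]])
    }
  }
  out
}

set.seed(2)
n <- 2000
z1 <- stats::rnorm(n)
z2 <- stats::rnorm(n)
X <- data.frame(x1 = z1 + stats::rnorm(n), x2 = z2 + stats::rnorm(n))
Y <- data.frame(g1 = z1 > 0, g2 = z2 > 0.5)

R <- biserial_grid(X, Y)
R
```

In this simulation `x1` is related only to `g1` and `x2` only to `g2`, so the diagonal cells should be clearly positive and the off-diagonal cells close to zero.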
If both data and y are vectors, a numeric scalar. Otherwise a
numeric matrix of class biserial_corr with rows corresponding to
the continuous variables in data and columns to the binary variables
in y. Matrix outputs carry attributes method,
description, and package = "matrixCorr". When
p_value = TRUE, the object also carries an inference
attribute with matrices estimate, statistic,
parameter, p_value, and n_obs. When ci = TRUE,
it additionally carries a ci attribute with matrices
lwr.ci and upr.ci, plus attr(x, "conf.level"). Scalar
outputs keep the same point estimate and gain the same metadata only when
inference is requested.
Thiago de Paula Oliveira
Olsson, U., Drasgow, F., & Dorans, N. J. (1982). The polyserial correlation coefficient. Psychometrika, 47(3), 337-347.
Fisher, R. A. (1921). On the probable error of a coefficient of correlation deduced from a small sample. Metron, 1, 3-32.
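library(matrixCorr)

set.seed(126)
n <- 1000

# Latent correlation matrix: columns 1-2 stay continuous, columns 3-4 are thresholded
Sigma <- matrix(c(
  1.00, 0.35, 0.50, 0.25,
  0.35, 1.00, 0.30, 0.55,
  0.50, 0.30, 1.00, 0.40,
  0.25, 0.55, 0.40, 1.00
), 4, 4, byrow = TRUE)
Z <- mnormt::rmnorm(n = n, mean = rep(0, 4), varcov = Sigma)

# Continuous variables
X <- data.frame(x1 = Z[, 1], x2 = Z[, 2])

# Binary variables obtained by thresholding the latent normals
Y <- data.frame(
  g1 = Z[, 3] > stats::qnorm(0.65),
  g2 = Z[, 4] > stats::qnorm(0.55)
)

# Continuous-by-binary biserial correlations with CIs and p-values
bs <- biserial(X, Y, ci = TRUE, p_value = TRUE)
print(bs, digits = 3)
summary(bs)
plot(bs)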