rho.bounds: Estimates plausible values of the Pearson's correlation...
In StatMatch: Statistical Matching or Data Fusion

rho.bounds

R Documentation

Estimates plausible values of Pearson's correlation coefficient between two variables observed in distinct samples referring to the same target population.

Description

This function assesses the uncertainty in estimating Pearson's correlation coefficient between y.rec (Y) and z.don (Z) when the two variables are observed in two different samples that share a number of common predictors.

Usage

rho.bounds(data.rec, data.don,
           match.vars, y.rec, z.don,
           w.rec = NULL, w.don = NULL)

Arguments

`data.rec`	data frame including the Xs (predictors, listed in `match.vars`) and `y.rec` (response; target variable in this dataset).
`data.don`	data frame including the Xs (predictors, listed in `match.vars`) and `z.don` (response; target variable in this dataset).
`match.vars`	vector with the names of the X variables to be used, jointly with `y.rec` and `z.don`, in estimating the correlation matrix. If `match.vars` includes one or more factor variables, these are replaced with the corresponding dummies before estimating the correlation matrix.
`y.rec`	character string giving the name of the Y target variable in `data.rec`. It should be a numeric variable.
`z.don`	character string giving the name of the Z target variable in `data.don`. It should be a numeric variable.
`w.rec`	name of the variable with units' weights in `data.rec`, if available (default NULL); the weights, if provided, are used in estimating the bounds.
`w.don`	name of the variable with units' weights in `data.don`, if available (default NULL); the weights, if provided, are used in estimating the bounds.
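
The handling of factor predictors and unit weights described above can be illustrated with base R only. This is a minimal sketch, not the package's internal code: the dummy coding (first level dropped as reference) and the use of `stats::cov.wt()` for weighted correlations are assumptions made for illustration, and the weight vector `w` is hypothetical.

```r
# Illustration only; StatMatch may code dummies and apply weights differently.
X <- iris[, c("Sepal.Length", "Sepal.Width", "Species")]

# Replace the factor with dummy columns and drop the intercept column.
# Assumption: the first factor level is dropped as the reference category.
X.num <- model.matrix(~ ., data = X)[, -1]
colnames(X.num)

# Hypothetical unit weights; equal weights reproduce the unweighted result.
w <- rep(1, nrow(X.num))

# Weighted correlation matrix of the expanded predictors.
R.w <- cov.wt(X.num, wt = w, cor = TRUE)$cor
round(R.w, 3)
```

With equal weights, `R.w` coincides with `cor(X.num)`; unequal survey weights would change the estimated correlations and therefore the bounds.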

Details

This function evaluates the uncertainty in the estimation of Pearson's correlation coefficient between y.rec (Y) and z.don (Z) when the two variables are observed in two different samples that refer to the same target population and share a set of common predictors X (match.vars). Evaluating the uncertainty amounts to estimating the lower and upper bounds of the correlation coefficient between Y and Z, given the available data. The method uses the expressions proposed by Rodgers and DeVol (1982).

The correlations between the X variables common to both samples (match.vars) are estimated after pooling the samples. Factor variables, if present in match.vars, are replaced by the corresponding dummies before estimating the correlations; this approach has a number of critical problems related to the estimation of biserial correlations and to the underlying assumption of a Gaussian distribution.

The correlations between Y and the Xs are estimated on data.rec, while the correlations between Z and the Xs are estimated on data.don. In some cases this can give unreliable estimates because of problems with the samples, usually when they are not representative of the same target population.
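
To make the idea of the bounds concrete, the following base R sketch computes the interval implied by requiring the full correlation matrix of (X, Y, Z) to be positive semidefinite. It is an illustration under stated assumptions, not the package's implementation: the exact expressions of Rodgers and DeVol (1982) as coded in StatMatch may differ, and the object names (`rec`, `don`, `xs`, `b`) are hypothetical.

```r
# Sketch only: admissible range of cor(Y, Z) given cor(X, X), cor(Y, X), cor(Z, X).
set.seed(1)
pos <- sample(seq_len(nrow(iris)), size = 60)
rec <- iris[pos,  c("Sepal.Length", "Sepal.Width", "Petal.Length")]  # Xs and Y
don <- iris[-pos, c("Sepal.Length", "Sepal.Width", "Petal.Width")]   # Xs and Z
xs  <- c("Sepal.Length", "Sepal.Width")

R.xx <- cor(rbind(rec[, xs], don[, xs]))       # X-X, pooled samples
r.yx <- cor(rec[, xs], rec[, "Petal.Length"])  # Y-X, recipient only
r.zx <- cor(don[, xs], don[, "Petal.Width"])   # Z-X, donor only

R.inv <- solve(R.xx)
cia   <- drop(t(r.yx) %*% R.inv %*% r.zx)  # value under conditional independence
r2.y  <- drop(t(r.yx) %*% R.inv %*% r.yx)  # share of Y explained by the Xs
r2.z  <- drop(t(r.zx) %*% R.inv %*% r.zx)  # share of Z explained by the Xs

# max(0, .) guards against values above 1, which can occur because the
# three blocks are estimated on different sets of units.
half <- sqrt(max(0, 1 - r2.y) * max(0, 1 - r2.z))

b <- c(low = cia - half, up = cia + half, cia = cia)
b
```

The interval is centred on the conditional independence value and narrows as the Xs explain more of Y and Z, which is why strong common predictors reduce the uncertainty.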

Value

A vector with three values: the estimated lower bound for Pearson's correlation coefficient between y.rec (Y) and z.don (Z); the estimated upper bound; and the mid-point of the interval, which corresponds to the estimate of Pearson's correlation coefficient under the conditional independence assumption (i.e. the correlation between Y and Z is fully explained by the available X variables match.vars).

Author(s)

Marcello D'Orazio mdo.statmatch@gmail.com

References

D'Orazio, M. (2024). Is Statistical Matching feasible? Note, https://www.researchgate.net/publication/387699016_Is_statistical_matching_feasible.

Rodgers, W.L. and DeVol, E.B. (1982). An evaluation of statistical matching. Report Submitted to the Income Survey Development Program, Dept. of Health and Human Services, Institute for Social Research, University of Michigan.

Examples

set.seed(11335577)
pos <- sample(x = 1:150, size = 60, replace = FALSE)
ir.A <- iris[pos, c(1:3, 5)]
ir.B <- iris[-pos, c(1:2, 4:5)]

intersect(colnames(ir.A), colnames(ir.B)) # shared Xs

# Xs without Species (factor)
out.1 <- rho.bounds(data.rec=ir.A, data.don=ir.B,
                    match.vars=c("Sepal.Length", "Sepal.Width"),
                    y.rec="Petal.Length", z.don="Petal.Width")
out.1

# Xs with Species (factor)
out.2 <- rho.bounds(data.rec=ir.A, data.don=ir.B, 
                    match.vars=c("Sepal.Length", "Sepal.Width", "Species"),
                    y.rec="Petal.Length", z.don="Petal.Width")
out.2

StatMatch documentation built on April 3, 2025, 10:03 p.m.