prepXY: Prepare the 'x' and 'y' inputs
In ummel/fusionModel: Data fusion and analysis of synthetic data in R

Description

Optional but useful function that (1) provides a plausible ordering of the 'y' (fusion) variables and (2) identifies the subset of 'x' (predictor) variables likely to be consequential during subsequent model training. The output can be passed directly to train(). It is most useful for large datasets with many and/or highly correlated predictors. The function first screens predictors by absolute Spearman rank correlation and then fits LASSO models (via glmnet) to select the preferred subset of 'x' variables for each 'y'.
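The correlation screen described above can be illustrated with a minimal base R sketch. It is not the package's internal code: it uses the built-in `mtcars` data as a stand-in for `data`, treats `"mpg"` as a single hypothetical 'y' variable, and ignores observation weights.

```r
# Hypothetical stand-ins: mtcars plays the role of `data`, "mpg" of one 'y' variable
data <- mtcars
y_var <- "mpg"
x_vars <- setdiff(names(data), y_var)
cor_thresh <- 0.05

# Absolute Spearman (rank) correlation of each candidate predictor with y
abs_rho <- sapply(x_vars, function(v) {
  abs(cor(data[[v]], data[[y_var]], method = "spearman"))
})

# Predictors below the threshold are dropped; the rest would move on to the LASSO step
kept <- names(abs_rho)[abs_rho >= cor_thresh]
print(sort(abs_rho[kept], decreasing = TRUE))
```

Because rank correlation is cheap to compute, a screen of this kind can discard clearly irrelevant predictors before any model is fitted.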

Usage

prepXY(
  data,
  y,
  x,
  weight = NULL,
  cor_thresh = 0.05,
  lasso_thresh = 0.95,
  xmax = 100,
  xforce = NULL,
  fraction = 1,
  cores = 1
)

Arguments

`data`	Data frame. Training dataset. All categorical variables should be factors and ordered whenever possible.
`y`	Character or list. Variables in `data` to eventually fuse to a recipient dataset. If `y` is a list, each entry is a character vector possibly indicating multiple variables to fuse as a block.
`x`	Character. Predictor variables in `data` common to donor and eventual recipient.
`weight`	Character. Name of the observation weights column in `data`. If NULL (default), uniform weights are assumed.
`cor_thresh`	Numeric. Predictors whose absolute Spearman (rank) correlation with a `y` variable is below `cor_thresh` are screened out prior to the LASSO step. This is a fast way to exclude predictors that the LASSO step probably does not need to consider.
`lasso_thresh`	Numeric. Controls how aggressively the LASSO step screens out predictors. Lower values screen more aggressively. `lasso_thresh = 0.95`, for example, retains predictors that collectively explain at least 95% of the deviance explained by a "full" model.
`xmax`	Integer. Maximum number of predictors returned by the LASSO step. Does not strictly control the number of final predictors returned (especially for categorical `y` variables), but is useful for setting a (very) soft upper bound. A lower `xmax` can help control computation time if a large number of `x` pass the correlation screen. `xmax = Inf` imposes no restriction.
`xforce`	Character. Subset of `x` variables to "force" as included predictors in the results.
`fraction`	Numeric. Fraction of observations in `data` to randomly sample. For larger datasets, sampling often has minimal effect on results but speeds up computation.
`cores`	Integer. Number of cores used. Only applicable on Unix systems.
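To make the role of `lasso_thresh` more concrete, here is a rough sketch of one way a deviance-based cutoff could be applied to a glmnet path. It is an illustration only, not the package's implementation. The data are simulated, the names `X`, `y_num`, `target` and `kept` are hypothetical, and treating the least-penalized model on the path as the "full" model is an assumption.

```r
library(glmnet)

# Simulated data (hypothetical): only x1 and x2 actually drive the response
set.seed(1)
n <- 200
X <- matrix(rnorm(n * 10), nrow = n, dimnames = list(NULL, paste0("x", 1:10)))
y_num <- 2 * X[, "x1"] - 1.5 * X[, "x2"] + rnorm(n)

lasso_thresh <- 0.95

# Fit the full LASSO regularization path
fit <- glmnet(x = X, y = y_num, family = "gaussian")

# Assumption: the least-penalized model on the path stands in for the "full" model
target <- lasso_thresh * max(fit$dev.ratio)

# Most heavily penalized model that still reaches the target share of deviance explained
i <- which(fit$dev.ratio >= target)[1]

# Predictors with non-zero coefficients in that model are retained
beta_i <- as.matrix(fit$beta)[, i]
kept <- names(beta_i)[beta_i != 0]
print(kept)
```

In this sketch, lowering `lasso_thresh` selects a more heavily penalized model and therefore tends to retain fewer predictors, which matches the "lower is more aggressive" behavior described for the argument.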

Value

List with named slots "y" and "x". Each is a list of the same length. The former gives the preferred fusion order; the latter gives the preferred set of predictor variables for each corresponding entry in "y".
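The shape of the return value can be pictured with a hand-built mock. The variable names below are hypothetical placeholders rather than actual output from prepXY(). The first entry of "y" is a two-variable block.

```r
# Hand-built mock of the return structure (placeholder names, not real output)
prep_mock <- list(
  y = list(c("var_a", "var_b"), "var_c"),        # fusion order; first entry is a block
  x = list(c("x1", "x3"), c("x2", "x3", "x4"))   # predictors paired with each entry of y
)

# The two slots are parallel lists: element i of "x" belongs to element i of "y"
for (i in seq_along(prep_mock$y)) {
  cat("Step", i, ":",
      paste(prep_mock$y[[i]], collapse = " + "),
      "<-",
      paste(prep_mock$x[[i]], collapse = ", "), "\n")
}
```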

Examples

library(fusionModel)  # provides prepXY(), train(), and the example 'recs' data

y <- names(recs)[c(14:16, 20:22)]
x <- names(recs)[2:13]

# Fusion variable "blocks" are respected by prepXY()
y <- c(list(y[1:2]), y[-c(1:2)])

# Do the prep work...
prep <- prepXY(data = recs, y = y, x = x)

# The result can be passed to train()
train(data = recs, y = prep$y, x = prep$x)

ummel/fusionModel documentation built on June 1, 2025, 11 p.m.