cor_select: Automated Multicollinearity Filtering with Pairwise...
In collinear: Automated Multicollinearity Management

cor_select

R Documentation

Automated Multicollinearity Filtering with Pairwise Correlations

Description

Implements a recursive forward selection algorithm that keeps only the predictors whose maximum pairwise correlation with every other selected predictor is lower than a given threshold. It uses cor_df() underneath and can therefore handle different combinations of predictor types.

Please check the section Pairwise Correlation Filtering at the end of this help file for further details.

Usage

cor_select(
  df = NULL,
  predictors = NULL,
  preference_order = NULL,
  max_cor = 0.75,
  quiet = FALSE
)

Arguments

`df`	(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.
`predictors`	(optional; character vector) Names of the predictors to select from `df`. If omitted, all numeric columns in `df` are used instead. If argument `response` is not provided, non-numeric variables are ignored. Default: NULL.
`preference_order`	(optional; string, character vector, or output of `preference_order()`) Defines a priority order, from first to last, to preserve predictors during the selection process. Accepted inputs are: "auto": if `response` is not NULL, calls `preference_order()` for internal computation; character vector: predictor names in a custom preference order; data frame: output of `preference_order()` from `response` of length one; named list: output of `preference_order()` from `response` of length two or more; NULL: disabled. Default: NULL.
`max_cor`	(optional; numeric) Maximum correlation allowed between any pair of variables in `predictors`. Recommended values are between 0.5 and 0.9. Higher values return a larger number of predictors with higher multicollinearity. If NULL, the pairwise correlation analysis is disabled. Default: `0.75`.
`quiet`	(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE.

Value

character vector if response is NULL or is a string.
named list if response is a character vector.

Pairwise Correlation Filtering

The function cor_select() applies a recursive forward selection algorithm to keep predictors with a maximum Pearson correlation with all other selected predictors lower than max_cor.

If the argument preference_order is NULL, the predictors are ranked from the lowest to the highest sum of absolute pairwise correlations with all other predictors.
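
The ranking described above can be illustrated with base R alone. The sketch below is not the package's internal code: the data frame `toy` and the objects `cor_abs`, `cor_sum`, and `ranking` are hypothetical names, and only numeric predictors with Pearson correlation are assumed.

```r
# Sketch only: rank predictors by their sum of absolute pairwise
# correlations, from lowest to highest. Not the collinear implementation.
set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1), # nearly a copy of a
  c = rnorm(n)                # independent of a and b
)

# absolute Pearson correlation matrix, ignoring self-correlations
cor_abs <- abs(stats::cor(toy, method = "pearson"))
diag(cor_abs) <- 0

# sum of absolute correlations per predictor, lowest first
cor_sum <- sort(rowSums(cor_abs))
ranking <- names(cor_sum)
ranking
```

Here `c` ranks first because it is nearly uncorrelated with the other two columns, so it would be the first predictor considered for preservation.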

If preference_order is defined, whenever the correlation between two or more variables is above max_cor, the one higher in preference_order is preserved. For example, given two predictors a and b with the preference order c("a", "b"), if their correlation is higher than max_cor, then b is removed and a is preserved. If their correlation is lower than max_cor, both are preserved.
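
A minimal sketch of this preference-driven filtering, assuming numeric predictors and Pearson correlation, is shown below. The helper `select_by_cor()` and the data frame `toy` are hypothetical and written for illustration only; cor_select() itself relies on cor_df() and may differ in its details.

```r
# Hypothetical helper, for illustration only; not part of collinear.
# Walks through preference_order and keeps a candidate only when its
# absolute correlation with every already selected predictor is
# lower than max_cor.
select_by_cor <- function(df, preference_order, max_cor = 0.75) {
  cor_abs <- abs(stats::cor(df[, preference_order, drop = FALSE]))
  selected <- preference_order[1]
  for (candidate in preference_order[-1]) {
    if (max(cor_abs[candidate, selected]) < max_cor) {
      selected <- c(selected, candidate)
    }
  }
  selected
}

set.seed(1)
n <- 200
a <- rnorm(n)
toy <- data.frame(
  a = a,
  b = a + rnorm(n, sd = 0.1), # highly correlated with a
  c = rnorm(n)                # independent of a and b
)

# a is preferred over b, so b is removed
select_by_cor(toy, preference_order = c("a", "b", "c"))

# b is preferred over a, so a is removed
select_by_cor(toy, preference_order = c("b", "a", "c"))
```

Swapping the first two entries of the preference order changes which member of the correlated pair survives, while the uncorrelated predictor `c` is kept in both cases.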

Author(s)

Blas M. Benito, PhD

Examples

#subset to limit example run time
df <- vi[1:1000, ]

#numeric predictors only, to speed up examples
#categorical predictors are supported, but result in a slower analysis
predictors <- vi_predictors_numeric[1:8]

#check the class of each predictor
sapply(
  X = df[, predictors, drop = FALSE],
  FUN = class
)

#parallelization setup
future::plan(
  future::multisession,
  workers = 2 #set to parallelly::availableCores() - 1
)

#progress bar
# progressr::handlers(global = TRUE)

#without preference order
x <- cor_select(
  df = df,
  predictors = predictors,
  max_cor = 0.75
)


#with custom preference order
x <- cor_select(
  df = df,
  predictors = predictors,
  preference_order = c(
    "swi_mean",
    "soil_type"
  ),
  max_cor = 0.75
)


#with automated preference order
df_preference <- preference_order(
  df = df,
  response = "vi_numeric",
  predictors = predictors
)

x <- cor_select(
  df = df,
  predictors = predictors,
  preference_order = df_preference,
  max_cor = 0.75
)

#resetting to sequential processing
future::plan(future::sequential)
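
A selection can be sanity-checked by computing the largest absolute correlation among the retained columns. The helper `max_pairwise_cor()` below is hypothetical (not part of collinear) and assumes that all selected columns are numeric; it runs on simulated data so that it does not depend on the example objects above.

```r
# Hypothetical helper, not part of collinear: largest absolute Pearson
# correlation between any two of the given columns.
max_pairwise_cor <- function(df, columns) {
  if (length(columns) < 2) {
    return(0)
  }
  cor_abs <- abs(stats::cor(df[, columns, drop = FALSE]))
  max(cor_abs[upper.tri(cor_abs)])
}

set.seed(1)
toy <- data.frame(u = rnorm(100), v = rnorm(100))
toy$w <- toy$u + rnorm(100, sd = 0.1) # nearly a copy of u

max_pairwise_cor(toy, c("u", "v")) # independent columns: low value
max_pairwise_cor(toy, c("u", "w")) # near-duplicate columns: high value
```

Under the same assumptions, calling `max_pairwise_cor(df, x)` on a numeric selection returned by cor_select() would be expected to give a value below the max_cor used for that selection.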

collinear documentation built on April 12, 2025, 1:36 a.m.