preference_order: Quantitative Variable Prioritization for Multicollinearity...

View source: R/preference_order.R

preference_order R Documentation

Quantitative Variable Prioritization for Multicollinearity Filtering

Description

Ranks a set of predictors by the strength of their association with a response. The ranking helps minimize the loss of important predictors during multicollinearity filtering.

The strength of association between the response and each predictor is computed by the function f. The f functions available are:

  • Numeric response vs numeric predictor:

    • f_r2_pearson(): Pearson's R-squared.

    • f_r2_spearman(): Spearman's R-squared.

    • f_r2_glm_gaussian(): Pearson's R-squared of response versus the predictions of a Gaussian GLM.

    • f_r2_glm_gaussian_poly2(): Gaussian GLM with second degree polynomial.

    • f_r2_gam_gaussian(): GAM model fitted with mgcv::gam().

    • f_r2_rpart(): Recursive Partition Tree fitted with rpart::rpart().

    • f_r2_rf(): Random Forest model fitted with ranger::ranger().

  • Integer counts response vs. numeric predictor:

    • f_r2_glm_poisson(): Pearson's R-squared of a Poisson GLM.

    • f_r2_glm_poisson_poly2(): Poisson GLM with second degree polynomial.

    • f_r2_gam_poisson(): Poisson GAM.

  • Binomial response (1 and 0) vs. numeric predictor:

    • f_auc_glm_binomial(): AUC of quasibinomial GLM with weighted cases.

    • f_auc_glm_binomial_poly2(): As above with second degree polynomial.

    • f_auc_gam_binomial(): Quasibinomial GAM with weighted cases.

    • f_auc_rpart(): Recursive Partition Tree with weighted cases.

    • f_auc_rf(): Random Forest model with weighted cases.

  • Categorical response (character or factor) vs. categorical predictor:

    • f_v(): Cramer's V between two categorical variables.

  • Categorical response vs. categorical or numerical predictor:

    • f_v_rf_categorical(): Cramer's V of a Random Forest model.
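To make the R-squared entries in the list above concrete, the sketch below computes the three simplest quantities with base R only. It assumes that "R-squared" means the squared correlation between two vectors; it is an illustration on simulated data (the x and y here are hypothetical), not the package's implementation.

```r
# Simulated data (hypothetical, for illustration only)
set.seed(1)
x <- stats::rnorm(200)
y <- 2 * x + stats::rnorm(200)

# Pearson's R-squared: squared linear correlation of predictor and response
r2_pearson <- stats::cor(x, y, method = "pearson")^2

# Spearman's R-squared: the same idea computed on ranks
r2_spearman <- stats::cor(x, y, method = "spearman")^2

# R-squared of the response versus the predictions of a Gaussian GLM
m <- stats::glm(y ~ x, family = stats::gaussian())
r2_glm <- stats::cor(y, stats::fitted(m))^2

c(pearson = r2_pearson, spearman = r2_spearman, glm = r2_glm)
```

With a single linear term, the GLM-based value matches the plain Pearson R-squared; the two diverge once the model includes polynomial or smooth terms.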

The name of the function used is stored in the attribute "f_name" of the output data frame. It can be retrieved via attributes(df)$f_name.

Additionally, any custom function accepting a data frame with the columns "x" (predictor) and "y" (response) and returning a numeric indicator of association where higher numbers indicate higher association will work.
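As a minimal sketch of such a custom function, the example below defines a hypothetical f_abs_kendall() that takes a data frame with the columns "x" and "y" and returns the absolute Kendall correlation. The function name and the toy data are assumptions made for illustration; only the x/y input contract and the "higher means stronger" convention come from the description above.

```r
# Hypothetical custom association function:
# expects a data frame with columns "x" (predictor) and "y" (response)
f_abs_kendall <- function(df) {
  stopifnot(is.data.frame(df), all(c("x", "y") %in% names(df)))
  # absolute value so that strong negative associations also rank high
  abs(stats::cor(df$x, df$y, method = "kendall", use = "complete.obs"))
}

# Toy check on simulated data
set.seed(1)
toy <- data.frame(x = 1:50)
toy$y <- -toy$x + stats::rnorm(50)
score <- f_abs_kendall(toy)
score

# Passing it to preference_order() would then look like this
# (commented out because it requires the collinear package and its data):
# preference_order(
#   df = df,
#   response = "vi_numeric",
#   predictors = predictors_numeric,
#   f = f_abs_kendall
# )
```

Returning a single number per predictor keeps the function compatible with the ranking step, which only needs to compare scores across predictors.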

This function returns a data frame in which the column "predictor" holds the predictor names, ordered by the column "preference", which holds the result of f. This data frame, or the column "predictor" alone, can be used as input for the argument preference_order in collinear(), cor_select(), and vif_select().
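The sketch below mimics the shape of this output with a hand-made data frame, so the predictor names and values are made up. It assumes that a higher preference means a higher priority, and shows how the "predictor" column alone can be extracted as an ordered character vector.

```r
# Hand-made stand-in for the output of preference_order() (values are made up)
pref <- data.frame(
  predictor = c("a", "b", "c"),
  preference = c(0.12, 0.57, 0.33),
  stringsAsFactors = FALSE
)

# Order rows from strongest to weakest association (assumed ordering)
pref <- pref[order(pref$preference, decreasing = TRUE), ]

# The ordered predictor names alone can also serve as a preference order
pref_vector <- pref$predictor
pref_vector
```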

Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).

Accepts a character vector of response variables as input for the argument response. When more than one response is provided, the output is a named list of preference data frames.

Usage

preference_order(
  df = NULL,
  response = NULL,
  predictors = NULL,
  f = "auto",
  warn_limit = NULL,
  quiet = FALSE
)

Arguments

df

(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL.

response

(optional; character string or vector) Name/s of response variable/s in df. Used in target encoding when it names a numeric variable and there are categorical predictors, and to compute preference order. Default: NULL.

predictors

(optional; character vector) Names of the predictors to select from df. If omitted, all numeric columns in df are used instead. If the argument response is not provided, non-numeric variables are ignored. Default: NULL.

f

(optional; function) Function to compute preference order. If "auto" (default) or NULL, the output of f_auto() for the given data is used:

  • f_auc_rf(): if response is binomial.

  • f_r2_pearson(): if response and predictors are numeric.

  • f_v(): if response and predictors are categorical.

  • f_v_rf_categorical(): if response is categorical and predictors are numeric or mixed.

  • f_r2_rf(): in all other cases.

Default: "auto"
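The selection rules listed above can be restated as code. The helper below, choose_f_name(), is hypothetical and is not the package's f_auto(); it only mirrors the rules as written, under the assumptions that a binomial response is a numeric vector of 0s and 1s and that any non-numeric predictor counts as categorical.

```r
# Hypothetical helper mirroring the selection rules listed above.
# This is NOT the package's f_auto(); it only restates the rules as code.
choose_f_name <- function(df, response, predictors) {
  y <- df[[response]]
  x_numeric <- vapply(df[predictors], is.numeric, logical(1))
  y_binomial <- is.numeric(y) && all(stats::na.omit(y) %in% c(0, 1))
  y_categorical <- is.character(y) || is.factor(y)

  if (y_binomial) return("f_auc_rf")
  if (is.numeric(y) && all(x_numeric)) return("f_r2_pearson")
  if (y_categorical && !any(x_numeric)) return("f_v")
  if (y_categorical) return("f_v_rf_categorical")
  "f_r2_rf"
}

# Toy data frame (made up) covering each case
toy <- data.frame(
  y_num = c(0.5, 1.2, 3.1, 2.2),
  y_bin = c(0, 1, 1, 0),
  y_cat = c("a", "b", "a", "b"),
  x1 = c(1, 2, 3, 4),
  x2 = c("u", "v", "u", "v"),
  stringsAsFactors = FALSE
)

choose_f_name(toy, "y_num", "x1")           # numeric vs numeric
choose_f_name(toy, "y_bin", "x1")           # binomial response
choose_f_name(toy, "y_cat", "x2")           # categorical vs categorical
choose_f_name(toy, "y_cat", c("x1", "x2"))  # categorical vs mixed
choose_f_name(toy, "y_num", c("x1", "x2"))  # all other cases
```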

warn_limit

(optional; numeric) Preference value (R-squared, AUC, or Cramer's V) above which a warning flagging suspicious predictors is issued. Disabled if NULL. Default: NULL.
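The idea behind warn_limit can be sketched on a hand-made preference table. The threshold of 0.85 and the table values below are hypothetical, and the sketch prints a message instead of raising a real warning; the actual wording and mechanics of the package's warning are not shown here.

```r
# Hand-made preference table (values are made up)
pref <- data.frame(
  predictor = c("a", "b", "c"),
  preference = c(0.99, 0.57, 0.33),
  stringsAsFactors = FALSE
)

warn_limit <- 0.85 # hypothetical threshold

# Predictors whose association with the response exceeds the limit
suspicious <- pref$predictor[pref$preference > warn_limit]

# Flag them (a message here; the package issues a warning)
if (length(suspicious) > 0) {
  message(
    "Predictors with preference above ", warn_limit, ": ",
    paste(suspicious, collapse = ", ")
  )
}
```

A predictor that explains the response almost perfectly is often a sign of leakage or of a variable derived from the response, which is why such cases are worth flagging before filtering.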

quiet

(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE.

Value

data frame: columns are "response", "predictor", "f" (function name), and "preference". When more than one response is provided, a named list of such data frames.

Author(s)

Blas M. Benito, PhD

Examples

#subsets to limit example run time
df <- vi[1:1000, ]
predictors <- vi_predictors[1:10]
predictors_numeric <- vi_predictors_numeric[1:10]

#parallelization setup
future::plan(
  future::multisession,
  workers = 2 #set to parallelly::availableCores() - 1
)

#progress bar
# progressr::handlers(global = TRUE)

#numeric response and predictors
#------------------------------------------------
#selects f automatically depending on data features
#applies f_r2_pearson() to compute the R-squared between the response and each predictor
df_preference <- preference_order(
  df = df,
  response = "vi_numeric",
  predictors = predictors_numeric,
  f = NULL
)

#returns data frame ordered by preference
df_preference


#several responses
#------------------------------------------------
responses <- c(
  "vi_categorical",
  "vi_counts"
)

preference_list <- preference_order(
  df = df,
  response = responses,
  predictors = predictors
)

#returns a named list
names(preference_list)
preference_list[[1]]
preference_list[[2]]

#can be used in collinear()
# x <- collinear(
#   df = df,
#   response = responses,
#   predictors = predictors,
#   preference_order = preference_list
# )

#f function selected by user
#for binomial response and numeric predictors
# preference_order(
#   df = vi,
#   response = "vi_binomial",
#   predictors = predictors_numeric,
#   f = f_auc_glm_binomial
# )


#disable parallelization
future::plan(future::sequential)

collinear documentation built on April 12, 2025, 1:36 a.m.