preference_order: Rank predictors by importance or multicollinearity

preference_order    R Documentation

Description

Generates a valid input for the argument preference_order of the functions vif_select(), cor_select(), collinear_select(), and collinear(). This argument helps preserve important predictors during multicollinearity filtering.

The function works in two different ways:

  • When f is NULL, it ranks the predictors from lower to higher multicollinearity, computed as one minus the average Pearson correlation between each predictor and all others. This option is useful when the goal is to limit redundancy in a large dataset and there is no specific model in mind.

  • When responses and f are not NULL, it ranks the predictors by the strength of their association with a response, as evaluated with univariate models. This is the recommended option when the end goal is to train a model.
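
The correlation-based ranking used when f is NULL can be sketched in base R. This is an illustration of the idea described above, not the package's internal code: the use of absolute correlations, the mtcars dataset, and all object names are assumptions made for the example.

```r
# Illustrative sketch only: rank predictors from lower to higher multicollinearity.
# Assumes absolute Pearson correlations; mtcars stands in for a real dataset.
predictors <- c("mpg", "disp", "hp", "drat", "wt", "qsec")

# pairwise Pearson correlation matrix (absolute values)
m <- abs(stats::cor(mtcars[, predictors], use = "pairwise.complete.obs"))

# ignore self-correlations when averaging
diag(m) <- NA

# score: one minus the mean correlation with all other predictors
score <- 1 - rowMeans(m, na.rm = TRUE)

# higher score = less redundant = ranked first
ranking <- data.frame(
  predictor = names(score),
  score = unname(score),
  stringsAsFactors = FALSE
)
ranking <- ranking[order(ranking$score, decreasing = TRUE), ]
ranking$rank <- seq_len(nrow(ranking))
rownames(ranking) <- NULL

ranking
```

Predictors at the top of this table share the least information with the others, so they are the ones a multicollinearity filter would try to keep.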

The argument f (requires a valid responses argument) defines how the strength of association between the response and each predictor is computed. By default it calls f_auto(), which uses f_auto_rules() to select a suitable function depending on the types of the response and the predictors. This option is designed to provide sensible, general-purpose defaults optimized for speed and stability rather than for any specific modeling approach.

For more fine-tuned control, the package offers the following f functions (see f_functions()):

  • Numeric response:

    • f_numeric_glm(): Pearson's R-squared of response versus the predictions of a Gaussian GLM.

    • f_numeric_gam(): GAM model fitted with mgcv::gam().

    • f_numeric_rf(): Random Forest model fitted with ranger::ranger().

  • Integer counts response:

    • f_count_glm(): Pearson's R-squared of a Poisson GLM.

    • f_count_gam(): Poisson GAM.

    • f_count_rf(): Random Forest model fitted with ranger::ranger().

  • Binomial response (1 and 0):

    • f_binomial_glm(): AUC of Quasibinomial GLM with weighted cases.

    • f_binomial_gam(): AUC of Quasibinomial GAM with weighted cases.

    • f_binomial_rf(): AUC of a Random Forest model with weighted cases.

  • Categorical response:

    • f_categorical_rf(): Cramer's V of the response against the predictions of a classification Random Forest model.
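
As a rough illustration of what a univariate score of this kind looks like, the following base R sketch computes Pearson's R-squared between a numeric response and the predictions of a Gaussian GLM, in the spirit of the description of f_numeric_glm(). It is not the package's implementation; the mtcars columns and the object names are assumptions chosen for the example.

```r
# Illustrative sketch only: score one predictor against a numeric response.
# The data frame has the columns "x" (predictor) and "y" (response).
df_xy <- data.frame(
  x = mtcars$wt,
  y = mtcars$mpg
)

# univariate Gaussian GLM
model <- stats::glm(
  formula = y ~ x,
  data = df_xy,
  family = stats::gaussian(link = "identity")
)

# Pearson's R-squared between observations and predictions
r_squared <- stats::cor(
  x = df_xy$y,
  y = stats::predict(model, type = "response")
)^2

r_squared
```

Repeating this computation for every predictor and sorting the scores in decreasing order yields a preference order for a single response.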

These functions accept a cross-validation setup via the arguments cv_iterations and cv_training_fraction.
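
The role of these two arguments can be sketched with repeated random train/test splits. The helper cv_score() below is hypothetical and not part of the package; it assumes that the model is fitted on the training rows, scored on the held-out rows, and that the scores are averaged across iterations, none of which is stated by this page.

```r
# Illustrative sketch only: repeated random train/test splits for one predictor.
# cv_score() is a hypothetical helper, not a function of the package.
# Requires cv_training_fraction < 1 so that some rows are held out.
cv_score <- function(df, cv_training_fraction = 0.5, cv_iterations = 10, seed = 1) {
  set.seed(seed)
  n_train <- floor(nrow(df) * cv_training_fraction)

  scores <- vapply(
    X = seq_len(cv_iterations),
    FUN = function(i) {
      # random split into training and testing rows
      train_rows <- sample(x = nrow(df), size = n_train)
      train <- df[train_rows, , drop = FALSE]
      test <- df[-train_rows, , drop = FALSE]

      # fit on training rows, evaluate on held-out rows (assumption)
      model <- stats::lm(y ~ x, data = train)
      stats::cor(test$y, stats::predict(model, newdata = test))^2
    },
    FUN.VALUE = numeric(1)
  )

  # average score across iterations (assumption)
  mean(scores)
}

df_xy <- data.frame(x = mtcars$wt, y = mtcars$mpg)

cv_score(df_xy, cv_training_fraction = 0.5, cv_iterations = 30)
```

More iterations smooth out the variability introduced by the random splits, which is why small datasets tend to need more of them.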

Additionally, the argument f accepts any custom function taking a dataframe with the columns "x" (predictor) and "y" (response) and returning a numeric indicator of association.
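
A custom function can be checked on its own before it is passed to f. The sketch below defines a hypothetical f_spearman(), which is not part of the package, and calls it on a small data frame with the expected "x" and "y" columns. It includes the ellipsis argument because the examples on this page note that custom functions need it.

```r
# Illustrative sketch only: a custom association function based on
# the absolute Spearman correlation. f_spearman() is a hypothetical name.
f_spearman <- function(df, ...) {
  abs(
    stats::cor(
      x = df$x,
      y = df$y,
      method = "spearman",
      use = "complete.obs"
    )
  )
}

# standalone check on a monotonic, non-linear relationship
toy <- data.frame(
  x = 1:10,
  y = (1:10)^3
)

f_spearman(toy)
```

Taking the absolute value keeps strongly negative associations from being ranked below weak positive ones.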

The function accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).

The argument responses accepts a character vector of response variables. When more than one response is provided, the output is a named list of preference data frames.

Usage

preference_order(
  df = NULL,
  responses = NULL,
  predictors = NULL,
  f = f_auto,
  cv_training_fraction = 1,
  cv_iterations = 1,
  seed = 1,
  quiet = FALSE,
  ...
)

Arguments

df

(required; dataframe, tibble, or sf) A dataframe with responses (optional) and predictors. Must have at least 10 rows for pairwise correlation analysis, and 10 * (length(predictors) - 1) for VIF. Default: NULL.

responses

(optional; character, character vector, or NULL) Name of one or several response variables in df. Default: NULL.

predictors

(optional; character vector or NULL) Names of the predictors in df. If NULL, all columns except responses and constant/near-zero-variance columns are used. Default: NULL.

f

(optional; function name) Unquoted function name without parentheses (see f_functions()). By default, it calls f_auto(), which selects a suitable function depending on the nature of the response and predictors. Set to NULL if responses = NULL. If NULL, predictors are ranked from lower to higher multicollinearity. Default: f_auto

cv_training_fraction

(optional; numeric) Value between 0.1 and 1 defining the training fraction used in cross-validation. If 1 (default), no cross-validation is performed, and the resulting metric is computed from all observations and predictions. Automatically set to 1 when cv_iterations = 1. Default: 1

cv_iterations

(optional; integer) Number of cross-validation iterations to perform. The recommended range lies between 30 and 100. In general, smaller datasets and large values of cv_training_fraction require more iterations to achieve stability. Automatically set to 1 when cv_training_fraction = 1. Default: 1

seed

(optional; integer) Random seed, required for reproducibility when using cross-validation or random forest models. Default: 1

quiet

(optional; logical) If FALSE, messages are printed. Default: FALSE.

...

(optional) Internal args (e.g. function_name for validate_arg_function_name, a precomputed correlation matrix m, or cross-validation args for preference_order).

Value

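Dataframe with the following columns (or a named list of such dataframes when more than one response is provided):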

  • response: character, response name, if any, or "none" otherwise.

  • predictor: character, name of the predictor.

  • f: name of the function used to compute the preference order. If argument f is NULL, the value "stats::cor()" is added to this column.

  • metric: name of the metric used to assess strength of association. Usually one of "R-squared", "AUC" (Area Under the ROC Curve), or "Cramer's V". If f is a custom function not in f_functions(), then metric is set to "custom". If f is NULL, then "1 - R-squared" is returned in this column.

  • score: value of the metric returned by f to assess the association between the response and each given predictor.

  • rank: integer value indicating the rank of the predictor.

Author(s)

Blas M. Benito, PhD

See Also

Other preference_order_functions: f_binomial_gam(), f_binomial_glm(), f_binomial_rf(), f_categorical_rf(), f_count_gam(), f_count_glm(), f_count_rf(), f_numeric_gam(), f_numeric_glm(), f_numeric_rf()

Examples

#load example data
data(
  vi_smol,
  vi_predictors_numeric
)

##OPTIONAL: parallelization setup
# future::plan(
#   future::multisession,
#   workers = future::availableCores() - 1
# )

##OPTIONAL: progress bar
##does not work in R examples
# progressr::handlers(global = TRUE)

#ranking predictors from lower to higher multicollinearity
#------------------------------------------------
x <- preference_order(
  df = vi_smol,
  responses = NULL, #default value
  predictors = vi_predictors_numeric[1:10],
  f = NULL #must be explicit
)

x

#automatic selection of ranking function
#------------------------------------------------
x <- preference_order(
  df = vi_smol,
  responses = c("vi_numeric", "vi_categorical"),
  predictors = vi_predictors_numeric[1:10],
  f = f_auto
)

x

#user selection of ranking function
#------------------------------------------------
#Quasibinomial GLM for a binomial response
x <- preference_order(
  df = vi_smol,
  responses = "vi_binomial",
  predictors = vi_predictors_numeric[1:10],
  f = f_binomial_glm
)

x

#cross-validation
#------------------------------------------------
x <- preference_order(
  df = vi_smol,
  responses = "vi_binomial",
  predictors = vi_predictors_numeric[1:10],
  f = f_binomial_glm,
  cv_training_fraction = 0.5,
  cv_iterations = 10
)

x

#custom pairwise correlation function
#------------------------------------------------
#custom functions need the ellipsis argument
f_rsquared <- function(df, ...){
  stats::cor(
    x = df$x,
    y = df$y,
    use = "complete.obs"
  )^2
}

x <- preference_order(
  df = vi_smol,
  responses = "vi_numeric",
  predictors = vi_predictors_numeric[1:10],
  f = f_rsquared
)

x

#resetting to sequential processing
#future::plan(future::sequential)

collinear documentation built on Dec. 8, 2025, 5:06 p.m.