View source: R/preference_order.R
preference_order | R Documentation |
Ranks a set of predictors by the strength of their association with a response. Aims to minimize the loss of important predictors during multicollinearity filtering.
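The idea of ranking predictors by their association with a response can be sketched in base R. This is an illustration only, not the package's implementation: the helper rank_by_r2() is hypothetical, it uses the built-in mtcars data, and it assumes squared Pearson correlation as the association measure.

```r
# Hypothetical helper (not part of the package): rank predictors by their
# squared Pearson correlation with the response, highest first.
rank_by_r2 <- function(df, response, predictors) {
  r2 <- vapply(
    predictors,
    function(p) stats::cor(df[[response]], df[[p]], use = "complete.obs")^2,
    numeric(1)
  )
  out <- data.frame(
    response = response,
    predictor = predictors,
    preference = unname(r2),
    stringsAsFactors = FALSE
  )
  # order rows from strongest to weakest association
  out <- out[order(out$preference, decreasing = TRUE), ]
  rownames(out) <- NULL
  out
}

# toy usage with the built-in mtcars data
ranking <- rank_by_r2(
  df = mtcars,
  response = "mpg",
  predictors = c("wt", "hp", "qsec", "drat")
)
ranking
```

Predictors near the top of such a ranking are the ones a multicollinearity filter would preferably keep.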
The strength of association between the response and each predictor is computed by the function f. The f functions available are:
Numeric response vs. numeric predictor:
f_r2_pearson(): Pearson's R-squared.
f_r2_spearman(): Spearman's R-squared.
f_r2_glm_gaussian(): Pearson's R-squared of response versus the predictions of a Gaussian GLM.
f_r2_glm_gaussian_poly2(): Gaussian GLM with second degree polynomial.
f_r2_gam_gaussian(): GAM model fitted with mgcv::gam().
f_r2_rpart(): Recursive Partition Tree fitted with rpart::rpart().
f_r2_rf(): Random Forest model fitted with ranger::ranger().
Integer counts response vs. numeric predictor:
f_r2_glm_poisson(): Pearson's R-squared of a Poisson GLM.
f_r2_glm_poisson_poly2(): Poisson GLM with second degree polynomial.
f_r2_gam_poisson(): Poisson GAM.
Binomial response (1 and 0) vs. numeric predictor:
f_auc_glm_binomial(): AUC of quasibinomial GLM with weighted cases.
f_auc_glm_binomial_poly2(): As above with second degree polynomial.
f_auc_gam_binomial(): Quasibinomial GAM with weighted cases.
f_auc_rpart(): Recursive Partition Tree with weighted cases.
f_auc_rf(): Random Forest model with weighted cases.
Categorical response (character or factor) vs. categorical predictor:
f_v(): Cramer's V between two categorical variables.
Categorical response vs. categorical or numerical predictor:
f_v_rf_categorical(): Cramer's V of a Random Forest model.
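For reference, Cramer's V between two categorical variables can be computed in base R from the chi-squared statistic of their contingency table. The helper cramers_v() below is hypothetical and uses the textbook formula without bias correction; the package's f_v() may differ in its details.

```r
# Hypothetical helper (not part of the package): Cramer's V from the
# chi-squared statistic of a contingency table, without bias correction.
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # warnings about small expected counts are silenced for this toy example
  chi2 <- suppressWarnings(
    stats::chisq.test(tab, correct = FALSE)$statistic
  )
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab))
  # assumes both variables have at least two levels (k > 1)
  sqrt(as.numeric(chi2) / (n * (k - 1)))
}

# toy usage with the built-in mtcars data
v_cyl_gear <- cramers_v(mtcars$cyl, mtcars$gear)
v_cyl_gear
```

Values range from 0 (no association) to 1 (perfect association), so higher values indicate a stronger association.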
The name of the function used is stored in the attribute "f_name" of the output data frame. It can be retrieved via attributes(df)$f_name.
Additionally, any custom function accepting a data frame with the columns "x" (predictor) and "y" (response) and returning a numeric indicator of association where higher numbers indicate higher association will work.
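A minimal sketch of such a custom function follows. The name f_abs_kendall() is hypothetical and the choice of Kendall's rank correlation is only an example; the function takes a data frame with the columns "x" and "y" and returns a single number where higher means stronger association.

```r
# Hypothetical custom f: absolute Kendall rank correlation between
# the predictor (column "x") and the response (column "y").
f_abs_kendall <- function(df) {
  abs(
    stats::cor(
      x = df$x,
      y = df$y,
      method = "kendall",
      use = "complete.obs"
    )
  )
}

# quick check on a toy data frame with the expected column names
toy <- data.frame(x = 1:10, y = (1:10)^2)
toy_preference <- f_abs_kendall(toy)
toy_preference
```

Assuming the package calls it as described above, such a function would be passed through the argument f, for example f = f_abs_kendall.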
This function returns a data frame in which the column "predictor" holds the predictor names, ordered by the column "preference", which holds the result of f. This data frame, or the column "predictor" alone, can be used as input for the argument preference_order in collinear(), cor_select(), and vif_select().
Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).
Accepts a character vector of response variables as input for the argument response. When more than one response is provided, the output is a named list of preference data frames.
preference_order(
  df = NULL,
  response = NULL,
  predictors = NULL,
  f = "auto",
  warn_limit = NULL,
  quiet = FALSE
)
df |
(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL. |
response |
(optional; character string or vector) Name/s of response variable/s in |
predictors |
(optional; character vector) Names of the predictors to select from |
f |
(optional; function) Function to compute preference order. If "auto" (default) or NULL, the output of
Default: "auto" |
warn_limit |
(optional; numeric) Preference value (R-squared, AUC, or Cramer's V) above which a warning flagging suspicious predictors is issued. Disabled if NULL. Default: NULL |
quiet |
(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE |
data frame: columns are "response", "predictor", "f" (function name), and "preference".
Blas M. Benito, PhD
#subsets to limit example run time
df <- vi[1:1000, ]
predictors <- vi_predictors[1:10]
predictors_numeric <- vi_predictors_numeric[1:10]

#parallelization setup
future::plan(
  future::multisession,
  workers = 2 #set to parallelly::availableCores() - 1
)

#progress bar
# progressr::handlers(global = TRUE)

#numeric response and predictors
#------------------------------------------------
#selects f automatically depending on data features
#applies f_r2_pearson() to compute correlation between response and predictors
df_preference <- preference_order(
  df = df,
  response = "vi_numeric",
  predictors = predictors_numeric,
  f = NULL
)

#returns data frame ordered by preference
df_preference

#several responses
#------------------------------------------------
responses <- c(
  "vi_categorical",
  "vi_counts"
)

preference_list <- preference_order(
  df = df,
  response = responses,
  predictors = predictors
)

#returns a named list
names(preference_list)
preference_list[[1]]
preference_list[[2]]

#can be used in collinear()
# x <- collinear(
#   df = df,
#   response = responses,
#   predictors = predictors,
#   preference_order = preference_list
# )

#f function selected by user
#for binomial response and numeric predictors
# preference_order(
#   df = vi,
#   response = "vi_binomial",
#   predictors = predictors_numeric,
#   f = f_auc_glm_binomial
# )

#disable parallelization
future::plan(future::sequential)