collinear | R Documentation |
Automates multicollinearity management in data frames with numeric and non-numeric predictors by combining four methods:
Target Encoding: When a numeric response is provided and encoding_method is not NULL, it transforms categorical predictors (classes "character" and "factor") to numeric using the response values as reference. See target_encoding_lab() for further details.
Preference Order: When a response of any type is provided via response, the association between the response and each predictor is computed with an appropriate function (see preference_order() and f_auto()), and all predictors are ranked from higher to lower association. This rank is used to preserve important predictors during the multicollinearity filtering.
Pairwise Correlation Filtering: Automated multicollinearity filtering via pairwise correlation. Correlations between numeric and categorical predictors are computed by target-encoding the categorical predictor against the numeric one, and correlations between categorical predictors are computed via Cramer's V. See cor_select(), cor_df(), and cor_cramer_v() for further details.
VIF Filtering: Automated algorithm to identify and remove numeric predictors that are linear combinations of other predictors. See vif_select() and vif_df().
Accepts a parallelization setup via future::plan() and a progress bar via progressr::handlers() (see examples).
Accepts a character vector of response variables as input for the argument response. When more than one response is provided, the output is a named list of character vectors.
collinear(
df = NULL,
response = NULL,
predictors = NULL,
encoding_method = "loo",
preference_order = "auto",
f = "auto",
max_cor = 0.75,
max_vif = 5,
quiet = FALSE
)
df |
(required; data frame, tibble, or sf) A data frame with responses and predictors. Default: NULL. |
response |
(optional; character string or vector) Name/s of response variable/s in |
predictors |
(optional; character vector) Names of the predictors to select from |
encoding_method |
(optional; character string). Name of the target encoding method. One of: "loo", "mean", or "rank". If NULL, target encoding is disabled. Default: "loo" |
preference_order |
(optional; string, character vector, output of
. Default: "auto" |
f |
(optional: function) Function to compute preference order. If "auto" (default) or NULL, the output of
Default: "auto" |
max_cor |
(optional; numeric) Maximum correlation allowed between any pair of variables in |
max_vif |
(optional; numeric) Maximum Variance Inflation Factor allowed during variable selection. Recommended values are between 2.5 and 10. Higher values return a larger number of predictors with higher multicollinearity. If NULL, the variance inflation analysis is disabled. Default: 5. |
quiet |
(optional; logical) If FALSE, messages generated during the execution of the function are printed to the console. Default: FALSE |
character vector if response is NULL or is a string.
named list if response is a character vector.
When the argument response names a numeric response variable, categorical predictors in predictors (or in the columns of df if predictors is NULL) are converted to numeric via target encoding with the function target_encoding_lab(). When response is NULL or names a categorical variable, target encoding is skipped. This feature facilitates multicollinearity filtering in data frames with mixed column types.
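To make the idea concrete, here is a minimal base R sketch of leave-one-out target encoding, the idea behind the default encoding_method "loo". The helper loo_encode() and the toy data are hypothetical and only illustrate the principle; the actual implementation in target_encoding_lab() may differ in its details.

```r
# Hypothetical helper (not part of the package): leave-one-out target encoding.
# Each row receives the mean of the response over all OTHER rows in its group.
loo_encode <- function(x, y) {
  group_sum <- ave(y, x, FUN = sum)    # per-group sum of the response
  group_n <- ave(y, x, FUN = length)   # per-group number of rows
  # groups with a single row have no "other" rows: fall back to the global mean
  ifelse(group_n > 1, (group_sum - y) / (group_n - 1), mean(y))
}

# toy data: numeric response and one categorical predictor
toy <- data.frame(
  y = c(1, 2, 3, 10, 20),
  soil = c("a", "a", "a", "b", "b")
)

toy$soil_encoded <- loo_encode(x = toy$soil, y = toy$y)
toy
```

After encoding, the categorical column can be treated like any other numeric predictor during correlation and VIF filtering.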
This feature is designed to help protect important predictors during the multicollinearity filtering. It involves the arguments preference_order and f.
The argument preference_order accepts:
A character vector of predictor names in a custom order of preference, from first to last. This vector does not need to contain all predictor names, only the ones relevant to the user.
A data frame returned by preference_order(), which ranks predictors based on their association with a response variable.
If NULL, and response is provided, then preference_order() is used internally to rank the predictors using the function f. If f is NULL as well, then f_auto() selects a suitable function based on the data properties.
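The ranking idea can be sketched in base R as follows. This is only an illustration for numeric predictors and a numeric response, using the absolute Pearson correlation as the association measure; the helper rank_by_association(), its output columns, and the toy data are hypothetical and do not reproduce the output format or the functions actually used by preference_order() and f_auto().

```r
# Hypothetical helper (not part of the package): rank numeric predictors
# by their absolute Pearson correlation with a numeric response.
rank_by_association <- function(df, response, predictors) {
  association <- vapply(
    predictors,
    function(p) abs(cor(df[[p]], df[[response]])),
    numeric(1)
  )
  out <- data.frame(
    predictor = predictors,
    association = unname(association)
  )
  # higher association first
  out[order(out$association, decreasing = TRUE), ]
}

# toy data: x1 drives the response, x2 contributes little, x3 is noise
set.seed(1)
n <- 200
toy <- data.frame(x1 = rnorm(n), x2 = rnorm(n), x3 = rnorm(n))
toy$y <- 2 * toy$x1 + 0.2 * toy$x2 + rnorm(n, sd = 0.5)

ranking <- rank_by_association(
  df = toy,
  response = "y",
  predictors = c("x1", "x2", "x3")
)
ranking
```

The predictors at the top of such a ranking are the ones a filtering step would try to preserve when two predictors are too strongly related.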
The Variance Inflation Factor for a given variable a is computed as 1/(1-R2), where R2 is the multiple R-squared of a multiple regression model fitted using a as response and all other predictors in the input data frame as predictors, as in a = b + c + ....
The square root of the VIF of a is the factor by which the confidence interval of the estimate for a in the linear model y = a + b + c + ... is widened by multicollinearity in the model predictors.
VIF values are always equal to or greater than 1 and have no upper bound. The recommended thresholds for maximum VIF vary depending on the source consulted; the most common values are 2.5, 5, and 10.
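The formula above can be checked by hand with base R. The sketch below uses simulated toy data (an assumption for illustration, not a dataset from the package) and does not call vif_df(); it computes the VIF of a from the R-squared of the regression a ~ b + c, and compares it with the diagonal of the inverse correlation matrix, a standard equivalent way of obtaining all VIFs at once.

```r
# toy data: a is close to a linear combination of b and c
set.seed(1)
n <- 200
toy <- data.frame(b = rnorm(n), c = rnorm(n))
toy$a <- toy$b + toy$c + rnorm(n, sd = 0.5)

# VIF of a from the multiple R-squared of a ~ b + c
r2 <- summary(lm(a ~ b + c, data = toy))$r.squared
vif_a <- 1 / (1 - r2)

# all VIFs at once: diagonal of the inverse of the correlation matrix
vif_all <- setNames(diag(solve(cor(toy))), colnames(toy))

vif_a
vif_all

# factor by which the confidence interval of the estimate for a is widened
sqrt(vif_a)
```

Both routes return the same value for a, up to floating point error.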
The function vif_select() computes Variance Inflation Factors and removes variables iteratively, until all variables in the resulting selection have a VIF below max_vif.
If the argument preference_order is not provided, all variables are ranked from lower to higher VIF, as returned by vif_df(), and the variable with the highest VIF above max_vif is removed on each iteration.
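A simplified version of this iterative removal, for the case without preference order, could look like the sketch below. The helper vif_backward() and the toy data are hypothetical; the code only illustrates the loop described above for numeric columns and is not the implementation used by vif_select().

```r
# Hypothetical helper (not part of the package): drop the variable with the
# highest VIF until every remaining VIF is at or below max_vif.
# Note: solve() fails if two columns are perfectly collinear.
vif_backward <- function(df, max_vif = 5) {
  selected <- colnames(df)
  while (length(selected) > 1) {
    vif <- setNames(
      diag(solve(cor(df[, selected, drop = FALSE]))),
      selected
    )
    if (max(vif) <= max_vif) break
    selected <- setdiff(selected, names(vif)[which.max(vif)])
  }
  selected
}

# toy data: a is almost exactly b + c, so the three VIFs start very high
set.seed(1)
n <- 200
toy <- data.frame(b = rnorm(n), c = rnorm(n))
toy$a <- toy$b + toy$c + rnorm(n, sd = 0.1)

kept <- vif_backward(df = toy, max_vif = 5)
kept
```

Removing a single variable is enough here, because any two of the three toy variables are only moderately related to each other.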
If preference_order is defined, whenever two or more variables are above max_vif, the one higher in preference_order is preserved, and the next one with a higher VIF is removed. For example, for the predictors and preference order a and b, if any of their VIFs is higher than max_vif, then b will be removed regardless of whether its VIF is lower or higher than that of a. If both VIFs are lower than max_vif, then both are preserved.
The function cor_select() applies a recursive forward selection algorithm to keep predictors with a maximum Pearson correlation with all other selected predictors lower than max_cor.
If the argument preference_order is NULL, the predictors are ranked from lower to higher sum of absolute pairwise correlation with all other predictors.
If preference_order is defined, whenever two or more variables are above max_cor, the one higher in preference_order is preserved. For example, for the predictors and preference order a and b, if their correlation is higher than max_cor, then b will be removed and a preserved. If their correlation is lower than max_cor, then both are preserved.
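The forward selection described above can be sketched in base R for numeric predictors. The helper cor_forward() and the toy data are hypothetical; this is an illustration of the logic under the stated assumptions, not the code behind cor_select(), which also handles categorical predictors.

```r
# Hypothetical helper (not part of the package): forward selection by
# pairwise correlation, honoring an optional preference order.
cor_forward <- function(df, max_cor = 0.75, preference_order = NULL) {
  cor_abs <- abs(cor(df))
  diag(cor_abs) <- 0
  # default rank: lower to higher sum of absolute pairwise correlations
  default_rank <- names(sort(colSums(cor_abs)))
  # user preferences go first, remaining predictors follow the default rank
  preference_order <- intersect(preference_order, colnames(df))
  ranking <- c(preference_order, setdiff(default_rank, preference_order))
  selected <- ranking[1]
  for (candidate in ranking[-1]) {
    # keep the candidate only if it is not too correlated with any selected one
    if (max(cor_abs[candidate, selected]) < max_cor) {
      selected <- c(selected, candidate)
    }
  }
  selected
}

# toy data: a and b are nearly identical, c is independent
set.seed(1)
n <- 200
toy <- data.frame(a = rnorm(n), c = rnorm(n))
toy$b <- toy$a + rnorm(n, sd = 0.1)

keep_a <- cor_forward(toy, max_cor = 0.75, preference_order = c("a", "b"))
keep_b <- cor_forward(toy, max_cor = 0.75, preference_order = c("b", "a"))
keep_a
keep_b
```

Swapping the preference order swaps which of the two correlated predictors survives, while the independent predictor c is kept in both cases.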
Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons. DOI: 10.1002/0471725153.
Micci-Barreca, D. (2001) A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32. DOI: 10.1145/507533.507538
#load the package (provides collinear() and the example data frame vi)
library(collinear)

#parallelization setup
future::plan(
future::multisession,
workers = 2 #set to parallelly::availableCores() - 1
)
#progress bar
#progressr::handlers(global = TRUE)
#subset to limit example run time
df <- vi[1:500, ]
#predictors have mixed types
#small subset to speed up the example
predictors <- c(
"swi_mean",
"soil_type",
"soil_temperature_mean",
"growing_season_length",
"rainfall_mean"
)
#with numeric responses
#--------------------------------
# target encoding
# automated preference order
# all predictors filtered by correlation and VIF
x <- collinear(
df = df,
response = c(
"vi_numeric",
"vi_binomial"
),
predictors = predictors
)
x
#with custom preference order
#--------------------------------
x <- collinear(
df = df,
response = "vi_numeric",
predictors = predictors,
preference_order = c(
"swi_mean",
"soil_type"
)
)
#pre-computed preference order
#--------------------------------
preference_df <- preference_order(
df = df,
response = "vi_numeric",
predictors = predictors
)
x <- collinear(
df = df,
response = "vi_numeric",
predictors = predictors,
preference_order = preference_df
)
#resetting to sequential processing
future::plan(future::sequential)