| collinear_select | R Documentation |
Automates multicollinearity filtering via pairwise correlation and/or variance inflation factors in dataframes with numeric and categorical predictors.
The argument max_cor determines the maximum pairwise correlation allowed in the resulting selection of predictors, while max_vif does the same for variance inflation factors.
The argument preference_order accepts a character vector of predictor names ranked from first to last index, or a dataframe resulting from preference_order(). When two predictors in this vector or dataframe are highly collinear, the one with a lower ranking is removed. This option helps protect predictors of interest. If not provided, predictors are ranked from lower to higher multicollinearity.
Please check the sections Variance Inflation Factors, VIF-based Filtering, and Pairwise Correlation Filtering at the end of this help file for further details.
collinear_select(
df = NULL,
response = NULL,
predictors = NULL,
preference_order = NULL,
max_cor = 0.61,
max_vif = 5,
quiet = FALSE,
...
)
df |
(required; dataframe, tibble, or sf) A dataframe with responses
(optional) and predictors. Must have at least 10 rows for pairwise
correlation analysis, and |
response |
(optional; character or NULL) Name of one response variable in |
predictors |
(optional; character vector or NULL) Names of the
predictors in |
preference_order |
(optional; character vector, dataframe from
|
max_cor |
(optional; numeric or NULL) Maximum correlation allowed between pairs of |
max_vif |
(optional; numeric or NULL) Maximum Variance Inflation Factor allowed for |
quiet |
(optional; logical) If FALSE, messages are printed. Default: FALSE. |
... |
(optional) Internal args (e.g. |
character vector: names of selected predictors
cor_select computes the global correlation matrix, orders
predictors by preference_order or by lower-to-higher summed
correlations, and sequentially selects predictors with pairwise correlations
below max_cor.
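The sequential selection described above can be sketched in base R. This is a minimal illustration, not the package's implementation: the helper name greedy_cor_select is hypothetical, it handles numeric columns only, it uses the built-in mtcars data instead of vi_smol, and it assumes a candidate is kept when its absolute correlation with every already selected predictor does not exceed max_cor.

```r
# Hypothetical helper for illustration; NOT the code behind cor_select().
greedy_cor_select <- function(df, max_cor = 0.7, preference_order = NULL) {
  # absolute pairwise Pearson correlations between numeric columns
  cor_abs <- abs(stats::cor(df, use = "pairwise.complete.obs"))
  diag(cor_abs) <- 0

  if (is.null(preference_order)) {
    # default ranking: from lower to higher summed correlation
    preference_order <- names(sort(colSums(cor_abs)))
  }

  # the top-ranked predictor is always kept
  selected <- preference_order[1]

  for (candidate in preference_order[-1]) {
    # keep the candidate only if it is weakly correlated with
    # every predictor selected so far (assumed rule: <= max_cor)
    if (max(cor_abs[candidate, selected]) <= max_cor) {
      selected <- c(selected, candidate)
    }
  }

  selected
}

numeric_df <- mtcars[, c("mpg", "disp", "hp", "drat", "wt", "qsec")]
selected <- greedy_cor_select(numeric_df, max_cor = 0.7)
selected
```

Passing a character vector to preference_order in this sketch mimics the protective effect described earlier: predictors listed first are tested first, so they are the ones retained when a collinear pair is found.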
VIF for predictor a is computed as 1/(1-R^2), where R^2 is
the multiple R-squared from regressing a on the other predictors.
Commonly recommended maximum values are 2.5, 5, and 10.
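As a worked illustration of the formula 1/(1-R^2), the following base R sketch regresses each predictor on the others and converts the resulting R-squared into a VIF. The helper name vif_from_regression and the choice of mtcars columns are assumptions made for the example. The last lines check the result against a well-known shortcut: for numeric predictors, the VIFs equal the diagonal of the inverse correlation matrix.

```r
# Illustrative data: four numeric columns from the built-in mtcars dataset.
predictors_df <- mtcars[, c("disp", "hp", "wt", "qsec")]

# Hypothetical helper: VIF of `target` from regressing it on all other columns.
vif_from_regression <- function(df, target) {
  others <- setdiff(colnames(df), target)
  model <- stats::lm(stats::reformulate(others, response = target), data = df)
  r_squared <- summary(model)$r.squared
  1 / (1 - r_squared)
}

vif_lm <- vapply(
  colnames(predictors_df),
  function(v) vif_from_regression(predictors_df, v),
  numeric(1)
)

# Equivalent shortcut: diagonal of the inverse of the correlation matrix.
vif_inv <- diag(solve(stats::cor(predictors_df)))

round(vif_lm, 2)
```

A VIF of 1 means the predictor is uncorrelated with the others, and the value grows without bound as R^2 approaches 1.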
vif_select ranks numeric predictors (user preference_order
if provided, otherwise from lower to higher VIF) and sequentially adds
predictors whose VIF against the current selection is below max_vif.
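One possible reading of this sequential procedure is sketched below in base R. It is an assumption-laden illustration rather than the code behind vif_select: the helper names vif_all and sequential_vif_select are hypothetical, the data are numeric columns from mtcars, and a candidate is accepted only when no predictor in the tentative set (current selection plus candidate) has a VIF above max_vif. The package may apply a different acceptance rule.

```r
# Hypothetical helper: VIFs of all columns via the inverse correlation matrix.
vif_all <- function(df) {
  stats::setNames(diag(solve(stats::cor(df))), colnames(df))
}

# Hypothetical helper for illustration; NOT the code behind vif_select().
sequential_vif_select <- function(df, max_vif = 5, preference_order = NULL) {
  if (is.null(preference_order)) {
    # default ranking: from lower to higher VIF
    preference_order <- names(sort(vif_all(df)))
  }

  # the top-ranked predictor is always kept
  selected <- preference_order[1]

  for (candidate in preference_order[-1]) {
    tentative <- c(selected, candidate)
    # assumed rule: accept the candidate only if every VIF in the
    # tentative set stays at or below max_vif
    if (max(vif_all(df[, tentative, drop = FALSE])) <= max_vif) {
      selected <- tentative
    }
  }

  selected
}

numeric_predictors <- mtcars[, c("disp", "hp", "wt", "qsec", "drat")]
vif_selected <- sequential_vif_select(numeric_predictors, max_vif = 5)
vif_selected
```

Checking the whole tentative set, rather than only the candidate, guarantees that every predictor in the final selection respects max_vif, at the cost of recomputing the VIFs at each step.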
Blas M. Benito, PhD
Other multicollinearity_filtering:
collinear(),
cor_select(),
step_collinear(),
vif_select()
data(vi_smol)
## OPTIONAL: parallelization setup
## irrelevant when all predictors are numeric
## only worth it for large data with many categoricals
# future::plan(
# future::multisession,
# workers = future::availableCores() - 1
# )
## OPTIONAL: progress bar
# progressr::handlers(global = TRUE)
x <- collinear_select(
df = vi_smol,
predictors = c(
"koppen_zone", #character
"soil_type", #factor
"topo_elevation", #numeric
"soil_temperature_mean" #numeric
),
max_cor = 0.7,
max_vif = 5
)
x
## OPTIONAL: disable parallelization
#future::plan(future::sequential)