| collinear | R Documentation |
Automates multicollinearity management in datasets with mixed variable types (numeric, categorical, and logical) through an integrated system of five key components:
When responses is numeric, categorical predictors can be converted
to numeric using response values as reference. This enables VIF and
correlation analysis across mixed types. See target_encoding_lab.
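As a rough illustration of the idea (not the implementation in target_encoding_lab), the base R sketch below replaces each level of a categorical predictor with the mean response within that level. The toy data frame and the column names y, soil, and soil_encoded are made up for this example.

```r
# Toy data: numeric response `y` and categorical predictor `soil` (hypothetical names)
df <- data.frame(
  y = c(1, 2, 3, 10, 11, 12),
  soil = c("clay", "clay", "clay", "sand", "sand", "sand")
)

# Mean encoding: each level is replaced by the mean response of its group
df$soil_encoded <- ave(df$y, df$soil, FUN = mean)

# The encoded column is numeric, so Pearson correlation and VIF can use it
cor(df$y, df$soil_encoded)
```

Plain mean encoding like this can leak information from the response into the predictor, which is presumably why the "loo" (leave-one-out) method is also offered.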
Three prioritization strategies ensure the most relevant predictors are retained during filtering:
User-defined ranking (argument preference_order):
Accepts a character vector of predictor names or a dataframe from
preference_order. Lower-ranked collinear predictors are removed.
Response-based ranking (f):
Uses f_auto, f_numeric_glm, or
f_binomial_rf to rank predictors by association with
the response. Supports cross-validation via preference_order.
Multicollinearity-based ranking (default):
When both preference_order and f are NULL,
predictors are ranked from lower to higher multicollinearity.
Computes pairwise correlations between variable types using Pearson
(numeric–numeric), target encoding (numeric–categorical), and Cramer's V
(categorical–categorical). See cor_df, cor_matrix,
and cor_cramer.
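For orientation, here is a minimal base R sketch of the two simplest measures named above. The helper cramers_v is hypothetical and computes the uncorrected textbook statistic; cor_cramer may differ, for example by applying a bias correction.

```r
# Uncorrected Cramer's V from a contingency table (hypothetical helper)
cramers_v <- function(x, y) {
  tab <- table(x, y)
  # Pearson chi-squared statistic without continuity correction
  chi2 <- suppressWarnings(chisq.test(tab, correct = FALSE)$statistic)
  n <- sum(tab)
  k <- min(nrow(tab), ncol(tab)) - 1
  unname(sqrt(chi2 / (n * k)))
}

# Categorical-categorical pair: two discrete columns of a built-in dataset
cramers_v(mtcars$cyl, mtcars$gear)

# Numeric-numeric pair: Pearson correlation
cor(mtcars$mpg, mtcars$wt, method = "pearson")
```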
When max_cor and max_vif are both NULL, thresholds
are derived from the distribution of pairwise correlations among the
predictors (its 75th percentile).
Combines two complementary methods while respecting predictor rankings:
Pairwise Correlation Filtering:
Removes predictors with Pearson correlation or Cramer's V above
max_cor. See cor_select.
VIF-based Filtering:
Removes numeric predictors with VIF above max_vif. See
vif_select, vif_df, and vif.
This function supports parallelization via future::plan() and progress
bars via progressr::handlers(). Parallelization speeds up
target_encoding_lab, preference_order, and
cor_select.
collinear(
  df = NULL,
  responses = NULL,
  predictors = NULL,
  encoding_method = NULL,
  preference_order = NULL,
  f = f_auto,
  max_cor = NULL,
  max_vif = NULL,
  quiet = FALSE,
  ...
)
df |
(required; dataframe, tibble, or sf) A dataframe with responses
(optional) and predictors. Must have at least 10 rows for pairwise
correlation analysis, and |
responses |
(optional; character, character vector, or NULL) Name of
one or several response variables in |
predictors |
(optional; character vector or NULL) Names of the
predictors in |
encoding_method |
(optional; character or NULL) One of "loo", "mean", or "rank". If NULL, target encoding is disabled. Default: NULL. |
preference_order |
(optional; character vector, dataframe from
|
f |
(optional; unquoted function name or NULL) Function to rank
predictors by relationship with |
max_cor |
(optional; numeric or NULL) Maximum allowed pairwise
correlation (0.01–0.99). Recommended between 0.5 and 0.9. If NULL and
|
max_vif |
(optional; numeric or NULL) Maximum allowed VIF. Recommended
between 2.5 and 10. If NULL and |
quiet |
(optional; logical) If FALSE, messages are printed. Default: FALSE. |
... |
(optional) Internal args (e.g. |
A list of class collinear_output with sublists of class
collinear_selection. If responses = NULL, a single sublist
named "result" is returned; otherwise, one sublist per response is returned.
When both max_cor and max_vif are NULL, the function
determines thresholds as follows:
Compute the 75th percentile of pairwise correlations via
cor_stats.
Map that value through a sigmoid between 0.545 (VIF~2.5) and 0.785
(VIF~7.5), centered at 0.665, to get max_cor.
Compute max_vif from max_cor using
gam_cor_to_vif.
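The first two steps can be sketched in base R as follows. The bounds and center come from the description above; the function name adaptive_max_cor, the steepness value, and the use of absolute correlations are assumptions made for illustration. The third step is not reproduced because it depends on the model behind gam_cor_to_vif.

```r
# Logistic mapping from the 75th percentile of pairwise correlations to max_cor.
# `steepness` is a hypothetical value, not taken from the package.
adaptive_max_cor <- function(q75, lower = 0.545, upper = 0.785,
                             center = 0.665, steepness = 10) {
  lower + (upper - lower) / (1 + exp(-steepness * (q75 - center)))
}

# 75th percentile of absolute pairwise correlations among numeric predictors
m <- abs(cor(mtcars[, c("disp", "hp", "wt", "drat", "qsec")]))
q75 <- unname(quantile(m[upper.tri(m)], probs = 0.75))

adaptive_max_cor(q75)
```

Under this mapping, weakly correlated datasets get a max_cor near 0.545 and strongly correlated ones approach 0.785, so the threshold never leaves that interval.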
VIF for predictor a is computed as 1/(1-R^2), where R^2 is
the multiple R-squared from regressing a on the other predictors.
Commonly recommended maximum values are 2.5, 5, and 10.
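The formula can be checked directly in base R. The sketch below uses a hypothetical helper, vif_one, and a few columns of the built-in mtcars dataset; it is not the package's vif function. It also shows the equivalent shortcut of reading VIFs off the diagonal of the inverse correlation matrix.

```r
# VIF of one predictor: regress it on the others and apply 1 / (1 - R^2)
vif_one <- function(df, target) {
  others <- setdiff(names(df), target)
  fit <- lm(reformulate(others, response = target), data = df)
  1 / (1 - summary(fit)$r.squared)
}

predictors <- mtcars[, c("disp", "hp", "wt", "drat")]
vif_lm <- sapply(names(predictors), function(p) vif_one(predictors, p))

# Equivalent shortcut: diagonal of the inverse correlation matrix
vif_inv <- diag(solve(cor(predictors)))

round(cbind(vif_lm, vif_inv), 3)
```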
vif_select ranks numeric predictors (user preference_order
if provided, otherwise from lower to higher VIF) and sequentially adds
predictors whose VIF against the current selection is below max_vif.
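A simplified sketch of this forward selection, for numeric data without perfectly collinear columns, might look like the following. The function vif_select_sketch is hypothetical and only mirrors the description above; the actual vif_select may handle edge cases and rankings differently.

```r
vif_select_sketch <- function(df, max_vif = 5, preference_order = NULL) {
  if (is.null(preference_order)) {
    # Default ranking: lower to higher VIF computed on all predictors
    vif_all <- setNames(diag(solve(cor(df))), names(df))
    preference_order <- names(sort(vif_all))
  }
  selected <- preference_order[1]
  for (candidate in preference_order[-1]) {
    trial <- c(selected, candidate)
    # VIF of the candidate given the predictors already selected
    candidate_vif <- diag(solve(cor(df[, trial, drop = FALSE])))[length(trial)]
    if (candidate_vif < max_vif) selected <- trial
  }
  selected
}

predictors <- mtcars[, c("mpg", "disp", "hp", "wt", "drat", "qsec")]
vif_select_sketch(predictors, max_vif = 2.5)
```

Because the first-ranked predictor is always kept, the ranking decides which member of a collinear group survives.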
cor_select computes the global correlation matrix, orders
predictors by preference_order or by lower-to-higher summed
correlations, and sequentially selects predictors with pairwise correlations
below max_cor.
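The same logic can be sketched for numeric predictors with Pearson correlation only. The function cor_select_sketch is hypothetical and omits the Cramer's V and target encoding paths that cor_select supports.

```r
cor_select_sketch <- function(df, max_cor = 0.7, preference_order = NULL) {
  m <- abs(cor(df))
  diag(m) <- 0
  if (is.null(preference_order)) {
    # Default ranking: lower to higher summed correlation with the others
    preference_order <- names(sort(colSums(m)))
  }
  selected <- preference_order[1]
  for (candidate in preference_order[-1]) {
    # Keep the candidate only if it is weakly correlated with every kept predictor
    if (all(m[candidate, selected] < max_cor)) {
      selected <- c(selected, candidate)
    }
  }
  selected
}

predictors <- mtcars[, c("mpg", "disp", "hp", "wt", "drat", "qsec")]
kept <- cor_select_sketch(predictors, max_cor = 0.7)
kept
```

By construction, every pair of retained predictors has an absolute correlation below max_cor.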
Belsley, D.A., Kuh, E., Welsch, R.E. (1980). Regression Diagnostics: Identifying Influential Data and Sources of Collinearity. John Wiley & Sons. DOI: 10.1002/0471725153.
Micci-Barreca, D. (2001). A Preprocessing Scheme for High-Cardinality Categorical Attributes in Classification and Prediction Problems. SIGKDD Explor. Newsl. 3, 1, 27-32. DOI: 10.1145/507533.507538.
Other multicollinearity_filtering:
collinear_select(),
cor_select(),
step_collinear(),
vif_select()
data(vi_smol, vi_predictors_numeric)
x <- collinear(df = vi_smol[, vi_predictors_numeric])