low_predictor_collinearity: low_predictor_collinearity

View source: R/low_predictor_collinearity.R

low_predictor_collinearity    R Documentation

low_predictor_collinearity

Description

Identifies a set of predictors that have low pairwise collinearity among themselves.

Given either a data frame of raw predictor values or a correlation matrix among the predictors, the function aims to remove the minimum number of predictors needed to ensure that all remaining pairwise correlations are below a given threshold.

Usage

low_predictor_collinearity(df = NULL, cor = NULL, threshold = 0.75)

Arguments

df

An optional numeric data frame of predictor variables, without NA values.

cor

An optional matrix of pairwise correlations among the predictor variables.

threshold

A numeric value that sets the correlation threshold: pairs of predictors whose correlation is above this value are run through the algorithm. The default is 0.75.
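The two input forms described above can be prepared with base R. The following is a minimal sketch, assuming the built-in mtcars data set stands in for a predictor table; the column choice and the object names predictors_df and predictors_cor are illustrative only. The commented-out calls show how either object might be passed, on the assumption that the cor argument accepts the output of stats::cor().

```r
# mtcars ships with base R; the columns chosen here are purely illustrative.
predictors_df <- mtcars[, c("disp", "hp", "wt", "qsec")]

# Confirm the inputs are numeric and free of NA values, as the arguments require.
stopifnot(all(vapply(predictors_df, is.numeric, logical(1))))
stopifnot(!anyNA(predictors_df))

# Pairwise Pearson correlations; row and column names are the predictor names.
predictors_cor <- stats::cor(predictors_df)

# Assumption: either object could then be supplied to the function, e.g.
# low_predictor_collinearity(df = predictors_df, threshold = 0.75)
# low_predictor_collinearity(cor = predictors_cor, threshold = 0.75)
```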

Details

The function was inspired by "Applied Predictive Modeling" (Kuhn and Johnson), page 47.

Note that predictors of the data frame must all be numeric without NA values.

The algorithm proceeds as follows:

1. Create a starting list of all the candidate predictors.

2. Create a second list of the pairs of predictors whose correlations are above the given threshold, ordered from the highest correlation to the lowest.

3. For each pair of predictors (call them A and B) in the ordered list, determine the average correlation between predictor A and the other predictors. Do the same for predictor B.

4. If A has the larger absolute average correlation, remove it from the ordered list and from the starting list created in step 1; otherwise remove predictor B.

5. Repeat steps 3-4 through the entire ordered list of correlations defined in step 2, removing potential predictors from the starting list created in step 1.

6. The predictors left in the starting list are identified as having a low level of collinearity.
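The steps above can be sketched in base R. This is not the package's implementation: sketch_low_collinearity is a hypothetical helper, and it makes two assumptions the text leaves open, namely that the threshold is compared against absolute correlations and that "the other predictors" in step 3 means the predictors still remaining in the starting list.

```r
# Hypothetical helper, NOT the package's code: a sketch of the removal heuristic.
sketch_low_collinearity <- function(cor_mat, threshold = 0.75) {
  # Step 1: start with every candidate predictor.
  keep <- colnames(cor_mat)

  # Step 2: pairs with absolute correlation above the threshold (assumption),
  # ordered from highest to lowest.
  abs_cor <- abs(cor_mat)
  idx <- which(upper.tri(abs_cor) & abs_cor > threshold, arr.ind = TRUE)
  if (nrow(idx) == 0) return(keep)
  idx <- idx[order(abs_cor[idx], decreasing = TRUE), , drop = FALSE]

  for (k in seq_len(nrow(idx))) {
    a <- rownames(cor_mat)[idx[k, 1]]
    b <- colnames(cor_mat)[idx[k, 2]]
    # Skip pairs already resolved by an earlier removal.
    if (!(a %in% keep) || !(b %in% keep)) next

    # Step 3: mean absolute correlation with the other remaining predictors.
    mean_a <- mean(abs_cor[a, setdiff(keep, a)])
    mean_b <- mean(abs_cor[b, setdiff(keep, b)])

    # Step 4: remove A if its average is larger; otherwise remove B.
    keep <- setdiff(keep, if (mean_a > mean_b) a else b)
  }

  # Step 6: whatever remains is the low-collinearity set.
  keep
}

# A small made-up correlation matrix for illustration.
vars <- c("x1", "x2", "x3")
cor_mat <- matrix(
  c(1.0, 0.9, 0.2,
    0.9, 1.0, 0.5,
    0.2, 0.5, 1.0),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)

sketch_low_collinearity(cor_mat, threshold = 0.75)
# returns c("x1", "x3")
```

Only the pair (x1, x2) exceeds the threshold here. The average absolute correlation of x2 with the others (0.70) is larger than that of x1 (0.55), so x2 is the one removed.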

Value

Returns a named list with:

  1. "predictors": A character vector with the names of the predictors that have low pairwise collinearity among themselves.

  2. "correlations": The correlation matrix restricted to the selected predictors.

  3. "max_correlation": The maximum correlation among all pairs of the selected predictors.
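To make the shape of the returned list concrete, the sketch below assembles the three elements by hand from a made-up correlation matrix and a hypothetical selection of predictors. It does not call the package; in particular, taking the maximum over absolute off-diagonal values is an assumption, since the text does not say whether the sign is kept.

```r
# Made-up correlation matrix and a hypothetical selection, for illustration only.
vars <- c("x1", "x2", "x3")
cor_mat <- matrix(
  c(1.0, 0.9, 0.2,
    0.9, 1.0, 0.5,
    0.2, 0.5, 1.0),
  nrow = 3, byrow = TRUE, dimnames = list(vars, vars)
)
selected <- c("x1", "x3")

# Correlation matrix restricted to the selected predictors.
selected_cor <- cor_mat[selected, selected, drop = FALSE]

# Largest off-diagonal correlation (absolute value is an assumption).
max_cor <- max(abs(selected_cor[upper.tri(selected_cor)]))

result <- list(
  predictors = selected,
  correlations = selected_cor,
  max_correlation = max_cor
)
str(result)
```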

Examples

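library(data.table)
library(RregressPkg)

# Drop the response column "BP" so that only the predictors remain
bloodpress_predictors_dt <- RregressPkg::bloodpress[, !c("BP")]

# Select predictors using the default threshold of 0.75
low_collinearity_lst <- RregressPkg::low_predictor_collinearity(
  df = bloodpress_predictors_dt
)

# Names of the predictors that were retained
low_collinearity_lst$predictors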


deandevl/RregressPkg documentation built on Feb. 5, 2025, 12:11 p.m.