corr_rm | R Documentation |
Remove highly correlated variables from a data frame to reduce pair-wise redundancy and mitigate multicollinearity issues in predictive models. This preprocessing step is especially useful when the goal is prediction rather than interpretation, because hypotheses about each individual predictor are not the primary concern.
For example, in a genomic prediction study the authors removed highly correlated SNPs to avoid redundant information when working with thousands of markers, improving training efficiency and predictive performance (Wimmer et al. 2021).
In the paper "A Proposed Data Analytics Workflow and Example Using the R Caret Package" (Jones et al. 2018), this filtering step is applied before model training, demonstrating how the core function caret::findCorrelation can be used to identify and remove highly correlated variable pairs.
Note that while high correlation can bias methods such as clustering toward redundant variables, it is much less problematic for tree-based learners.
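To make the filtering idea concrete, the following is a minimal base R sketch of a greedy correlation filter. It is not the implementation of corr_rm or of caret::findCorrelation. The helper name drop_correlated is hypothetical. The sketch assumes Pearson correlation on numeric columns only, and that no column is constant, so the correlation matrix contains no NA values.

```r
# Hypothetical helper, for illustration only: a simplified greedy filter.
# Assumes numeric columns with non-zero variance (no NA in the correlations).
drop_correlated <- function(df, cutoff = 0.75) {
  num <- df[vapply(df, is.numeric, logical(1))]  # keep numeric columns only
  cm <- abs(stats::cor(num))                     # absolute Pearson correlations
  diag(cm) <- 0                                  # ignore self-correlation
  drop <- character(0)
  repeat {
    keep <- setdiff(colnames(cm), drop)
    sub <- cm[keep, keep, drop = FALSE]
    if (length(keep) < 2 || max(sub) <= cutoff) break
    # Locate the most strongly correlated pair still in play
    idx <- which(sub == max(sub), arr.ind = TRUE)[1, ]
    pair <- keep[idx]
    # Of that pair, drop the variable with the larger mean absolute correlation
    avg <- rowMeans(sub[pair, , drop = FALSE])
    drop <- c(drop, pair[which.max(avg)])
  }
  # Non-numeric columns are returned untouched
  df[setdiff(names(df), drop)]
}

filtered <- drop_correlated(iris, cutoff = 0.75)
names(filtered)
```

Dropping the member of each pair with the larger mean absolute correlation follows the heuristic described in the caret documentation for findCorrelation; the loop here is a simplification and may select different columns than caret does on other data.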
corr_rm(df, c, ...)
## S3 method for class 'clist'
corr_rm(
df,
c,
col = c("infer.value", "stat.value"),
isig = TRUE,
cutoff = 0.75,
...
)
## S3 method for class 'list'
corr_rm(
df,
c,
col = c("infer.value", "stat.value"),
isig = TRUE,
cutoff = 0.75,
...
)
## S3 method for class 'cmatrix'
corr_rm(df, c, cutoff = 0.75, ...)
## S3 method for class 'matrix'
corr_rm(df, c, cutoff = 0.75, ...)
df |
[ |
c |
[ |
... |
Additional arguments passed to the |
col |
[ |
isig |
[ |
cutoff |
[ |
data.frame
A filtered version of df with highly correlated variables removed.
Igor D.S. Siciliani, Paulo H. dos Santos
Wimmer, V.; Albrecht, T.; Auinger, H.-J.; Schön, C.-C. (2021). Genomic prediction studies in plants and animals: Removing highly correlated SNPs to reduce redundancy. PLoS Genetics, 17(3), e1009243. URL: https://doi.org/10.3389/fgene.2021.611506
Jones, S.; Ye, Z.; Xie, Z.; Root, C.; Prasutchai, T.; Anderson, J.; Roggenburg, M.; Lanham, M. A. (2018). A Proposed Data Analytics Workflow and Example Using the R Caret Package. Midwest Decision Sciences Institute (MWDSI) Conference. URL: https://www.matthewalanham.com/Students/2018_MWDSI_R%20caret%20paper.pdf
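# Compute pairwise correlations for iris (returns a 'clist' object)
iris_clist <- corrp(iris)
# Convert the correlation list into a correlation matrix ('cmatrix' object)
iris_cmatrix <- corr_matrix(iris_clist)
# Filter using the 'clist' method
corr_rm(df = iris, c = iris_clist, cutoff = 0.75, col = "infer.value", isig = FALSE)
# Filter using the 'cmatrix' method; col and isig are not formal arguments of this method and fall into ...
corr_rm(df = iris, c = iris_cmatrix, cutoff = 0.75, col = "infer.value", isig = FALSE)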