knitr::opts_chunk$set(collapse = TRUE, comment = "#>") library(corrselect)
corrselect
identifies all maximal subsets of variables whose pairwise correlations stay below a chosen threshold. This process reduces multicollinearity and redundancy before modeling, while preserving interpretability. Unlike greedy or stepwise approaches, corrselect
exhaustively searches for all valid subsets using fast, exact algorithms. It is fully model-agnostic, making it suitable as a preprocessing step for regression, clustering, feature selection, and other analyses.
Given a threshold ( t \in (0,1) ), the functions corrSelect()
(data-frame interface) and MatSelect()
(matrix interface) enumerate all maximal subsets ( S ) of variables satisfying:
[ \forall i, j \in S,\ i \neq j: \ |r_{ij}| < t ]
where (r_{ij}) denotes the chosen correlation measure between variables (i) and (j). Enumeration relies on two exact graph-theoretic algorithms:
Results are returned as a CorrCombo
S4 object containing each subset’s variable names and summary statistics (avg_corr
, min_corr
, max_corr
). You can then extract subsets from the original data via corrSubset()
. Because the procedure does not depend on any downstream model, it cleanly separates “feature curation” from “model fitting” and supports multiple correlation measures (pearson
, spearman
, kendall
, bicor
, distance
, maximal
).
CorrSelect
)set.seed(42) n <- 100 df <- data.frame( A = rnorm(n), B = rnorm(n), C = rnorm(n), D = rnorm(n), E = rnorm(n) ) df$F <- df$A * 0.9 + rnorm(n, sd = 0.1) # strongly correlated with A
res <- corrSelect(df, threshold = 0.7) res as.data.frame(res)
corrSubset(res, df, which = 1)[1:10,]
res2 <- corrSelect(df, threshold = 0.7, force_in = "A") res2
res3 <- corrSelect(df, threshold = 0.6, cor_method = "spearman") res3
MatSelect
)If you already computed a correlation matrix or want to apply the method to precomputed correlations:
mat <- cor(df) res4 <- MatSelect(mat, threshold = 0.7) res4
Selecting subsets:
MatSelect(mat, threshold = 0.5)
Force variable 1 into every subset:
MatSelect(mat, threshold = 0.5, force_in = 1)
assocSelect
)df_ass <- data.frame( height = rnorm(15, 170, 10), weight = rnorm(15, 70, 12), group = factor(rep(LETTERS[1:3], each = 5)), score = ordered(sample(c("low","med","high"), 15, TRUE)) ) # keep every subset whose internal associations ≤ 0.6 res5 <- assocSelect(df_ass, threshold = 0.6) res5
By default, corrSelect()
uses Pearson correlation. You can choose alternatives with the cor_method
argument:
"pearson"
: linear correlation (default) "spearman"
: rank-based monotonic association "kendall"
: Kendall’s tau "bicor"
: robust biweight midcorrelation (WGCNA::bicor
) "distance"
: distance correlation (energy::dcor
) "maximal"
: maximal information coefficient (minerva::mine
) Example:
res6 <- corrSelect(df, threshold = 0.7, cor_method = "spearman") res6
The function assocSelect()
extends corrSelect()
to support mixed data types — including numeric, factor, and ordered variables — by using appropriate association measures for each variable pair.
Instead of a single correlation matrix, it constructs a generalized association matrix using the following logic:
| Variable 1 | Variable 2 | Method Used |
|------------------|------------------|----------------|
| numeric | numeric | pearson
(default; customizable) |
| numeric | factor | eta
|
| numeric | ordered | spearman
(default; customizable) |
| factor | factor | cramersv
|
| factor | ordered | cramersv
|
| ordered | ordered | spearman
(default; customizable) |
The defaults for numeric-numeric, numeric-ordered, and ordered-ordered associations can be changed via arguments:
assocSelect(df_ass, method_num_num = "kendall", method_num_ord = "spearman", method_ord_ord = "kendall" )
All other combinations use fixed methods (eta
or cramersv
) appropriate for measuring association strength.
df_ass <- data.frame( height = rnorm(10), weight = rnorm(10), group = factor(sample(c("A", "B"), 10, replace = TRUE)), score = ordered(sample(1:3, 10, replace = TRUE)) ) res7 <- assocSelect(df_ass, threshold = 1, method = "bron-kerbosch", use_pivot = TRUE) res7
Each pairwise association is bounded to [0,1] and treated analogously to correlation.
Given a symmetric correlation matrix ( R \in \mathbb{R}^{p \times p} ), we seek all maximal subsets ( S \subseteq {1, \dots, p} ) such that:
[ \forall i, j \in S,\ i \neq j: \ |R_{ij}| < t ]
for a fixed threshold ( t \in (0, 1) ).
This is equivalent to finding all maximal cliques in the thresholded correlation graph, where:
A maximal clique corresponds to a variable subset that cannot be extended without violating the correlation limit.
The ELS algorithm efficiently enumerates all maximal cliques in a sparse graph using degeneracy ordering:
Formally, define:
[ \text{extend}(S, C) = \begin{cases} S, & C = \emptyset, \ \bigcup_{v \in C} \text{extend}(S \cup {v},\ C \setminus (N(v) \cup {v})), & \text{otherwise}. \end{cases} ]
ELS avoids redundant exploration, achieving good performance on typical correlation graphs.
The classical Bron–Kerbosch algorithm enumerates maximal cliques via recursive backtracking with optional pivoting:
Let ( R ) = current clique, ( P ) = prospective nodes, ( X ) = excluded nodes. Then:
[ \text{BK}(R, P, X) = \begin{cases} \text{report}(R), & P = X = \emptyset, \ \text{for each } v \in P \setminus N(u): \ \quad \text{BK}(R \cup {v},\ P \cap N(v),\ X \cap N(v)), \ \quad P \leftarrow P \setminus {v},\ X \leftarrow X \cup {v}. \end{cases} ]
Choosing a pivot ( u \in P \cup X ) and iterating over ( P \setminus N(u) ) reduces recursive calls.
Most existing R tools:
findCorrelation
) corrselect uniquely provides:
CorrCombo
objects This makes it ideal for pipelines where interpretability and completeness are essential.
Convert results for downstream use:
df_res <- as.data.frame(res) head(df_res)
Extract individual subsets:
lapply(corrSubset(res, df, which = 1:2), function(x) head(x, 10))
Summarize correlation metrics:
# Number and size of subsets length(res@subset_list) summary(lengths(res@subset_list)) # Summaries of within-subset correlations summary(res@max_corr) summary(res@avg_corr)
A CorrCombo
S4 object contains:
subset_list
: list of character vectors (variable names) avg_corr
, min_corr
, max_corr
: numeric vectors of correlation metrics threshold
, forced_in
, search_type
, cor_method
, n_rows_used
use_pivot
(if applicable) Inspect slots:
str(res@subset_list)
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.