Correlation Subset Selection with corrselect"

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(corrselect)

Overview

corrselect identifies all maximal subsets of variables whose pairwise correlations stay below a chosen threshold. This process reduces multicollinearity and redundancy before modeling, while preserving interpretability. Unlike greedy or stepwise approaches, corrselect exhaustively searches for all valid subsets using fast, exact algorithms. It is fully model-agnostic, making it suitable as a preprocessing step for regression, clustering, feature selection, and other analyses.

Given a threshold ( t \in (0,1) ), the functions corrSelect() (data-frame interface) and MatSelect() (matrix interface) enumerate all maximal subsets ( S ) of variables satisfying:

[ \forall i, j \in S,\ i \neq j: \ |r_{ij}| < t ]

where (r_{ij}) denotes the chosen correlation measure between variables (i) and (j). Enumeration relies on two exact graph-theoretic algorithms:

  1. Eppstein–Löffler–Strash (ELS), a degeneracy-ordered backtracking algorithm optimized for sparse graphs.
  2. Bron–Kerbosch (BK), a classical recursive clique-finding method, with optional pivoting to reduce search space.

Results are returned as a CorrCombo S4 object containing each subset’s variable names and summary statistics (avg_corr, min_corr, max_corr). You can then extract subsets from the original data via corrSubset(). Because the procedure does not depend on any downstream model, it cleanly separates “feature curation” from “model fitting” and supports multiple correlation measures (pearson, spearman, kendall, bicor, distance, maximal).


Quick Start (CorrSelect)

Simulated numeric example

set.seed(42)
n <- 100
df <- data.frame(
  A = rnorm(n),
  B = rnorm(n),
  C = rnorm(n),
  D = rnorm(n),
  E = rnorm(n)
)
df$F <- df$A * 0.9 + rnorm(n, sd = 0.1)  # strongly correlated with A

Basic selection

res <- corrSelect(df, threshold = 0.7)
res
as.data.frame(res)
corrSubset(res, df, which = 1)[1:10,]

Forcing variables into all subsets

res2 <- corrSelect(df, threshold = 0.7, force_in = "A")
res2

Using a different correlation method

res3 <- corrSelect(df, threshold = 0.6, cor_method = "spearman")
res3

Matrix Interface (MatSelect)

If you already computed a correlation matrix or want to apply the method to precomputed correlations:

mat <- cor(df)
res4 <- MatSelect(mat, threshold = 0.7)
res4

Selecting subsets:

MatSelect(mat, threshold = 0.5)

Force variable 1 into every subset:

MatSelect(mat, threshold = 0.5, force_in = 1)

Mixed Data Types (assocSelect)

df_ass <- data.frame(
  height = rnorm(15, 170, 10),
  weight = rnorm(15, 70, 12),
  group  = factor(rep(LETTERS[1:3], each = 5)),
  score  = ordered(sample(c("low","med","high"), 15, TRUE))
)

# keep every subset whose internal associations ≤ 0.6
res5 <- assocSelect(df_ass, threshold = 0.6)
res5

Changing Correlation Method

By default, corrSelect() uses Pearson correlation. You can choose alternatives with the cor_method argument:

Example:

res6 <- corrSelect(df, threshold = 0.7, cor_method = "spearman")
res6

Handling Mixed Data Types

The function assocSelect() extends corrSelect() to support mixed data types — including numeric, factor, and ordered variables — by using appropriate association measures for each variable pair.

Instead of a single correlation matrix, it constructs a generalized association matrix using the following logic:

| Variable 1 | Variable 2 | Method Used | |------------------|------------------|----------------| | numeric | numeric | pearson (default; customizable) | | numeric | factor | eta | | numeric | ordered | spearman (default; customizable) | | factor | factor | cramersv | | factor | ordered | cramersv | | ordered | ordered | spearman (default; customizable) |

The defaults for numeric-numeric, numeric-ordered, and ordered-ordered associations can be changed via arguments:

assocSelect(df_ass,
  method_num_num = "kendall",
  method_num_ord = "spearman",
  method_ord_ord = "kendall"
)

All other combinations use fixed methods (eta or cramersv) appropriate for measuring association strength.

Example with Mixed Types

df_ass <- data.frame(
  height = rnorm(10),
  weight = rnorm(10),
  group  = factor(sample(c("A", "B"), 10, replace = TRUE)),
  score  = ordered(sample(1:3, 10, replace = TRUE))
)

res7 <- assocSelect(df_ass, threshold = 1, method = "bron-kerbosch", use_pivot = TRUE)
res7

Each pairwise association is bounded to [0,1] and treated analogously to correlation.


Theory

Given a symmetric correlation matrix ( R \in \mathbb{R}^{p \times p} ), we seek all maximal subsets ( S \subseteq {1, \dots, p} ) such that:

[ \forall i, j \in S,\ i \neq j: \ |R_{ij}| < t ]

for a fixed threshold ( t \in (0, 1) ).

This is equivalent to finding all maximal cliques in the thresholded correlation graph, where:

A maximal clique corresponds to a variable subset that cannot be extended without violating the correlation limit.


Algorithms

ELS (Eppstein–Löffler–Strash)

The ELS algorithm efficiently enumerates all maximal cliques in a sparse graph using degeneracy ordering:

  1. Compute a degeneracy ordering ( v_1, \dots, v_p ).
  2. For each ( i ), extend current clique ( S ) from ( {v_i} ) within candidate set ( C = {v_{i+1}, \dots, v_p} ).
  3. Recursively build cliques, pruning when no further vertices can be added.

Formally, define:

[ \text{extend}(S, C) = \begin{cases} S, & C = \emptyset, \ \bigcup_{v \in C} \text{extend}(S \cup {v},\ C \setminus (N(v) \cup {v})), & \text{otherwise}. \end{cases} ]

ELS avoids redundant exploration, achieving good performance on typical correlation graphs.

Bron–Kerbosch (with Pivoting)

The classical Bron–Kerbosch algorithm enumerates maximal cliques via recursive backtracking with optional pivoting:

Let ( R ) = current clique, ( P ) = prospective nodes, ( X ) = excluded nodes. Then:

[ \text{BK}(R, P, X) = \begin{cases} \text{report}(R), & P = X = \emptyset, \ \text{for each } v \in P \setminus N(u): \ \quad \text{BK}(R \cup {v},\ P \cap N(v),\ X \cap N(v)), \ \quad P \leftarrow P \setminus {v},\ X \leftarrow X \cup {v}. \end{cases} ]

Choosing a pivot ( u \in P \cup X ) and iterating over ( P \setminus N(u) ) reduces recursive calls.


Why corrselect?

Most existing R tools:

corrselect uniquely provides:

This makes it ideal for pipelines where interpretability and completeness are essential.


Inspecting Results

Convert results for downstream use:

df_res <- as.data.frame(res)
head(df_res)

Extract individual subsets:

lapply(corrSubset(res, df, which = 1:2), function(x) head(x, 10))

Summarize correlation metrics:

# Number and size of subsets
length(res@subset_list)
summary(lengths(res@subset_list))

# Summaries of within-subset correlations
summary(res@max_corr)
summary(res@avg_corr)

CorrCombo Object Structure

A CorrCombo S4 object contains:

Inspect slots:

str(res@subset_list)

Session Info

sessionInfo()


Try the corrselect package in your browser

Any scripts or data that you put into this service are public.

corrselect documentation built on Sept. 9, 2025, 5:52 p.m.