Home

/

CRAN

/

corrselect

/

Correlation Subset Selection with corrselect"

Correlation Subset Selection with corrselect"
In corrselect: Correlation-Based Variable Subset Selection

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(corrselect)

Overview

corrselect identifies all maximal subsets of variables whose pairwise correlations stay below a chosen threshold. This process reduces multicollinearity and redundancy before modeling, while preserving interpretability. Unlike greedy or stepwise approaches, corrselect exhaustively searches for all valid subsets using fast, exact algorithms. It is fully model-agnostic, making it suitable as a preprocessing step for regression, clustering, feature selection, and other analyses.

Given a threshold ( t \in (0,1) ), the functions corrSelect() (data-frame interface) and MatSelect() (matrix interface) enumerate all maximal subsets ( S ) of variables satisfying:

[ \forall i, j \in S,\ i \neq j: \ |r_{ij}| < t ]

where (r_{ij}) denotes the chosen correlation measure between variables (i) and (j). Enumeration relies on two exact graph-theoretic algorithms:

Eppstein–Löffler–Strash (ELS), a degeneracy-ordered backtracking algorithm optimized for sparse graphs.
Bron–Kerbosch (BK), a classical recursive clique-finding method, with optional pivoting to reduce search space.

Results are returned as a CorrCombo S4 object containing each subset’s variable names and summary statistics (avg_corr, min_corr, max_corr). You can then extract subsets from the original data via corrSubset(). Because the procedure does not depend on any downstream model, it cleanly separates “feature curation” from “model fitting” and supports multiple correlation measures (pearson, spearman, kendall, bicor, distance, maximal).

Quick Start (`CorrSelect`)

Simulated numeric example

set.seed(42)
n <- 100
df <- data.frame(
  A = rnorm(n),
  B = rnorm(n),
  C = rnorm(n),
  D = rnorm(n),
  E = rnorm(n)
)
df$F <- df$A * 0.9 + rnorm(n, sd = 0.1)  # strongly correlated with A

Basic selection

res <- corrSelect(df, threshold = 0.7)
res
as.data.frame(res)

corrSubset(res, df, which = 1)[1:10,]

Forcing variables into all subsets

res2 <- corrSelect(df, threshold = 0.7, force_in = "A")
res2

Using a different correlation method

res3 <- corrSelect(df, threshold = 0.6, cor_method = "spearman")
res3

Matrix Interface (`MatSelect`)

If you already computed a correlation matrix or want to apply the method to precomputed correlations:

mat <- cor(df)
res4 <- MatSelect(mat, threshold = 0.7)
res4

Selecting subsets:

MatSelect(mat, threshold = 0.5)

Force variable 1 into every subset:

MatSelect(mat, threshold = 0.5, force_in = 1)

Mixed Data Types (`assocSelect`)

df_ass <- data.frame(
  height = rnorm(15, 170, 10),
  weight = rnorm(15, 70, 12),
  group  = factor(rep(LETTERS[1:3], each = 5)),
  score  = ordered(sample(c("low","med","high"), 15, TRUE))
)

# keep every subset whose internal associations ≤ 0.6
res5 <- assocSelect(df_ass, threshold = 0.6)
res5

Changing Correlation Method

By default, corrSelect() uses Pearson correlation. You can choose alternatives with the cor_method argument:

"pearson": linear correlation (default)
"spearman": rank-based monotonic association
"kendall": Kendall’s tau
"bicor": robust biweight midcorrelation (WGCNA::bicor)
"distance": distance correlation (energy::dcor)
"maximal": maximal information coefficient (minerva::mine)

Example:

res6 <- corrSelect(df, threshold = 0.7, cor_method = "spearman")
res6

Handling Mixed Data Types

The function assocSelect() extends corrSelect() to support mixed data types — including numeric, factor, and ordered variables — by using appropriate association measures for each variable pair.

Instead of a single correlation matrix, it constructs a generalized association matrix using the following logic:

| Variable 1 | Variable 2 | Method Used | |------------------|------------------|----------------| | numeric | numeric | pearson (default; customizable) | | numeric | factor | eta | | numeric | ordered | spearman (default; customizable) | | factor | factor | cramersv | | factor | ordered | cramersv | | ordered | ordered | spearman (default; customizable) |

The defaults for numeric-numeric, numeric-ordered, and ordered-ordered associations can be changed via arguments:

assocSelect(df_ass,
  method_num_num = "kendall",
  method_num_ord = "spearman",
  method_ord_ord = "kendall"
)

All other combinations use fixed methods (eta or cramersv) appropriate for measuring association strength.

Example with Mixed Types

df_ass <- data.frame(
  height = rnorm(10),
  weight = rnorm(10),
  group  = factor(sample(c("A", "B"), 10, replace = TRUE)),
  score  = ordered(sample(1:3, 10, replace = TRUE))
)

res7 <- assocSelect(df_ass, threshold = 1, method = "bron-kerbosch", use_pivot = TRUE)
res7

Each pairwise association is bounded to [0,1] and treated analogously to correlation.

Theory

Given a symmetric correlation matrix ( R \in \mathbb{R}^{p \times p} ), we seek all maximal subsets ( S \subseteq {1, \dots, p} ) such that:

[ \forall i, j \in S,\ i \neq j: \ |R_{ij}| < t ]

for a fixed threshold ( t \in (0, 1) ).

This is equivalent to finding all maximal cliques in the thresholded correlation graph, where:

Nodes represent variables
Edges connect nodes whose absolute correlation is below the threshold

A maximal clique corresponds to a variable subset that cannot be extended without violating the correlation limit.

Algorithms

ELS (Eppstein–Löffler–Strash)

The ELS algorithm efficiently enumerates all maximal cliques in a sparse graph using degeneracy ordering:

Compute a degeneracy ordering ( v_1, \dots, v_p ).
For each ( i ), extend current clique ( S ) from ( {v_i} ) within candidate set ( C = {v_{i+1}, \dots, v_p} ).
Recursively build cliques, pruning when no further vertices can be added.

Formally, define:

[ \text{extend}(S, C) = \begin{cases} S, & C = \emptyset, \ \bigcup_{v \in C} \text{extend}(S \cup {v},\ C \setminus (N(v) \cup {v})), & \text{otherwise}. \end{cases} ]

ELS avoids redundant exploration, achieving good performance on typical correlation graphs.

Bron–Kerbosch (with Pivoting)

The classical Bron–Kerbosch algorithm enumerates maximal cliques via recursive backtracking with optional pivoting:

Let ( R ) = current clique, ( P ) = prospective nodes, ( X ) = excluded nodes. Then:

[ \text{BK}(R, P, X) = \begin{cases} \text{report}(R), & P = X = \emptyset, \ \text{for each } v \in P \setminus N(u): \ \quad \text{BK}(R \cup {v},\ P \cap N(v),\ X \cap N(v)), \ \quad P \leftarrow P \setminus {v},\ X \leftarrow X \cup {v}. \end{cases} ]

Choosing a pivot ( u \in P \cup X ) and iterating over ( P \setminus N(u) ) reduces recursive calls.

Why corrselect?

Most existing R tools:

Filter one variable at a time (e.g. findCorrelation)
Use greedy or backward-selection heuristics
Do not enumerate all valid subsets

corrselect uniquely provides:

Exact enumeration of maximal subsets
Support for multiple correlation measures
Optional forcing of variables
Full inspection via CorrCombo objects
Fast C++ implementations via Rcpp

This makes it ideal for pipelines where interpretability and completeness are essential.

Inspecting Results

Convert results for downstream use:

df_res <- as.data.frame(res)
head(df_res)

Extract individual subsets:

lapply(corrSubset(res, df, which = 1:2), function(x) head(x, 10))

Summarize correlation metrics:

# Number and size of subsets
length(res@subset_list)
summary(lengths(res@subset_list))

# Summaries of within-subset correlations
summary(res@max_corr)
summary(res@avg_corr)

CorrCombo Object Structure

A CorrCombo S4 object contains:

subset_list: list of character vectors (variable names)
avg_corr, min_corr, max_corr: numeric vectors of correlation metrics
threshold, forced_in, search_type, cor_method, n_rows_used
Attribute use_pivot (if applicable)

Inspect slots:

str(res@subset_list)

Session Info

sessionInfo()

Any scripts or data that you put into this service are public.

corrselect documentation built on Sept. 9, 2025, 5:52 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

corrselect
Correlation-Based Variable Subset Selection

Correlation Subset Selection with corrselect"
In corrselect: Correlation-Based Variable Subset Selection

Overview

Quick Start (`CorrSelect`)

Simulated numeric example

Basic selection

Forcing variables into all subsets

Using a different correlation method

Matrix Interface (`MatSelect`)

Mixed Data Types (`assocSelect`)

Changing Correlation Method

Handling Mixed Data Types

Example with Mixed Types

Theory

Algorithms

ELS (Eppstein–Löffler–Strash)

Bron–Kerbosch (with Pivoting)

Why corrselect?

Inspecting Results

CorrCombo Object Structure

Session Info

Try the corrselect package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

corrselect Correlation-Based Variable Subset Selection

Correlation Subset Selection with corrselect" In corrselect: Correlation-Based Variable Subset Selection

Overview

Quick Start (CorrSelect)

Simulated numeric example

Basic selection

Forcing variables into all subsets

Using a different correlation method

Matrix Interface (MatSelect)

Mixed Data Types (assocSelect)

Changing Correlation Method

Handling Mixed Data Types

Example with Mixed Types

Theory

Algorithms

ELS (Eppstein–Löffler–Strash)

Bron–Kerbosch (with Pivoting)

Why corrselect?

Inspecting Results

CorrCombo Object Structure

Session Info

Try the corrselect package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

corrselect
Correlation-Based Variable Subset Selection

Correlation Subset Selection with corrselect"
In corrselect: Correlation-Based Variable Subset Selection

Quick Start (`CorrSelect`)

Matrix Interface (`MatSelect`)

Mixed Data Types (`assocSelect`)