FCBF : Fast Correlation Based Filter for Feature Selection
In FCBF: Fast Correlation Based Filter for Feature Selection

library(knitr)
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

The FCBF package is a R implementation of an algorithm developed by Yu and Liu, 2003 : Feature Selection for High-Dimensional Data: A Fast Correlation-Based Filter Solution.

The algorithm described in the article and implemented here uses the idea of "predominant correlation". It selects features in a classifier-independent manner, selecting features with high correlation with the target variable, but little correlation with other variables. Notably, the correlation used here is not the classical Pearson or Spearman correlations, but Symmetrical Uncertainty (SU).

The symmetrical uncertainty is based on information theory, drawing from the concepts of Shannon entropy and information gain. A detailed description is outside the scope of this vignette, but these details are described comprehensively in the Yu and Liu, 2003.

Initially, the algorithm selects features correlated above a given threshold by SU with the class variable. After this initial filtering, it detects predominant correlations of features with the class. The definition is that, for a predominant feature "X", no other feature is more correlated to "X" than "X"" is to the class.

The features more correlated with X than with the class are then tested, and either X, or any other feature from this correlation group, emerges as the predominant correlation feature. Once more, the detailed description is available in Yu and Liu, 2003.

Usage

Discretizing gene expression

The first step to use FCBF is to install FCBF and load the sample data. Expression data from single macrophages is used as an example of how to apply the method to expression data.

BiocManager::install("FCBF")

library("FCBF")
library(SummarizedExperiment)
# load expression data
data(scDengue)
exprs <- SummarizedExperiment::assay(scDengue,'logcounts')
head(exprs[,1:4])

This normalized single cell expression can be used in machine learning (ML) models to obtain classifiers for dengue x control macrophage. One of the ideas behind the ML approach to RNA-Seq data is to use features that are important for classification as biomarkers, for experimental characterization and inference of involved pathways. Overfitting and extensive computational time are, thus big problems that feature selection intends to help.

For the feature selection approach implemented here, the gene expression has to be discretized, as the entropy concept behind the symmetrical uncertainty is built upon discrete classes.

There is no consensus of the optimal way of discretizing gene expression, and this is even more true to single cell RNA-Seq. The \code{discretize_exprs} function simply gets the range of expression (min to max) for each gene and assigns the first third the label of "low" and, to the other two, the label of "high". In this way, all the zeros and low expression values are assigned the label "low" in a gene-dependent manner.

Notably, this step is only done for variable selection. When you fit machine learning models with those variables, you can still use the original, continuous, numeric, features.

# discretize expression data
discrete_expression <- as.data.frame(discretize_exprs(exprs))

head(discrete_expression[,1:4])

Selecting features

The filter approaches as FCBF do not target a specific machine learning model. However, they are supervised approaches, requiring you to give a target variable as input. In this case, we have two classes: dengue infected and not infected macrophages.

#load annotation data
infection <- SummarizedExperiment::colData(scDengue)


# get the class variable
#(note, the order of the samples is the same as in the exprs file)
target <- as.factor(infection$infection)

Having a class variable and a discrete, categorical feature table, we can run the function fcbf. The code as follows would run for a minute or two and generate an error.

# you only need to run that if you want to see the error for yourself
# fcbf_features <- fcbf(discrete_expression, target, verbose = TRUE)

As you can see, we get an error.

\code{"Error in fcbf(discrete_expression, target, verbose = TRUE) : No prospective features for this threshold level. Threshold: 0.25"}

That is because the built-in threshold for the SU, 0.25, is too high for this data set. You can use the function \code{su_plot} to decide for yourself which threshold is the best. In this way you can see the distribution of correlations (as calculated from symmetrical uncertainty) for each variable with the target classes.

su_plot(discrete_expression,target)

Seems like 0.05 is something reasonable for this dataset. This changes, of course, the number of features selected. A lower threshold will explore more features, but they will have less relation to the target variable. Let's run it again with the new threshold.

fcbf_features <- fcbf(discrete_expression, target, thresh = 0.05, verbose = TRUE)

In the end, this process has selected a very low number of variables in comparison with the original dataset. Now we should see if they are really useful for classification. First, we will get a 'mini feature table', just with the selected variables and a table for comparison, with the 100 most variable genes

mini_single_cell_dengue_exprs <- exprs[fcbf_features$index,]

vars <- sort(apply(exprs, 1, var, na.rm = TRUE), decreasing = TRUE)
data_top_100_vars <- exprs[names(vars)[1:100], ]

The top 100 most variable is a quick-and-dirty unsupervised feature selection. Nevertheless, we can use it to see if the genes selected are really better at classification tasks. With the packages 'caret' and 'mlbench' we can check if that is true.

#first transpose the tables and make datasets as caret likes them

dataset_fcbf <- cbind(as.data.frame(t(mini_single_cell_dengue_exprs)),target_variable = target)
dataset_100_var <- cbind(as.data.frame(t(data_top_100_vars)),target_variable = target)

library('caret')
library('mlbench')

control <- trainControl(method="cv", number=5, classProbs=TRUE, summaryFunction=twoClassSummary)

In the code above we created a plan to test the classifiers by 5-fold cross validation. Any classifier can be used, but for illustration, here we will use the very popular radial svm. As metric for comparison we will use the area-under-the-curve (AUC) of the receiver operating characteristic (ROC) curve :

svm_fcbf <-
  train(target_variable ~ .,
        metric="ROC",
        data = dataset_fcbf,
        method = "svmRadial",
        trControl = control)

svm_top_100_var <-
  train(target_variable ~ .,
        metric="ROC",
        data = dataset_100_var,
        method = "svmRadial",
        trControl = control)

svm_fcbf_results <- svm_fcbf$results[svm_fcbf$results$ROC == max(svm_fcbf$results$ROC),]
svm_top_100_var_results <- svm_top_100_var$results[svm_top_100_var$results$ROC == max(svm_top_100_var$results$ROC),]

cat(paste0("For top 100 var: \n",
  "ROC = ",  svm_top_100_var_results$ROC, "\n",
             "Sensitivity  = ", svm_top_100_var_results$Sens, "\n",
             "Specificity  = ", svm_top_100_var_results$Spec, "\n\n",
  "For FCBF: \n",
  "ROC = ",  svm_fcbf_results$ROC, "\n",
             "Sensitivity  = ", svm_fcbf_results$Sens, "\n",
             "Specificity  = ", svm_fcbf_results$Spec))

As we can see, the selected variables gave rise to a better SVM model than selecting the top 100 variables (as measured by the area under the curve of the receiver-operating characteristic curve. Notably, running the svm radial with the full exprs set gives an error of the sort : Error: protect(): "protection stack overflow".

The multiplicity of infection (MOI) of the cells was relatively low (1 virus/cell), so it is expected that some cells were not infected. Thus, due to this overlap o f healthy cells in both groups, the groups are probably non-separable. In any case, FCBF seems to do a reasonably good job at selecting variables.

Final remarks

Even though tools are available for feature selection in packages such as FSelector (https://cran.r-project.org/web/packages/FSelector/FSelector.pdf), up to date, there were no easy implementations in R for FCBF. The article describing FCBF has more than 1800 citations, but almost none from the biomedical community.

We expect that by carefully implementing and documenting FCBF in R, this package might improve usage of filter-based feature selection approaches that aim at reducing redundancy among selected features. Other tools similar to FCBF are available in Weka (https://www.cs.waikato.ac.nz/ml/weka/) and Python/scikit learn (http://featureselection.asu.edu/). A recent good review, for those interested in going deeper, is the following ref:

Li, J., Cheng, K., Wang, S., Morstatter, F., Trevino, R. P., Tang, J., & Liu, H. (2017). Feature selection: A data perspective. ACM Computing Surveys (CSUR), 50(6), 94.

We note that other techniques based on predominant correlation might be better depending on the dataset and the objectives. FCBF provides a interpretable and robust option, with results that are generally good.

The application of filter-based feature selections for big data analysis in the biomedical sciences can have not only a direct effect in classification efficiency but might lead to interesting biological interpretations and possible quick identification of biomarkers.