
Introduction

feseR provides functionality to combine multiple Feature Selection (FS) methods for analyzing high-dimensional omics data in the R environment. The feature selection steps fall into three categories: Univariate (correlation filter and information gain), Multivariate (Principal Component Analysis and matrix correlation based) and Recursive Feature Elimination (wrapped around a Machine Learning algorithm). The goal is to assemble these steps into an efficient workflow for feature selection in classification and regression problems. The package also includes several example datasets.

Available datasets

We provide some example datasets (Transcriptomics and Proteomics) with the package. A general description of the data is listed below:

Note: Each dataset is expected to be a matrix with features in columns and samples in rows.
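
As a minimal illustration of this layout, the sketch below builds a toy matrix in base R with samples in rows, features in columns, and the class label in the last column (the position the examples in this document assume). The object name `toy`, the row and column names, and the values are made up for the example and are not part of feseR.

```r
# Toy data: 6 samples (rows) x 3 features (columns); values are arbitrary
set.seed(1)
toy <- matrix(rnorm(18), nrow = 6, ncol = 3,
              dimnames = list(paste0("sample", 1:6), paste0("gene", 1:3)))

# Append the class label as the last column (0/1 coding keeps the matrix numeric)
toy <- cbind(toy, class = rep(c(0, 1), each = 3))

# Sanity checks on the expected layout
stopifnot(is.matrix(toy), is.numeric(toy))
dim(toy)        # 6 4: six samples, three features plus the class column
colnames(toy)   # "gene1" "gene2" "gene3" "class"
```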

Examples

Preparing your data

  library(feseR)

   # loading example data (TNBC)
   data(TNBC)

   # getting features
   features <- TNBC[,-ncol(TNBC)]

   # getting class variable (expected last column)
   class <- TNBC[,ncol(TNBC)]

   # pre-filtering
   # keep only features (cols) with a missing rate of at most 0.25 across samples (rows)
   features <- filterMissingnessRate(features, max_missing_rate = 0.25)

   # impute missing values
   features <- imputeMatrix(features, method = "mean")

   # Scale the features. This transformation gives each predictor
   # zero mean and a standard deviation equal to one.
   features <- scale(features, center=TRUE, scale=TRUE)
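
To make the three preprocessing steps concrete, here is a base R sketch of what they amount to on a toy matrix. It is not feseR's implementation: the inclusive `<=` comparison against the threshold and the column-wise mean imputation are assumptions about how `filterMissingnessRate` and `imputeMatrix` behave, and the toy matrix `x` is invented for the example.

```r
# Toy matrix with missing values: 4 samples x 3 features (filled column-wise)
x <- matrix(c(1, 2, NA, 4,      # f1: 25% missing
              NA, NA, 3, NA,    # f2: 75% missing
              5, 6, 7, 8),      # f3: complete
            nrow = 4,
            dimnames = list(NULL, c("f1", "f2", "f3")))

# 1) Drop features whose fraction of missing values exceeds the threshold
#    (assumption: the threshold itself is still accepted)
max_missing_rate <- 0.25
missing_rate <- colMeans(is.na(x))
x <- x[, missing_rate <= max_missing_rate, drop = FALSE]

# 2) Replace each remaining NA with the mean of its column
for (j in seq_len(ncol(x))) {
  x[is.na(x[, j]), j] <- mean(x[, j], na.rm = TRUE)
}

# 3) Center each feature to zero mean and scale it to unit standard deviation
x <- scale(x, center = TRUE, scale = TRUE)
colnames(x)   # "f1" "f3"
```

Filtering before imputing matters here: a feature that is mostly missing would otherwise be filled almost entirely with a single constant.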

Univariate filter examples

  # filtering by correlation
  output <- filter.corr(features = features, class = class, mincorr = 0.3)

  # filtering by information gain
  output <- filter.gain.inf(features = features, class = class, zero.gain.out = TRUE)
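
The idea behind a correlation filter can be sketched in a few lines of base R. This is only an illustration under stated assumptions: it uses the absolute Pearson correlation between each feature and a 0/1-coded class and keeps features at or above `mincorr`, which may differ from what `filter.corr` does internally. The toy features `up`, `down` and `noise` are invented for the example.

```r
n <- 40
y <- rep(c(0, 1), each = n / 2)          # binary class coded as 0/1
wiggle <- rep(c(1, -1), times = n / 2)   # alternating pattern, uncorrelated with y

x <- cbind(up    = y + 0.1 * wiggle,     # increases with the class
           down  = -y + 0.1 * wiggle,    # decreases with the class
           noise = wiggle)               # unrelated to the class

# Absolute correlation of every feature with the class
mincorr <- 0.3
r <- abs(cor(x, y))[, 1]

# Keep only the features that reach the cutoff
kept <- x[, r >= mincorr, drop = FALSE]
colnames(kept)   # "up" "down"
```

Taking the absolute value keeps features that are strongly negatively associated with the class, which are as informative as positively associated ones.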

Multivariate filter examples

  # filtering by matrix correlation (cutoff 0.75)
  output <- filter.matrix.corr(features = features, maxcorr = 0.75)

  # data dimension reduction using PCA (return only PCs explaining 95% of the variance)
  output <- filter.pca(features = features, cum.var.cutoff = .95)
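
The cumulative-variance cutoff used for the PCA step can be illustrated with base R's `prcomp`. The sketch below is an assumption about the general technique rather than a copy of `filter.pca`: it keeps the smallest number of leading principal components whose cumulative share of variance reaches the cutoff. The toy features `g1` to `g5` are invented for the example.

```r
set.seed(7)
n <- 50
base1 <- rnorm(n)
base2 <- rnorm(n)

# Five features, two pairs of which are nearly redundant
x <- cbind(g1 = base1,
           g2 = base1 + rnorm(n, sd = 0.05),
           g3 = base2,
           g4 = base2 + rnorm(n, sd = 0.05),
           g5 = rnorm(n))
x <- scale(x)

# PCA on the already standardized data
pca <- prcomp(x, center = FALSE, scale. = FALSE)

# Share of variance per component and its running total
var_explained <- pca$sdev^2 / sum(pca$sdev^2)
cum_var <- cumsum(var_explained)

# Smallest number of components reaching the cutoff
cum.var.cutoff <- 0.95
n_pcs <- which(cum_var >= cum.var.cutoff)[1]

# Samples projected onto the retained components
scores <- pca$x[, seq_len(n_pcs), drop = FALSE]
dim(scores)
```

Note that the retained columns are linear combinations of the original features, so this step reduces dimensionality without selecting individual features.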

Combining Feature Selection methods

The combineFS function combines multiple feature selection methods into a single workflow.

   # combining a univariate correlation filter, a multivariate matrix
   # correlation filter and recursive feature elimination wrapped with random forest
   results <- combineFS(features = features, class = class,
                        univariate = 'corr', mincorr = 0.3,
                        multivariate = 'mcorr', maxcorr = 0.75,
                        wrapper = 'rfe.rf', number.cv = 10,
                        group.sizes = seq(1, 100, 10),
                        verbose = FALSE, extfolds = 10)


   # getting the metrics from the training process
   training_results <- results$training

   # getting the metrics from the testing process
   testing_results <- results$testing
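
The `group.sizes` argument in the call above is an ordinary numeric vector. On the assumption that it lists the feature-subset sizes evaluated during recursive feature elimination (the text here does not spell this out), it can help to inspect the vector on its own before starting a long run:

```r
# Candidate subset sizes as passed to combineFS above
group.sizes <- seq(1, 100, 10)
group.sizes
# 1 11 21 31 41 51 61 71 81 91

# Number of candidate sizes (the sequence stops at 91, not 100)
length(group.sizes)
# 10
```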


Results from the training phase

   pander::pandoc.table(training_results, digits = 4,  split.table = Inf,
                       caption = 'Best model metrics from 10-fold cross-validation resampling.')


Results from the testing phase

   pander::pandoc.table(testing_results, digits = 4,  split.table = Inf,
                       caption = 'Classification metrics from ten class-balanced and randomized runs.')


Visualizing the Feature Selection process

   # plot PCA (PC1 vs. PC2) on all features
   plot_pca(features = features, class = class, list.plot = FALSE)


   # getting the filtered matrix
   filtered.features <- features[, results$opt.variables]

   # plot PCA (PC1 vs. PC2) on the selected features
   plot_pca(features = filtered.features, class = class, list.plot = FALSE)


   # plot correlation matrix
   plot_corr(features = filtered.features, corr.method = 'pearson')


enriquea/feseR documentation built on March 30, 2021, 4:12 p.m.