README.md

featureselection package for R

Feature Selection with Machine Learning Models


Overview

A dataset with a very large number of features can be difficult to work with. Some features may be unimportant or irrelevant, and some can even cause optimization bias. One approach to this problem is to select a subset of the features for your model. Done wisely, feature selection can reduce model complexity, shorten training time, and improve the accuracy of your model. However, this is not a trivial task.

The featureselection package for R can help you with this task. It is the companion to the featureselection package for Python and offers similar functionality.

Features

This package includes four functions for feature selection: forward_selection, recursive_feature_elimination, simulated_annealing, and variance_thresholding.

Existing Ecosystems

Some of the above features already exist within the R ecosystem but are provided for feature parity with the Python edition of this package.

Installation

Make sure you have the devtools package installed. You can install it as follows.

# Install devtools from CRAN (needed to install packages from GitHub)
install.packages("devtools")

Then, install the development version of the featureselection package from GitHub.

devtools::install_github("UBC-MDS/feature-selection-r")

Dependencies

Usage

The Friedman dataset is used to generate data for some of the examples. The dataset contains some features that are generated by a white-noise process; these are expected to be eliminated during feature selection.

NOTE: To run the examples below, you will need the tgp package. It can be installed from the R console with:

install.packages("tgp")

Load Dataset

data <- dplyr::select(tgp::friedman.1.data(), -Ytrue)
X <- dplyr::select(data, -Y)
y <- dplyr::select(data, Y)
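
To make the "white-noise features" idea concrete, here is a minimal base R sketch that builds a dataset of the same shape without tgp. It assumes the standard Friedman #1 benchmark formula, in which only the first five inputs affect the response. The helper name make_friedman1 and its arguments are hypothetical, and the exact behaviour of tgp::friedman.1.data may differ.

```r
# Hypothetical helper (not part of featureselection or tgp).
# Assumes the standard Friedman #1 formula: only X1..X5 drive Y,
# so any remaining columns are pure noise features.
make_friedman1 <- function(n = 100, p = 10, seed = 42) {
  stopifnot(p >= 5)
  set.seed(seed)
  X <- as.data.frame(matrix(runif(n * p), nrow = n))
  names(X) <- paste0("X", seq_len(p))
  signal <- 10 * sin(pi * X$X1 * X$X2) +
    20 * (X$X3 - 0.5)^2 +
    10 * X$X4 +
    5 * X$X5
  X$Y <- signal + rnorm(n)  # add unit-variance Gaussian noise
  X
}

sim <- make_friedman1()
X_sim <- sim[, setdiff(names(sim), "Y")]  # feature columns only
y_sim <- sim["Y"]                         # one-column data frame, like y above
```

A good selection method should tend to keep X1 to X5 and drop X6 to X10 on data like this.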

forward_selection

#
# Create a 'scorer' that accepts a dataset
# and returns the Mean Squared Error.
#
custom_scorer <- function(data){
  model <- lm(Y ~ ., data)
  return(mean(model$residuals^2))
}

featureselection::forward_selection(custom_scorer, X, y, 3, 7)
#> [1] 4 1 2 5
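
For readers unfamiliar with the technique, the following is a minimal sketch of greedy forward selection. It is not the package's source code. The function name forward_selection_sketch and the parameter names min_features and max_features are assumptions; this README only shows two trailing numeric arguments passed by position.

```r
# Hypothetical sketch of greedy forward selection (not the package's code).
# `scorer` takes a data frame containing a Y column; lower scores are better.
forward_selection_sketch <- function(scorer, X, y,
                                     min_features = 1,
                                     max_features = ncol(X)) {
  selected <- integer(0)
  remaining <- seq_len(ncol(X))
  best_score <- Inf
  while (length(selected) < max_features && length(remaining) > 0) {
    # Score every candidate feature when added to the current selection.
    scores <- vapply(remaining, function(j) {
      scorer(cbind(X[, c(selected, j), drop = FALSE], y))
    }, numeric(1))
    k <- which.min(scores)
    # Stop once the minimum size is reached and the score no longer improves.
    if (length(selected) >= min_features && scores[k] >= best_score) break
    best_score <- scores[k]
    selected <- c(selected, remaining[k])
    remaining <- remaining[-k]
  }
  selected
}

# Toy data: only column b carries signal.
set.seed(1)
X_toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
y_toy <- data.frame(Y = 3 * X_toy$b + rnorm(50, sd = 0.1))
mse <- function(data) mean(lm(Y ~ ., data)$residuals^2)

sel <- forward_selection_sketch(mse, X_toy, y_toy, 1, 1)
```

Here sel holds the index of column b, the single most predictive feature. Note that an in-sample error such as this MSE never increases when a feature is added, so with such a scorer the sketch tends to run up to max_features; a held-out or penalised score gives a more meaningful stopping point.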

recursive_feature_elimination

#
# Create a custom 'scorer' that accepts a dataset and returns
# the name of the column with the lowest coefficient weight.
#
custom_scorer <- function(data){
  model <- lm(Y ~ ., data)
  names(which.min(model$coefficients[-1]))[[1]]
}

featureselection::recursive_feature_elimination(custom_scorer, X, y, 4)
#> [1] "X1" "X2" "X4" "X5" "Y"
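
The elimination loop behind this technique can be sketched as follows. This is an illustration under stated assumptions, not the package's implementation: the name rfe_sketch, the parameter name n_features_to_select, and the return format (feature names only) are all hypothetical.

```r
# Hypothetical sketch of recursive feature elimination (not the package's code).
# `scorer` takes a data frame containing a Y column and returns the
# name of the feature to drop next.
rfe_sketch <- function(scorer, X, y, n_features_to_select) {
  features <- names(X)
  while (length(features) > n_features_to_select) {
    worst <- scorer(cbind(X[, features, drop = FALSE], y))
    # Guard against a scorer that returns an unknown column name.
    stopifnot(worst %in% features)
    features <- setdiff(features, worst)
  }
  features
}

# Toy data: a and b carry signal, c is noise.
set.seed(1)
X_toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
y_toy <- data.frame(Y = 2 * X_toy$a + 3 * X_toy$b + rnorm(50, sd = 0.1))

# Same style of scorer as above: the column with the lowest coefficient.
lowest_coef <- function(data) {
  model <- lm(Y ~ ., data)
  names(which.min(model$coefficients[-1]))[[1]]
}

kept <- rfe_sketch(lowest_coef, X_toy, y_toy, 2)
```

On this toy data the noise column c has a coefficient near zero and is dropped first, leaving a and b. A scorer based on the raw coefficient favours dropping negative weights; ranking by absolute value is a common alternative when features can have negative effects.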

simulated_annealing

#
# Create a 'scorer' that accepts a dataset
# and returns the Mean Squared Error.
#
custom_scorer <- function(data){
  model <- lm(Y ~ ., data)
  return(mean(model$residuals^2))
}

featureselection::simulated_annealing(custom_scorer, X, y)
#> [1]  1  2  3  4  5  7  9 10
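
As a rough illustration of the idea, the sketch below runs a basic simulated annealing search over feature subsets. It is not the package's implementation. The function name and the parameters n_iter, temp, cooling, and seed are hypothetical, and the geometric cooling schedule is just one common choice.

```r
# Hypothetical sketch of simulated annealing over feature subsets
# (not the package's code). Lower scores are better.
simulated_annealing_sketch <- function(scorer, X, y, n_iter = 100,
                                       temp = 1, cooling = 0.95, seed = 1) {
  set.seed(seed)
  p <- ncol(X)
  score_of <- function(mask) scorer(cbind(X[, mask, drop = FALSE], y))

  # Start from a random non-empty subset, stored as a logical mask.
  current <- sample(c(TRUE, FALSE), p, replace = TRUE)
  if (!any(current)) current[sample.int(p, 1)] <- TRUE
  current_score <- score_of(current)
  best <- current
  best_score <- current_score

  for (i in seq_len(n_iter)) {
    # Propose a neighbour by flipping one randomly chosen feature.
    candidate <- current
    flip <- sample.int(p, 1)
    candidate[flip] <- !candidate[flip]
    if (any(candidate)) {
      delta <- score_of(candidate) - current_score
      # Always accept improvements; accept worse moves with
      # probability exp(-delta / temp).
      if (delta < 0 || runif(1) < exp(-delta / temp)) {
        current <- candidate
        current_score <- current_score + delta
        if (current_score < best_score) {
          best <- current
          best_score <- current_score
        }
      }
    }
    temp <- temp * cooling  # geometric cooling
  }
  which(best)
}

# Toy data: only column b carries signal.
set.seed(1)
X_toy <- data.frame(a = rnorm(50), b = rnorm(50), c = rnorm(50))
y_toy <- data.frame(Y = 3 * X_toy$b + rnorm(50, sd = 0.1))
mse <- function(data) mean(lm(Y ~ ., data)$residuals^2)

sel <- simulated_annealing_sketch(mse, X_toy, y_toy)
```

Accepting some worse moves early on lets the search escape local minima; as the temperature falls, it behaves more and more like a greedy search.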

variance_thresholding

#
# sample data to test variance
#
data <- data.frame(x1=c(1, 2, 3, 4, 5),
                   x2=c(0, 0, 0, 0, 0),
                   x3=c(1, 1, 1, 1, 1))

featureselection::variance_thresholding(data)
#> [1] 1
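
The underlying idea is simple enough to sketch in a few lines of base R: compute each column's variance and keep the columns that exceed a threshold. The function name and the threshold parameter below are hypothetical, and the default of 0 is an assumption rather than the package's documented behaviour.

```r
# Hypothetical sketch of variance thresholding (not the package's code).
# Returns the indices of columns whose sample variance exceeds `threshold`.
variance_thresholding_sketch <- function(data, threshold = 0) {
  variances <- vapply(data, stats::var, numeric(1))
  unname(which(variances > threshold))
}

toy <- data.frame(x1 = c(1, 2, 3, 4, 5),
                  x2 = c(0, 0, 0, 0, 0),
                  x3 = c(1, 1, 1, 1, 1))

kept <- variance_thresholding_sketch(toy)
kept
#> [1] 1
```

Columns x2 and x3 are constant, so their variance is zero and only x1 is kept.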

Documentation

The official documentation is hosted here: https://ubc-mds.github.io/feature-selection-r

Credits

This package was created with the assistance of the following packages: devtools, usethis, pkgdown



UBC-MDS/feature-selection-r documentation built on April 27, 2020, 7:21 p.m.