HybridFS: A Hybrid Feature Selection Function


Description

HybridFS combines filter and wrapper methods for feature selection. First-level feature reduction filters predictors using statistical tests and measures such as the Chi-Square test of independence, information value (IV) and entropy-related methods. The features that pass this filter are then fed into a classification algorithm, and the features of the optimal model are returned along with their feature importance.

Usage

HybridFS(input.df, target.var.name)

Arguments

input.df

Input data frame that contains the target variable and the predictor variables, with no missing values. Predictors can be either categorical or continuous. A unique identifier, if present, should be named "ID".

target.var.name

Name of the binary target variable. The target variable should be an integer with only two distinct values (0 and 1).
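
The requirements on the two arguments can be checked before calling HybridFS. The following is a minimal sketch in base R; check_hybridfs_input() and the toy data frame are hypothetical and not part of the package, and the check accepts any numeric 0/1 target rather than strictly an integer one.

```r
# Hypothetical helper (not part of HybridFS): reports violations of the
# documented input requirements as a character vector.
check_hybridfs_input <- function(input.df, target.var.name) {
  stopifnot(is.data.frame(input.df),
            target.var.name %in% names(input.df))
  target <- input.df[[target.var.name]]
  problems <- character(0)
  # The input must not contain missing values
  if (any(is.na(input.df))) {
    problems <- c(problems, "missing values present")
  }
  # The target must take exactly the two values 0 and 1
  if (!is.numeric(target) || !all(target %in% c(0, 1)) ||
      length(unique(target)) != 2) {
    problems <- c(problems, "target is not a binary 0/1 variable")
  }
  problems
}

# Toy data, made up for illustration
toy <- data.frame(ID = 1:6,
                  age = c(22, 35, 58, 41, 19, 63),
                  sex = c("m", "f", "f", "m", "f", "m"),
                  Survived = c(0L, 1L, 1L, 0L, 1L, 0L))
check_hybridfs_input(toy, "Survived")
```

An empty result means no problem was found by these two checks; it does not guarantee that HybridFS will accept the data.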

Details

Binning of Continuous Predictors
Supervised binning of continuous predictors reduces computation time and can improve model performance and predictive power. Binning is based on similar weight of evidence (WOE) values and on information value (IV). The transformed dataset, with a binned copy of each continuous variable, is then fed into the hybrid filter-wrapper algorithm. Continuous features that are selected are returned as binned variables (e.g. average_volume is returned as average_volume.binned). To retrieve the transformed dataset, use the FinalBinnedData() function.
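
To illustrate the quantities involved, the sketch below computes WOE per bin and the overall IV for one binned predictor in base R. It is not the package's binning routine: the cut points are fixed by hand, the helper woe_iv() and the data are hypothetical, and WOE is taken as the log of the non-event share over the event share, which is one of two common sign conventions (IV is the same under either).

```r
# Hypothetical helper (not part of HybridFS): WOE per bin and total IV.
# Assumes every bin contains at least one event and one non-event.
woe_iv <- function(bin, target) {
  counts <- table(bin, target)                       # rows: bins; columns "0", "1"
  pct_nonevent <- counts[, "0"] / sum(counts[, "0"]) # share of all non-events in each bin
  pct_event    <- counts[, "1"] / sum(counts[, "1"]) # share of all events in each bin
  woe <- log(pct_nonevent / pct_event)
  list(woe = woe, iv = sum((pct_nonevent - pct_event) * woe))
}

# Toy data, made up for illustration
age    <- c(22, 35, 58, 41, 19, 63, 27, 49, 33, 55, 24, 61)
target <- c( 1,  0,  0,  1,  1,  0,  1,  0,  1,  0,  0,  1)

# Fixed cut points, for illustration only
age.binned <- cut(age, breaks = c(0, 30, 50, Inf))
res <- woe_iv(age.binned, target)
res$woe
res$iv
```

Bins with similar WOE carry similar information about the target, which is why they are natural candidates for merging in a supervised binning scheme.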

Level1 Feature Reduction - Filter Method
The Chi-Square test of independence, information value (IV) and entropy-related methods such as Information Gain, Gain Ratio and Symmetrical Uncertainty are used to generate variable importance scores. The top n features are selected dynamically, and different subsets are formed based on the relative ranking from each of the filter methods.
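
A rough sketch of this kind of filter scoring, using only base R, is shown below. It scores two categorical predictors with the Chi-Square statistic and with information gain, then forms one top-n subset per method. The helpers entropy() and info_gain(), the data, and the fixed top_n are assumptions made for illustration; how HybridFS chooses n and builds its subsets is not reproduced here.

```r
# Hypothetical helpers (not part of HybridFS)
entropy <- function(x) {
  p <- prop.table(table(x))
  p <- p[p > 0]
  -sum(p * log2(p))
}

# Information gain: H(y) - H(y | x)
info_gain <- function(x, y) {
  h_cond <- sum(sapply(split(y, x), function(ys) {
    length(ys) / length(y) * entropy(ys)
  }))
  entropy(y) - h_cond
}

# Toy data, made up for illustration
df <- data.frame(
  sex    = c("f", "f", "f", "f", "m", "m", "m", "m"),
  deck   = c("a", "b", "a", "b", "a", "b", "a", "b"),
  target = c(1, 1, 1, 0, 0, 0, 0, 1),
  stringsAsFactors = FALSE
)
predictors <- setdiff(names(df), "target")

scores <- data.frame(
  feature = predictors,
  # Warnings about small expected counts are suppressed for this tiny example
  chi.sq = sapply(predictors, function(v) {
    tab <- table(df[[v]], df$target)
    unname(suppressWarnings(chisq.test(tab, correct = FALSE))$statistic)
  }),
  info.gain = sapply(predictors, function(v) info_gain(df[[v]], df$target)),
  row.names = NULL,
  stringsAsFactors = FALSE
)

# One candidate subset per filter method: the top_n features by that score
top_n <- 1
subsets <- list(
  chi.sq    = scores$feature[order(-scores$chi.sq)][seq_len(top_n)],
  info.gain = scores$feature[order(-scores$info.gain)][seq_len(top_n)]
)
scores
subsets
```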

Level2 Feature Reduction - Wrapper Method
The different subsets of variables from the first level are each used to train a classification algorithm. The optimum probability cut-off for the target class is determined by the K-S statistic. A combination of Area Under the Curve (AUC) and F-score (F1 score) is used as the benchmark metric for model performance. The best set of features from the optimal model is returned with variable importance and rank. Out-of-sample validation results are also displayed to help assess the stability of the selected model.
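
The metrics named above can be computed from predicted probabilities as in the following base R sketch. The helpers ks_cutoff(), f1_score() and auc() and the example probabilities are hypothetical; the sketch shows the standard definitions only and does not reproduce how HybridFS combines AUC and F1 when comparing models.

```r
# Hypothetical helpers (not part of HybridFS)

# K-S statistic: largest gap between the true positive rate and the false
# positive rate over all candidate cut-offs (predict 1 when prob >= cut-off)
ks_cutoff <- function(prob, target) {
  cutoffs <- sort(unique(prob))
  ks <- sapply(cutoffs, function(k) {
    mean(prob[target == 1] >= k) - mean(prob[target == 0] >= k)
  })
  list(cutoff = cutoffs[which.max(ks)], ks = max(ks))
}

# F1 score: harmonic mean of precision and recall
f1_score <- function(pred, target) {
  tp <- sum(pred == 1 & target == 1)
  precision <- tp / sum(pred == 1)
  recall <- tp / sum(target == 1)
  2 * precision * recall / (precision + recall)
}

# AUC in its rank (Mann-Whitney) form: the probability that a random
# positive case is scored above a random negative case
auc <- function(prob, target) {
  r <- rank(prob)
  n1 <- sum(target == 1)
  n0 <- sum(target == 0)
  (sum(r[target == 1]) - n1 * (n1 + 1) / 2) / (n1 * n0)
}

# Toy predicted probabilities and true labels, made up for illustration
prob   <- c(0.9, 0.8, 0.7, 0.6, 0.55, 0.4, 0.3, 0.2)
target <- c(1,   1,   0,   1,   1,    0,   0,   0)

best <- ks_cutoff(prob, target)
pred <- as.integer(prob >= best$cutoff)
f1   <- f1_score(pred, target)
best$cutoff
f1
auc(prob, target)
```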

Value

An object of class FS, which is a list with the following components:

imp.features

A data frame of the selected features from the optimal model, returned with their relative rank. A variable importance plot for the top 10 selected variables is displayed. Continuous features that are selected are returned as binned variables (e.g. average_volume is returned as average_volume.binned).

model.perf

Performance metrics of the optimal model, such as F1 score, accuracy, precision and recall.

Note

Requires a recent version of Java (8 or above).

Examples

FS <- HybridFS(input.df = validation, target.var.name = "Survived")

HybridFS documentation built on June 11, 2019, 5:02 p.m.
