knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
To load the package, you can use the below command:
library(FeatureTerminatoR)
library(caret)
library(dplyr)
library(ggplot2)
library(randomForest)
The trick to this is to use cross validation, or repeated cross validation, to eliminate n features from the model. This is achieved by fitting the model multiple times and, at each step, removing the weakest features, as determined either by the coefficients in the model or by the model's feature importance attributes.
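As a minimal sketch of the procedure described above, the following calls caret's rfe() directly on the iris data. Whether rfeTerminator wraps rfe() in exactly this way is an assumption, and the fold count and subset sizes are illustrative choices rather than the package defaults.

```r
library(caret)
library(randomForest)

set.seed(123)  # make the cross-validation folds reproducible

# 5-fold cross validation, ranking features by random forest importance
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)

# Evaluate subsets of 1 to 4 features and keep the best-performing size
rfe_sketch <- rfe(
  x = iris[, 1:4],
  y = iris$Species,
  sizes = 1:4,
  rfeControl = ctrl
)

print(rfe_sketch$results)      # resampled performance for each subset size
print(predictors(rfe_sketch))  # names of the retained features
```

Swapping rfFuncs for another caret helper set changes how the features are ranked at each elimination step.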
Within the package there are a number of different types you can utilise:
See the underlying caretFuncs() documentation.
The package supports all of these methods. I will use the random forest variable importance selection method, as it is quick to train on our test dataset.
The following steps will take you through how to use this function.
For the test data we will use the built-in iris dataset.
df <- iris
print(head(df, 10))
Now is the time to use the workhorse function for the RFE (Recursive Feature Elimination) methods:
# Passing in the indexes as slices: x values located in index 1:4 and y value in location 5
rfe_fit <- rfeTerminator(df, x_cols = 1:4, y_cols = 5, alter_df = TRUE, eval_funcs = rfFuncs)
# Passing by column name
rfe_fit_col_name <- rfeTerminator(df, x_cols = 1:4, y_cols = "Species", alter_df = TRUE)
# A further example
ref_x_col_name <- rfeTerminator(df,
                                x_cols = c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width"),
                                y_cols = "Species")
This shows that you can pass the columns to the function either by position or by name. When passing x columns by name, the names need to be wrapped in a vector, as the further example highlights. Otherwise, you can simply pass the columns as a slice of the data frame.
The model will select the best combination of features, with the sizes argument indicating the range of feature counts to retain. This defaults to the integer range 1:10.
# Explore the optimal model results
print(rfe_fit$rfe_model_fit_results)
# View the optimum variables selected
print(rfe_fit$rfe_model_fit_results$optVariables)
The returned list retains the original data, with the alter_df argument indicating whether the results should only be output for manual evaluation of the backward elimination, or whether the data frame should be reduced. The function can be run on the full data before a training / testing split, or on the training set only, depending on your ML pipeline strategy.
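To illustrate the training-set strategy, here is a hedged sketch that selects features on the training rows only and then applies the same column selection to the held-out rows. It calls caret's createDataPartition() and rfe() directly instead of rfeTerminator, and the 80/20 split and 5 folds are assumptions made for the example.

```r
library(caret)
library(randomForest)

set.seed(123)

# Hold out 20% of the rows, stratified by Species
train_idx <- createDataPartition(iris$Species, p = 0.8, list = FALSE)
train_df <- iris[train_idx, ]
test_df  <- iris[-train_idx, ]

# Run the elimination on the training rows only
ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 5)
fit <- rfe(
  x = train_df[, 1:4],
  y = train_df$Species,
  sizes = 1:4,
  rfeControl = ctrl
)

# Apply the same column selection to both sets
keep <- c(predictors(fit), "Species")
train_reduced <- train_df[, keep]
test_reduced  <- test_df[, keep]
```

Selecting on the training rows alone keeps the held-out rows from influencing which features survive.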
To view the original data:
# Explore the original data passed to the function
print(head(rfe_fit$rfe_original_data))
Viewing the outputs post termination, you can observe that the features that have little bearing on the dependent (predicted) variable are terminated:
# Explore the data adapted with the less important features removed
print(head(rfe_fit$rfe_reduced_data))
The features that do not have a significant impact have been removed from the data, which should reduce the time needed to train a downstream ML or predictive model.
Next, we move on to another feature selection method. This time we use a correlation method to remove the potential effects of multicollinearity.
The main reason you would want to do this is to avoid multicollinearity. This is an effect caused by high intercorrelations among two or more independent variables in linear models. It is less of a problem with non-linear models, such as trees, but it can still cause high variance in the models, so scaling of independent variables is always recommended.
In general, multicollinearity can lead to wider confidence intervals that produce less reliable probabilities in terms of the effect of independent variables in a model. That is, the statistical inferences from a model with multicollinearity may not be dependable.
Key takeaways:
This is why you would want to remove highly correlated features.
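To make the idea concrete, the base R sketch below lists the feature pairs in iris whose absolute Pearson correlation exceeds a 0.90 cut-off. It does not use the package, and the names pair_df and high_pairs are hypothetical helpers introduced for this example only.

```r
# Pairwise Pearson correlations between the four numeric iris features
cm <- cor(iris[, 1:4])

# Keep each pair once (upper triangle), which also drops the diagonal
pairs <- which(upper.tri(cm), arr.ind = TRUE)
pair_df <- data.frame(
  feature_a = rownames(cm)[pairs[, "row"]],
  feature_b = colnames(cm)[pairs[, "col"]],
  correlation = cm[pairs],
  stringsAsFactors = FALSE
)

# Pairs whose absolute correlation exceeds the 0.90 cut-off
high_pairs <- pair_df[abs(pair_df$correlation) > 0.90, ]
print(high_pairs)
```

On iris, only the Petal.Length and Petal.Width pair crosses this threshold, so one of the two is a candidate for removal.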
We already have our test data loaded in, and we will use the dataset from the previous example in this example.
# Fit a model on the results and define a confidence cut off limit
mc_term_fit <- FeatureTerminatoR::mutlicol_terminator(df, x_cols = 1:4,
                                                      y_cols = "Species",
                                                      alter_df = TRUE,
                                                      cor_sig = 0.90)
Exploring the outputs:
# Visualise the quantile distributions of where the correlations lie
mc_term_fit$corr_quant_chart
This shows that our cut off range starts at about the 85th percentile of the correlation distributions, at the top end. This would also work for strong negative associations. Here, we could probably be a little more strict in our 90% limit, but we will keep it at this for now, as we do not want to purge all the features.
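A rough way to check where a cut-off sits in the distribution, using base R only, is sketched below. It assumes absolute correlations taken from the upper triangle of the matrix; the package's own chart may be built differently, so treat the numbers as indicative.

```r
# Absolute off-diagonal correlations for the four numeric iris features
cm <- cor(iris[, 1:4])
off_diag <- abs(cm[upper.tri(cm)])

# Deciles of the correlation distribution
print(quantile(off_diag, probs = seq(0, 1, by = 0.1)))

# Share of feature pairs that fall below the 0.90 cut-off
share_below <- mean(off_diag < 0.90)
print(share_below)
```

With only six feature pairs in iris, five of them fall below 0.90, which is broadly in line with the percentile read off the chart.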
This has been built into the tool for ease:
# View the correlation matrix
mc_term_fit$corr_matrix
# View the covariance matrix
mc_term_fit$cov_matrix
# View the quantile range
mc_term_fit$corr_quantile
# This excludes the diagonal correlations, as this would inflate the quantile distribution
There are some strong correlations between petal length and petal width, so these will be clipped by our choice of cut-off.
To get the outputs from the feature selection method, we use the following call to obtain the output tibble:
# Get the removed and reduced data
new_df_post_feature_removal <- mc_term_fit$feature_removed_df
glimpse(new_df_post_feature_removal)
Here, the algorithm has removed a feature based on the cut-off limit provided.
These algorithms will form the first version of the package, but still to be developed are: