knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Identify the best machine learning binary classification model

When creating binary classification models, you must:

+ identify your desired machine learning models

+ tune the parameters for each model

+ train each model on your training data

+ test the models on your testing data

+ compare the results to find the best model
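The steps above can be sketched by hand in base R for a small case. This is only an illustration under stated assumptions: the data are simulated, the two candidate models are plain logistic regressions that differ in their predictors, the tuning step is omitted, and all object names (dat, fits, test_accuracy) are hypothetical rather than part of bestclassifier.

```r
set.seed(1234)

# Simulated two-class data; stands in for a real dataset (assumption)
n <- 500
dat <- data.frame(x1 = rnorm(n), x2 = rnorm(n))
dat$y <- rbinom(n, 1, plogis(1.5 * dat$x1 - dat$x2))

# Split into training (70%) and testing (30%) sets
train_idx <- sample(n, size = round(0.7 * n))
train_set <- dat[train_idx, ]
test_set  <- dat[-train_idx, ]

# Identify and train the candidate models on the training data
fits <- list(
  x1_only = glm(y ~ x1, data = train_set, family = binomial),
  x1_x2   = glm(y ~ x1 + x2, data = train_set, family = binomial)
)

# Test each model on the testing data (accuracy at a 0.5 cutoff)
test_accuracy <- sapply(fits, function(fit) {
  prob <- predict(fit, newdata = test_set, type = "response")
  mean((prob > 0.5) == (test_set$y == 1))
})

# Compare the results to find the best model
best <- names(which.max(test_accuracy))
print(test_accuracy)
print(best)
```

Even with two simple models and no tuning, the bookkeeping adds up quickly, and it grows with every additional model type.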

bestclassifier simplifies this process by letting you complete all of these tasks in a single function call.

+ This function supports eight machine learning binary classification models:

    - logistic regression

    - lasso regression

    - random forest

    - extreme gradient boosting

    - support vector machine

    - artificial neural network

    - linear discriminant analysis

    - k nearest neighbors

Parameters in the bestclassifier function

data

form

p

method

number

repeats

tuneLength

positive

model

set_seed

subset_train

desired_metric
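Several of these argument names (p, method, number, repeats, tuneLength, positive) match argument names in the caret package. The sketch below shows what those names mean in caret itself. That bestclassifier passes them through to caret in this way is an assumption, not something stated here. The simulated data and the object names are hypothetical, and the code needs the caret and pROC packages installed.

```r
library(caret)

set.seed(1234)  # fixes the random split and resampling

# Simulated two-class data; stands in for a real dataset (assumption)
n <- 400
x1 <- rnorm(n)
x2 <- rnorm(n)
y <- factor(ifelse(x1 + 0.5 * x2 + rnorm(n) > 0, "Default", "NoDefault"),
            levels = c("Default", "NoDefault"))
dat <- data.frame(y, x1, x2)

# p: proportion of rows placed in the training set
idx <- createDataPartition(dat$y, p = 0.7, list = FALSE)
train_set <- dat[idx, ]
test_set  <- dat[-idx, ]

# method, number, repeats: resampling scheme (5-fold CV, repeated once)
ctrl <- trainControl(method = "repeatedcv", number = 5, repeats = 1,
                     classProbs = TRUE, summaryFunction = twoClassSummary)

# tuneLength: number of candidate values tried per tuning parameter
# metric = "ROC": select the model by area under the ROC curve
fit <- train(y ~ ., data = train_set, method = "glm",
             metric = "ROC", trControl = ctrl, tuneLength = 5)

# positive: the class treated as the event in the confusion matrix
pred <- predict(fit, newdata = test_set)
cm <- confusionMatrix(pred, test_set$y, positive = "Default")
print(cm)
```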

Data: CCD

To explore the bestclassifier function, we will use the CCD dataset. This dataset contains default status and payment information for all credit card customers transacting with a Taiwanese bank in 2005.

CCD <- bestclassifier::CCD
str(CCD)

Interpreting the output generated by bestclassifier

In the example below, I am looking for the machine learning model that produces the highest AUC when classifying credit card default. The models predict the "Default" category of the default.payment.next.month variable, using all of the other variables in the dataset as predictors. Because the CCD data contains nearly 30,000 observations, I train the models on 1% of the training dataset for fast results.

library(bestclassifier)
bestclassifier(
  data = CCD, form = default.payment.next.month ~ ., p = 0.7,
  method = "repeatedcv", number = 5, repeats = 1, tuneLength = 5,
  positive = "Default",
  model = c("log_reg", "lasso", "lda", "svm", "lda", "knn", "ann", "xgboost"),
  set_seed = 1234, subset_train = 0.01, desired_metric = "ROC"
)

Understanding the Bar Graph

According to the bar graph, the lasso regression model performed best on the training data, with an AUC of 0.6495.

Analyzing the Confusion Matrix

Random Forest results on testing data:

+ Accuracy: 78.2%

+ Sensitivity: 3.1%

+ Specificity: 99.6%

+ Positive Predictive Value: 67.8%

+ Negative Predictive Value: 78.3%

Accuracy

Sensitivity

Specificity

Positive Predictive Value

Negative Predictive Value
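Each of these five metrics is a ratio of counts from the confusion matrix. The sketch below computes them from the four cell counts, using the standard definitions. The counts are hypothetical values chosen only for illustration (they are not the random forest results reported above), and classification_metrics is a helper name introduced here, not a bestclassifier function.

```r
# Standard definitions, with "Default" as the positive class
classification_metrics <- function(tp, fn, fp, tn) {
  c(
    accuracy    = (tp + tn) / (tp + fn + fp + tn),  # all correct / all cases
    sensitivity = tp / (tp + fn),   # actual positives that were caught
    specificity = tn / (tn + fp),   # actual negatives that were cleared
    ppv         = tp / (tp + fp),   # predicted positives that were right
    npv         = tn / (tn + fn)    # predicted negatives that were right
  )
}

# Hypothetical confusion-matrix counts (illustration only)
tp <- 40    # predicted Default, actually Default
fn <- 60    # predicted not Default, actually Default
fp <- 10    # predicted Default, actually not Default
tn <- 290   # predicted not Default, actually not Default

m <- classification_metrics(tp, fn, fp, tn)
print(round(m, 3))
```

With these counts, accuracy is 0.825 while sensitivity is only 0.4. A model can look accurate overall and still miss most of the positive cases when the positive class is rare.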



sross15/bestclassifier documentation built on May 23, 2019, 7:19 a.m.