Home

/

GitHub

/

In sross15/bestclassifier: Identify Best Binary Classification Model

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Identify the best machine learning binary classification model

When creating binary classification models, you must:

+ identify your desired machine learning models

+ tune the parameters for each model

+ train each model on your training data

+ test the models on your testing data

+ compare the results to find the best model

bestclassifier facilitates this complex, arduous process by allowing you to complete all of those tasks in one function

+ This function supports eight elite machine learning binary classification models, including:

    - logistic regression

    - lasso regression

    - random forest

    - extreme gradient boosting

    - support vector machine

    - artificial neural network

    - latent dirichlet allocation

    - k nearest neighbors

Parameters in the best.classifier function

data

data specifies the dataframe on which the user would like to employ the binary classification models.
- Ex: data = CCD_data

form

form describes the formula the user would like to use to guide the binary classification models. The formula should be structured: dependent_variable ~ independent1 + independent2 + ...
- Ex: form = Class ~ .

p

p specifies the percentage of the dataset used to train the model. This parameter must be a number in between 0 and 1.
- Ex: p = .7

method

method specifies the resampling method instituted to train the binary classification models. The potential methods involve a series of boot and cross validation models, including boot, boot632, optimism_boot, boot_all, repeatedcv, LOOCV, LGOCV, none, oob, adaptive_cv, adaptive_boot, adaptive_LGOCV.
- Ex: method = "repeatedcv"

number

number indicates the amount of folds or repeated iterations employed to train the binary classification models. While more folds or sampling iterations will generate a more accurate result on the training dataset, increasing the number of folds can make the model suceptible to overfitting biases.
- Ex: number = 5

repeats

Solely pertinent to the repeated cross validation method, repeats indicates the number of complete sets of folds for k-fold cross validation. By creating more data partitions with k-subsets, the number parameter can dramatically reduce the variance of the training model but will also increase computation time.
- Ex: repeats = 1

tuneLength

tuneLength indicates the number of values the model will analyze for each tuning parameter in order to find the optimal tuning levels.
- tuneLength = 5

positive

positive reflects the argument of the dependent variable that the binary classification models aim to predict. This object must be a string.
- Ex: positive = "Default"

model

model indicates the specific binary classification models trained by bestclassifier. These options include logistic regresion, lasso regression, random forest, extreme gradient boosting, support vector machine, artificial neural network, latent dirichlet allocation, and k-nearest neighbors models.
- Ex: model = c("log_reg", "lasso", "rf", "svm", "xgboost", "ann", "lda", "knn")

set_seed

set_seed provides a reproducible random number generator so the user can run a model multiple times and receive identical results. This parameter must be a numeric object.
- Ex: set_seed = 1234

subset_train

After the data has been divided into a training and testing dataset, subset_train will reduce the size of the training data in order to generate the binary classification models more quickly. This parameter is used to decrease computation time when the training dataset is especially large and the user does not have time to wait for the results. However, performance can suffer when training a model based on a fraction of the training data. This parameter accepts numeric objects between 0 and 1.
- Ex: subset_train = .1

desired_metric

desired_metric specifies the metric employed to calculate the best binary classificaiton model. The available performance metrics include AUC or Accuracy.
- Ex: desired_metric = "ROC"
- Ex: desired_metric = "Accuracy"

Data: CCD_data

In order to explore the best.classifier function, we will use the CCD dataset. This dataset contains default status and payment information for all credit card customers transacting with a Taiwanese bank in 2005.

CCD <- bestclassifier::CCD
str(CCD)

Interpreting the output generated by best.classifier

In the example below, I am seeking the machine learning model that produces the highest AUC when classifying credit card default. These models will be predicting the "Default" category in the Class variable by using all of the predictors in the dataset. Because the CCD data contains nearly 30,000 observations, I am training the model on 1% of the training dataset for fast results.

library(bestclassifier)
bestclassifier(data = CCD, form = default.payment.next.month ~ ., p = 0.7, method = 
"repeatedcv", number = 5, repeats = 1, tuneLength = 5, 
positive ="Default", model = c("log_reg", "lasso", "lda", 
"svm", "lda", "knn", "ann", 
"xgboost"),
set_seed = 1234, subset_train = .01, desired_metric = "ROC")

Understanding the Bar Graph

According to the bar graph, the lasso regression model performed the best on the training data, depicting an AUC of .6495.

Analyzing the Confusion Matrix

Random Forest results on testing data:

+ Accuracy: 78.2%

+ Sensitivity: 3.1%

+ Specificity: 99.6%

+ Positive Predictive Value: 67.8%

+ Negative Predictive Value: 78.3%

Accuracy

Accuracy describes the total number of default statuses that the model classified correctly. In our model, an Accuracy of 78.2% indicates that the model correctly identified whether a customer did or did not default 78.2% of the time.

Sensitivity

Sensitivity indcates the percentage of total credit card defaults captured by the model. The model's sensitivity of 3.1% indicates that this model was able to predict 3.1% of the total defaults.

Specificity

Specificity depicts the percentage of total non-defaults detected by the model. The specificity value of 99.6% reveals that the model correctly classified 99.6% of clients who did not default.

Positive Predictive Value

Positive Predictive Value indicates the accuracy of the model when it predicted that a customer would default. The PPV score of 67.8% demonstrates that the model was 67.8% accurate when it predicted that the customer would default.

Negative Predictive Value

Negative Predictive Value describes the model's precision when it forecasted that a customer would not default. As depicted in the model, an NPV score of 78.3% illustrates that the model was 78.3% accurate when it predicted that the client would not default.

sross15/bestclassifier documentation built on May 23, 2019, 7:19 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

sross15/bestclassifier
Identify Best Binary Classification Model

In sross15/bestclassifier: Identify Best Binary Classification Model

Identify the best machine learning binary classification model

Parameters in the best.classifier function

data

form

p

method

number

repeats

tuneLength

positive

model

set_seed

subset_train

desired_metric

Data: CCD_data

Interpreting the output generated by best.classifier

Understanding the Bar Graph

Analyzing the Confusion Matrix

Accuracy

Sensitivity

Specificity

Positive Predictive Value

Negative Predictive Value

R Package Documentation

Browse R Packages

We want your feedback!

sross15/bestclassifier Identify Best Binary Classification Model

In sross15/bestclassifier: Identify Best Binary Classification Model

Identify the best machine learning binary classification model

Parameters in the best.classifier function

data

form

p

method

number

repeats

tuneLength

positive

model

set_seed

subset_train

desired_metric

Data: CCD_data

Interpreting the output generated by best.classifier

Understanding the Bar Graph

Analyzing the Confusion Matrix

Accuracy

Sensitivity

Specificity

Positive Predictive Value

Negative Predictive Value

R Package Documentation

Browse R Packages

We want your feedback!

sross15/bestclassifier
Identify Best Binary Classification Model