An S3 class object that creates subsets of the testing set labels and calculates various discrimination metrics, either built-in or user-defined, while maintaining a consistent sample size across all metrics and bin frequencies.
Use the following code to install the package from GitHub. You will need the devtools package:
devtools::install_github("ArjunChattoraj10/imbDis")
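If devtools is not already installed, it can be obtained from CRAN first. A minimal sketch of the full install-and-load sequence:

# Install devtools from CRAN if needed, then install imbDis from GitHub
install.packages("devtools")
devtools::install_github("ArjunChattoraj10/imbDis")
library(imbDis)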
In the world of Machine Learning, many applications rely on Binary Classification algorithms, which answer questions such as, “Is that symbol on the road a stop sign?”, “Will the company make a profit or a loss this quarter?”, or “Is the patient at risk of a heart attack?”. Logistic Regression, Support Vector Machines, Decision Trees, Random Forests and Neural Networks are a few examples of Binary Classifiers, and their effectiveness is measured via various discrimination metrics, such as the ROC AUC, Brier Score, Cross-Entropy loss, etc.
However, one issue that remains is a lack of tools that measure the quality of a model across different populations, that is, when the proportions of the outcome classes differ from those in the dataset the model was trained on.
imbDis is a solution to that. It varies the class imbalance and calculates discrimination metrics for each imbalance frequency. The discrimination metric may be one of the pre-defined methods within imbDis, or any other function, whether from another package or user-defined.
To build an object of the class, use the constructor function imbDis(). This function has 4 arguments:
labels: A vector of the ground-truth class labels.
pred: A vector of values corresponding to predictions. These values should be probabilities, but may be predicted labels as well.
case: A value present in the labels argument that denotes the case class label.
bin: A vector of values (larger than 0 and smaller than 1) that represent the different relative frequencies of the case class. It has a default value of seq(0.05, 0.5, 0.05), which is a vector of values from 0.05 (5%) to 0.5 (50%) in steps of 0.05.

To define the object, run the following command:
imbDis(labels, pred, case, bin)
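For instance, a minimal sketch with simulated labels and probabilities (the data, the seed and the object name imb are illustrative, not taken from the package):

# Simulated ground-truth labels (1 = case, 0 = control) and predicted probabilities
set.seed(1)
labels <- rbinom(200, size = 1, prob = 0.3)
pred <- ifelse(labels == 1, rbeta(200, 4, 2), rbeta(200, 2, 4))

# Construct the object; omitting bin uses the default seq(0.05, 0.5, 0.05)
imb <- imbDis(labels, pred, 1)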
imbDis performs some additional calculations in the background to obtain values that are helpful during the metric calculations. These values can be accessed using the $ notation or the [["..."]] notation.
These additional values are:
control: The control label present in labels.
labs_01: A conversion of the provided labels to a 0-1 vector, using the case argument.
sample.size: A consistent sample size for all metrics and bin frequencies.
bin.caseSizes: The number of cases attributed to each bin frequency.

After the imbDis-type object is created, you can calculate the associated metrics. Built-in methods include auc, brier and logLoss. Each of these simply takes the imbDis object as its argument and returns a data.frame with the columns: bins, metric, number of samples.
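Continuing the sketch above with the illustrative imb object, the background values and the built-in metrics can be used as follows (the field and function names are those listed in this README):

# Background values computed by the constructor
imb$control              # the control class label
imb$sample.size          # consistent sample size used for every bin
imb[["bin.caseSizes"]]   # number of cases per bin frequency

# Built-in metrics: each returns a data.frame of bins, metric value and sample count
auc(imb)
brier(imb)
logLoss(imb)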
There is also a method manualMetric, which takes as arguments the imbDis object and a function in the format f(labels, preds). It is there to allow more freedom to perform different types of analyses. As long as you maintain that format, the manualMetric method will perform as expected. manualMetric also lets you use predicted labels instead of probabilities.
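As a sketch, the mean-squared-error function below is illustrative (it is not part of the package), but it follows the f(labels, preds) format, so it can be passed to manualMetric. It assumes 0-1 labels and predicted probabilities, as in the examples in this README.

# A user-defined metric in the f(labels, preds) format:
# mean squared error between 0-1 labels and predicted probabilities
mse <- function(labels, preds) mean((labels - preds)^2)

# Evaluate it for every bin frequency of the imbDis object
manualMetric(imb, mse)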
Here is a simple example using the mtcars dataset from base R.

We will start by defining the model, which in this case is a Logistic Regression model. The vs variable is binary, so we can use it as the response. After we train the model, we can obtain predictions. Here we are predicting on the same dataset we trained on, which is equivalent to calculating the model's fitted values.
LR = glm(vs ~ mpg + hp + wt, data = mtcars, family = "binomial")
preds = predict(LR, mtcars, type = "response")
Now that we have the model and the predictions, we can define an imbDis object. We provide the following arguments:

labels: mtcars$vs
pred: preds
case: 1
bin: seq(0.1, 0.9, 0.1)

imbD = imbDis(mtcars$vs, preds, 1, seq(0.1, 0.9, 0.1))
Now we can calculate the metrics for this object. Let's calculate the Brier Score here:
brier(imbD)
The output is below. There is randomization involved, so your values may not match these exactly:
bins brier n_samps
1 0.1 0.11136114 16
2 0.2 0.06845671 16
3 0.3 0.04766746 16
4 0.4 0.12044719 16
5 0.5 0.04166886 16
6 0.6 0.08927128 16
7 0.7 0.05264758 16
8 0.8 0.08150856 16
9 0.9 0.05096148 16
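The other built-in metrics follow the same pattern; a sketch (output omitted, and it will likewise vary because of the randomization):

auc(imbD)       # ROC AUC for each bin frequency
logLoss(imbD)   # cross-entropy loss for each bin frequency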
More examples are within the utils directory of this repository. Within it, you will find a folder called examples, which contains multiple .R files that simulate data, split it into training/testing sets, train models and predict results, which are then evaluated using imbDis.
The files present are:
data_simulate.R: This file simulates data from a logistic regression model. There are 10 covariates: some are linear, some have interactions and others are polynomial. Some plots are included to show how the data is distributed. The simulated data is saved in the data folder in utils under the name simulated_data.csv, a comma-separated data file.
train_test.R: A file that reads in the simulated data and splits it such that 70% forms the training set and 30% the testing set.
logistic_regression.R, decision_tree.R & random_forest.R: These three files train Logistic Regression, Decision Tree and Random Forest models respectively. Each file trains multiple models using different subsets of the covariates. The prediction results from each model are saved in comma-separated value files within the data directory: res_LR.R, res_DT.R and res_RF.R.
Since Decision Trees and Random Forests capture interactions and non-linearity through their algorithms, the files for those models each contain two distinct models. However, Logistic Regression requires non-linear terms to be specified explicitly, so its file has three different models. The last of these uses the same model specification that was used to simulate the data, and is therefore expected to perform the best out of all 7 models.
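For orientation, here is a condensed sketch of the workflow those files follow; the file path, the response name y, the formula and the seed are illustrative, not the exact code in the repository:

# Read the simulated data and make a 70/30 training/testing split
dat <- read.csv("utils/data/simulated_data.csv")
set.seed(42)
train_idx <- sample(nrow(dat), size = round(0.7 * nrow(dat)))
train <- dat[train_idx, ]
test <- dat[-train_idx, ]

# Train one of the models (illustrative logistic regression formula)
fit <- glm(y ~ ., data = train, family = "binomial")
p <- predict(fit, test, type = "response")

# Evaluate the predictions across varying case frequencies
imb_sim <- imbDis(test$y, p, 1, seq(0.05, 0.5, 0.05))
auc(imb_sim)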
The package is released under the Apache License, Version 2.0. Refer to LICENSE.txt for additional information.