Binary classification modeling with alookr
Features:

- Clean the dataset before fitting the classification model.
- Split the dataset into a train set and a test set, and compare the two.
- Resolve imbalanced classes with under-sampling, over-sampling, and SMOTE.
- Fit several representative binary classification models at once, then compare their performance and predict with the best one.
The name alookr comes from "looking at the analytics process" in the data analysis workflow.
The released version will be available on CRAN, but it is not published yet:
install.packages("alookr")
Or you can get the development version without vignettes from GitHub:
devtools::install_github("choonghyunryu/alookr")
Or you can get the development version with vignettes from GitHub:
install.packages(c("ISLR", "spelling", "mlbench"))
devtools::install_github("choonghyunryu/alookr", build_vignettes = TRUE)
alookr includes several vignette files, which we use throughout the documentation.
The provided vignettes can be listed as follows:
browseVignettes(package = "alookr")
To illustrate the basic use of the alookr package, create data_exam with the sample() function. The data_exam dataset includes the following 5 variables:
- id : character
- year : character
- count : numeric
- alpha : character
- flag : character

# create sample dataset
set.seed(123L)
id <- sapply(1:1000, function(x)
  paste(c(sample(letters, 5), x), collapse = ""))
year <- "2018"
set.seed(123L)
count <- sample(1:10, size = 1000, replace = TRUE)
set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)
set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)

data_exam <- data.frame(id, year, count, alpha, flag, stringsAsFactors = FALSE)

# structure of dataset
str(data_exam)

# summary of dataset
summary(data_exam)
cleanse()
cleans up the dataset before fitting the classification model.
The operations performed by cleanse() are described below.
For example, we can cleanse all variables in data_exam:
library(alookr)

# cleansing dataset
newDat <- cleanse(data_exam)

# structure of cleansing dataset
str(newDat)
- remove variables whose unique value is one: The year variable has only one value, "2018". It is not needed when fitting the model, so it was removed.
- remove variables with a high unique rate: If a categorical variable has a very large number of levels, it is not suitable for the classification model; it is highly likely to be an identifier of the data. So categorical (or character) variables are removed when their unique rate, defined as "number of levels / number of observations", is high (see the sketch below).
- converts character variables to factor: The character type flag variable is converted to a factor type.
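The unique rate itself is easy to compute by hand. The following is a minimal sketch in plain R (not alookr's internal code) of the "number of levels / number of observations" calculation for data_exam:

# unique rate of each variable: number of distinct values / number of observations
uniq_rate <- sapply(data_exam, function(x) length(unique(x)) / length(x))
uniq_rate

# id has a unique rate of 1 (1000 levels / 1000 observations), so it is an identifier;
# alpha has a unique rate of 26/1000 = 0.026, which is why it survives the
# threshold of 0.03 in the next example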
For example, we can keep categorical data that would otherwise be removed by changing the threshold of the unique rate:
# cleansing dataset
newDat <- cleanse(data_exam, uniq_thres = 0.03)

# structure of cleansing dataset
str(newDat)
The alpha
variable was not removed.
If you do not want to apply a unique rate, you can set the value of the uniq
argument to FALSE:
# cleansing dataset
newDat <- cleanse(data_exam, uniq = FALSE)

# structure of cleansing dataset
str(newDat)
If you do not want to force type conversion of a character variable to factor, you can set the value of the char
argument to FALSE:
# cleansing dataset
newDat <- cleanse(data_exam, char = FALSE)

# structure of cleansing dataset
str(newDat)
If you want to remove a variable that contains missing values, specify the value of the missing
argument as TRUE. The following example removes the flag variable, which contains a missing value.
data_exam$flag[1] <- NA

# cleansing dataset
newDat <- cleanse(data_exam, missing = TRUE)

# structure of cleansing dataset
str(newDat)
In a linear model, multicollinearity arises when there is a strong correlation between independent variables. So it is better to remove one variable of any pair in which the correlation exists.
Even if the model is not linear, removing one variable from a strongly correlated pair reduces computational overhead and makes the model easier to interpret.
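As a quick illustration of the idea (hypothetical data in plain R, independent of alookr), a correlation matrix shows which pair should lose a member:

# hypothetical example: v is nearly a copy of u, while w is unrelated
set.seed(42)
u <- rnorm(100)
v <- u + rnorm(100, sd = 0.1)
w <- rnorm(100)

# cor(u, v) is close to 1, so one of u or v should be dropped
round(cor(data.frame(u, v, w)), 2)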
treatment_corr()
treatment_corr()
diagnoses pairs of highly correlated variables or removes one of them.
treatment_corr()
calculates the Pearson correlation coefficient for numerical variables and the Spearman correlation coefficient for categorical variables.
For example, we can diagnose and remove highly correlated variables:
# numerical variables
x1 <- 1:100
set.seed(12L)
x2 <- sample(1:3, size = 100, replace = TRUE) * x1 + rnorm(1)
set.seed(1234L)
x3 <- sample(1:2, size = 100, replace = TRUE) * x1 + rnorm(1)

# categorical variables
x4 <- factor(rep(letters[1:20], times = 5))
set.seed(100L)
x5 <- factor(rep(letters[1:20 + sample(1:6, size = 20, replace = TRUE)], times = 5))
set.seed(200L)
x6 <- factor(rep(letters[1:20 + sample(1:3, size = 20, replace = TRUE)], times = 5))
set.seed(300L)
x7 <- factor(sample(letters[1:5], size = 100, replace = TRUE))

exam <- data.frame(x1, x2, x3, x4, x5, x6, x7)
str(exam)
head(exam)

# default case
exam_01 <- treatment_corr(exam)
head(exam_01)

# not removing variables
treatment_corr(exam, treat = FALSE)

# set a threshold to detect variables whose correlation is greater than 0.9
treatment_corr(exam, corr_thres = 0.9, treat = FALSE)

# not verbose mode
exam_02 <- treatment_corr(exam, verbose = FALSE)
head(exam_02)
- remove variables with strong correlation: x1, x4, and x5 are removed.
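Conceptually, this diagnosis can be approximated with base R. The following sketch (not treatment_corr()'s actual implementation) computes Pearson correlations for the numeric columns of exam and Spearman correlations on the integer codes of its factor columns; treating factor codes as ranks is only a rough heuristic for nominal variables:

# Pearson correlation for the numeric variables
round(cor(exam[c("x1", "x2", "x3")], method = "pearson"), 2)

# Spearman correlation on the integer codes of the categorical variables
round(cor(sapply(exam[c("x4", "x5", "x6", "x7")], as.integer),
          method = "spearman"), 2)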
The Default dataset of the ISLR package is a simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt.
A data frame with 10000 observations on the following 4 variables:

- default : factor. A factor with levels No and Yes indicating whether the customer defaulted on their debt
- student : factor. A factor with levels No and Yes indicating whether the customer is a student
- balance : numeric. The average balance that the customer has remaining on their credit card after making their monthly payment
- income : numeric. Income of customer

# Credit Card Default Data
head(ISLR::Default)

# structure of dataset
str(ISLR::Default)

# summary of dataset
summary(ISLR::Default)
split_by()
splits the data.frame or tbl_df into a training set and a test set.
The split_df class object is created by split_by(); it contains the split information and the criteria used to separate the training and the test set.
library(alookr)
library(dplyr)

# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default, seed = 6534)

sb
The attributes of the split_df
class are as follows:
attr_names <- names(attributes(sb))
attr_names

sb_attr <- attributes(sb)

# The third attribute, row.names, is excluded from the output because it is very long.
sb_attr[!attr_names %in% "row.names"]
summary()
summarizes the information of the two datasets split by split_by().
summary(sb)
Train data and test data should be similar. If the two datasets are not similar, the performance of the predictive model may be reduced.
alookr
provides a function to compare the similarity between train dataset and test dataset.
If the two data sets are not similar, the train dataset and test dataset should be split again from the original data.
compare_target_category()
compares the statistics of the categorical variables of the train set and test set included in the split_df class object.
sb %>%
  compare_target_category()

# compare variables that are character data types.
sb %>%
  compare_target_category(add_character = TRUE)

# display marginal
sb %>%
  compare_target_category(margin = TRUE)

# student variable only
sb %>%
  compare_target_category(student)

sb %>%
  compare_target_category(student, margin = TRUE)
compare_target_category() returns a tbl_df containing the comparison statistics for each level of each categorical variable.
compare_target_numeric()
compares the statistics of the numerical variables of the train set and test set included in the split_df class object.
sb %>%
  compare_target_numeric()

# balance variable only
sb %>%
  compare_target_numeric(balance)
compare_target_numeric() returns a tbl_df containing the comparison statistics for each numerical variable.
compare_plot()
plots the comparison information of the train set and test set included in the split_df class object.
# income variable only
sb %>%
  compare_plot("income")

# all variables
sb %>%
  compare_plot()
compare_diag()
diagnoses the similarity between the train set and test set included in the split_df class object.
defaults <- ISLR::Default
defaults$id <- seq(NROW(defaults))

set.seed(1)
defaults[sample(seq(NROW(defaults)), 3), "student"] <- NA
set.seed(2)
defaults[sample(seq(NROW(defaults)), 10), "balance"] <- NA

sb_2 <- defaults %>%
  split_by(default)

sb_2 %>%
  compare_diag()

sb_2 %>%
  compare_diag(add_character = TRUE)

sb_2 %>%
  compare_diag(uniq_thres = 0.0005)
If you compare the train set with the test set and find that the two datasets are similar, extract the data from the split_df object.
extract_set()
extracts the train set or test set from a split_df class object.
train <- sb %>%
  extract_set(set = "train")

test <- sb %>%
  extract_set(set = "test")

dim(train)
dim(test)
sampling_target()
When the ratio of the minority class to the majority class in the target variable is very small, the data is said to have an imbalanced class.
If the target variable is an imbalanced class, the characteristics of the majority class dominate the model. Such a model tends to mispredict the minority class as the majority class, so we have to make the train dataset a balanced class.
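For example, the target of the Default dataset used above is imbalanced, which a plain frequency table already shows:

# the minority class "Yes" accounts for only about 3% of the observations
table(ISLR::Default$default)
prop.table(table(ISLR::Default$default))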
sampling_target()
performs sampling on the train set of split_df to resolve the imbalanced class.
# under-sampling with random seed
under <- sb %>%
  sampling_target(seed = 1234L)

under %>%
  count(default)

# under-sampling with random seed, and minority class frequency is 40%
under40 <- sb %>%
  sampling_target(seed = 1234L, perc = 40)

under40 %>%
  count(default)

# over-sampling with random seed
over <- sb %>%
  sampling_target(method = "ubOver", seed = 1234L)

over %>%
  count(default)

# over-sampling with random seed, and k = 10
over10 <- sb %>%
  sampling_target(method = "ubOver", seed = 1234L, k = 10)

over10 %>%
  count(default)

# SMOTE with random seed
smote <- sb %>%
  sampling_target(method = "ubSMOTE", seed = 1234L)

smote %>%
  count(default)

# SMOTE with random seed, and perc.under = 250
smote250 <- sb %>%
  sampling_target(method = "ubSMOTE", seed = 1234L, perc.under = 250)

smote250 %>%
  count(default)
The method argument of sampling_target() specifies the sampling method: "ubUnder" is under-sampling, "ubOver" is over-sampling, and "ubSMOTE" is SMOTE (Synthetic Minority Over-sampling TEchnique).
BreastCancer of the mlbench package is a breast cancer dataset. The objective is to classify each observation as benign or malignant.
It is a data frame with 699 observations on 11 variables: one character variable, 9 ordered or nominal variables, and 1 target class:
- Id : character. Sample code number
- Cl.thickness : ordered factor. Clump Thickness
- Cell.size : ordered factor. Uniformity of Cell Size
- Cell.shape : ordered factor. Uniformity of Cell Shape
- Marg.adhesion : ordered factor. Marginal Adhesion
- Epith.c.size : ordered factor. Single Epithelial Cell Size
- Bare.nuclei : factor. Bare Nuclei
- Bl.cromatin : factor. Bland Chromatin
- Normal.nucleoli : factor. Normal Nucleoli
- Mitoses : factor. Mitoses
- Class : factor. Class. The levels are benign and malignant.

library(mlbench)
data(BreastCancer)

# class of each variables
sapply(BreastCancer, function(x) class(x)[1])
Perform data preprocessing as follows:
dlookr::imputate_na()
Find the variables that include missing values, and impute the missing values using imputate_na() in the dlookr package.
library(dlookr)
library(dplyr)

# variables that have a missing value
diagnose(BreastCancer) %>%
  filter(missing_count > 0)

# imputation of missing value
breastCancer <- BreastCancer %>%
  mutate(Bare.nuclei = imputate_na(BreastCancer, Bare.nuclei, Class,
                                   method = "mice", no_attrs = TRUE,
                                   print_flag = FALSE))
split_by()
split_by()
in the alookr package splits the dataset into a train set and a test set.
The ratio argument of the split_by()
function specifies the ratio of the train set.
split_by()
creates a class object named split_df.
library(alookr)

# split the data into a train set and a test set by default arguments
sb <- breastCancer %>%
  split_by(target = Class)

# show the class name
class(sb)

# split the data into a train set and a test set by ratio = 0.6
tmp <- breastCancer %>%
  split_by(Class, ratio = 0.6)
The summary()
function displays the following useful information about the split_df object:
# summary() displays some useful information
summary(sb)

# summary() displays some useful information
summary(tmp)
In the case of categorical variables, when a train set and a test set are separated, a specific level may be missing from the train set.
In this case, there is no problem when fitting the model, but an error occurs when predicting with the model you created. Therefore, preprocessing is performed to avoid levels that are missing from the train set.
In the following example, fortunately, there is no categorical variable that contains the missing levels in the train set.
# list of categorical variables in the train set that contain missing levels
nolevel_in_train <- sb %>%
  compare_target_category() %>%
  filter(is.na(train)) %>%
  select(variable) %>%
  unique() %>%
  pull

nolevel_in_train

# if any of the categorical variables in the train set contain a missing level,
# split again.
while (length(nolevel_in_train) > 0) {
  sb <- breastCancer %>%
    split_by(Class)

  nolevel_in_train <- sb %>%
    compare_target_category() %>%
    filter(is.na(train)) %>%
    select(variable) %>%
    unique() %>%
    pull
}
sampling_target()
Imbalanced class (level) data means that the frequency of one level of the target variable is relatively small. In general, the proportion of the positive class is relatively small. For example, in a model predicting spam, the spam class of interest is rarer than non-spam.
Imbalanced classes data is a common problem in machine learning classification.
table()
and prop.table()
are traditionally useful functions for diagnosing imbalanced classes data. However, alookr's summary()
is simpler and provides more information.
# train set frequency table - imbalanced classes data
table(sb$Class)

# train set relative frequency table - imbalanced classes data
prop.table(table(sb$Class))

# using summary function - imbalanced classes data
summary(sb)
Most machine learning algorithms work best when the number of samples in each class is about equal, and most algorithms are designed to maximize accuracy and reduce error. So we need to handle the imbalanced class problem.
sampling_target() performs sampling to solve an imbalanced classes data problem.
Oversampling can be defined as adding more copies of the minority class.
Oversampling is performed by specifying "ubOver" in the method argument of the sampling_target()
function.
# to balance by over-sampling
train_over <- sb %>%
  sampling_target(method = "ubOver")

# frequency table
table(train_over$Class)
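To make the idea concrete, here is a toy illustration of oversampling in base R (not the ubOver implementation): randomly chosen minority rows are duplicated until the class counts match.

# toy data: 8 majority and 2 minority observations
y <- c(rep("benign", 8), rep("malignant", 2))

# duplicate randomly chosen minority rows until both classes have 8 samples
set.seed(1)
extra <- sample(which(y == "malignant"), size = 6, replace = TRUE)
table(c(y, y[extra]))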
Undersampling can be defined as removing some observations of the majority class.
Undersampling is performed by specifying "ubUnder" in the method argument of the sampling_target()
function.
# to balance by under-sampling
train_under <- sb %>%
  sampling_target(method = "ubUnder")

# frequency table
table(train_under$Class)
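The corresponding toy illustration of undersampling (again base R, not the ubUnder implementation) randomly drops majority rows instead:

# toy data: 8 majority and 2 minority observations
y <- c(rep("benign", 8), rep("malignant", 2))

# keep only 2 randomly chosen majority rows
set.seed(1)
keep <- sample(which(y == "benign"), size = 2)
table(y[c(keep, which(y == "malignant"))])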
SMOTE (Synthetic Minority Oversampling Technique) uses a nearest neighbors algorithm to generate new, synthetic data.
SMOTE is performed by specifying "ubSMOTE" in the method argument of the sampling_target()
function.
# to balance by SMOTE
train_smote <- sb %>%
  sampling_target(seed = 1234L, method = "ubSMOTE")

# frequency table
table(train_smote$Class)
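The core of SMOTE can also be sketched in a few lines (an illustration of the interpolation idea only, not the ubSMOTE implementation): a synthetic observation is a random point on the segment between a minority sample and one of its nearest minority neighbors.

# a minority sample and its nearest minority neighbor (two numeric features)
p1 <- c(1.0, 2.0)
p2 <- c(1.4, 2.6)

# synthetic point: random interpolation between p1 and p2
set.seed(1)
lambda <- runif(1)
synthetic <- p1 + lambda * (p2 - p1)
synthetic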
cleanse()
The cleanse() function cleanses the dataset for classification modeling. It is useful when fitting the classification model and performs the operations described in the cleansing section above.
In this example, the cleanse() function removes the Id variable, which has a high unique rate.
# clean the training set
train <- train_smote %>%
  cleanse
extract_set()
# extract test set
test <- sb %>%
  extract_set(set = "test")
run_models()
run_models()
performs some representative binary classification modeling using the split_df object created by split_by().
run_models()
executes the fitting process in parallel. However, parallel processing is not supported on the MS-Windows operating system or in the RStudio environment.
Currently supported algorithms are as follows:
- logistic regression using the stats package
- recursive partitioning trees using the rpart package
- conditional inference trees using the party package
- random forest using the randomForest package
- a fast implementation of random forest using the ranger package

run_models() returns a model_df class object. Among the variables of the model_df class object is step; because the object is the result of calling run_models(), the value of step is "1.Fitted".

result <- train %>%
  run_models(target = "Class", positive = "malignant")
result
Evaluate the predictive performance of fitted models.
run_predict()
run_predict()
predicts the test set using the model_df class object fitted by run_models().
run_predict() executes predictions in parallel by model. However, parallel processing is not supported on the MS-Windows operating system or in the RStudio environment.
The returned model_df class object again contains the step variable; because the object is the result of calling run_predict(), the value of step is "2.Predicted".

pred <- result %>%
  run_predict(test)
pred
run_performance()
run_performance()
calculates the performance metrics of the model_df class object predicted by run_predict().
run_performance() computes the performance evaluation metrics in parallel. However, parallel processing is not supported on the MS-Windows operating system or in the RStudio environment.
The returned model_df class object again contains the step variable; because the object is the result of calling run_performance(), the value of step is "3.Performanced".

# Calculate performance metrics.
perf <- run_performance(pred)
perf
The performance variable contains a list object, which contains 15 performance metrics:
# Performance by analytics models
performance <- perf$performance
names(performance) <- perf$model_id
performance
If you change the list object to tidy format, you'll see the following at a glance:
# Convert to matrix to compare performance.
sapply(performance, "c")
compare_performance()
returns a list object containing the results of the model performance comparison.
In this example, compare_performance()
recommends the "ranger" model.
# Compare the performance metrics of each model
comp_perf <- compare_performance(pred)
comp_perf
plot_performance()
plots the ROC curve of each model.
# Plot ROC curve
plot_performance(pred)
In general, if the predicted probability is greater than 0.5 in a binary classification model, the observation is predicted as the positive class. In other words, 0.5 is used as the cut-off value.
This applies to most model algorithms. However, in some cases, the performance can be tuned by changing the cut-off value.
plot_cutoff()
visualizes a plot to select the cut-off value, and returns the cut-off value.
pred_best <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>%
  select(predicted) %>%
  pull %>%
  .[[1]] %>%
  attr("pred_prob")

cutoff <- plot_cutoff(pred_best, test$Class, "malignant", type = "mcc")
cutoff

cutoff2 <- plot_cutoff(pred_best, test$Class, "malignant", type = "density")
cutoff2

cutoff3 <- plot_cutoff(pred_best, test$Class, "malignant", type = "prob")
cutoff3
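To see what moving the cut-off changes, the following sketch (plain R, using pred_best from the chunk above) classifies the predicted probabilities manually at the default 0.5 and at the tuned cutoff value, and counts the predictions that flip:

# manual classification at the default and tuned cut-off values
manual_default <- ifelse(pred_best > 0.5, "malignant", "benign")
manual_tuned <- ifelse(pred_best > cutoff, "malignant", "benign")

# number of test observations whose predicted class changes
sum(manual_default != manual_tuned)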
performance_metric()
Compare the performance of the original prediction with that of the tuned cut-off.
Compare the tuned cut-off with the default cut-off for the best-performing model, comp_perf$recommend_model:
comp_perf$recommend_model

# extract predicted probability
idx <- which(pred$model_id == comp_perf$recommend_model)
pred_prob <- attr(pred$predicted[[idx]], "pred_prob")

# or, extract predicted probability using dplyr
pred_prob <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>%
  select(predicted) %>%
  pull %>%
  "[["(1) %>%
  attr("pred_prob")

# predicted probability
pred_prob

# compare Accuracy
performance_metric(pred_prob, test$Class, "malignant", "Accuracy")
performance_metric(pred_prob, test$Class, "malignant", "Accuracy",
                   cutoff = cutoff)

# compare Confusion Matrix
performance_metric(pred_prob, test$Class, "malignant", "ConfusionMatrix")
performance_metric(pred_prob, test$Class, "malignant", "ConfusionMatrix",
                   cutoff = cutoff)

# compare F1 Score
performance_metric(pred_prob, test$Class, "malignant", "F1_Score")
performance_metric(pred_prob, test$Class, "malignant", "F1_Score",
                   cutoff = cutoff)
performance_metric(pred_prob, test$Class, "malignant", "F1_Score",
                   cutoff = cutoff2)
If the performance of the tuned cut-off is good, use it as a cut-off to predict positives.
If you have selected a good model from several models, then perform the prediction with that model.
Create sample data for prediction by extracting 50 samples from the data set used in the previous under-sampling example.
data_pred <- train_under %>%
  cleanse

set.seed(1234L)
data_pred <- data_pred %>%
  nrow %>%
  seq %>%
  sample(size = 50) %>%
  data_pred[., ]
Do a prediction using the dplyr
package. The last factor()
function eliminates unnecessary information.
pred_actual <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>%
  run_predict(data_pred) %>%
  select(predicted) %>%
  pull %>%
  "[["(1) %>%
  factor()

pred_actual
If you want to predict by cut-off, specify the cutoff
argument in the run_predict()
function as follows:
In this example, there is no difference between the results with and without the cut-off.
pred_actual2 <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>%
  run_predict(data_pred, cutoff) %>%
  select(predicted) %>%
  pull %>%
  "[["(1) %>%
  factor()

pred_actual2

sum(pred_actual != pred_actual2)