Binary classification modeling with alookr
Features:

- Clean the dataset before fitting the classification model.
- Split the dataset into a train set and a test set, and compare the two.
- Resolve imbalanced classes with under-sampling, over-sampling, and SMOTE.
- Fit several representative binary classification models at once, then compare their performance and predict with the best one.
The name alookr comes from "looking at the analytics process" in the data analysis workflow.
The released version will be available on CRAN, but it is not published yet:
install.packages("alookr")
Or you can get the development version without vignettes from GitHub:
devtools::install_github("choonghyunryu/alookr")
Or you can get the development version with vignettes from GitHub:
install.packages(c("ISLR", "spelling", "mlbench"))
devtools::install_github("choonghyunryu/alookr", build_vignettes = TRUE)
alookr includes several vignette files, which we use throughout the documentation.
The provided vignettes can be listed as follows:
browseVignettes(package = "alookr")
To illustrate the basic use of the alookr package, create data_exam with the sample() function. The data_exam dataset includes the following 5 variables:
- id : character
- year : character
- count : numeric
- alpha : character
- flag : character

# create sample dataset
set.seed(123L)
id <- sapply(1:1000, function(x)
  paste(c(sample(letters, 5), x), collapse = ""))
year <- "2018"
set.seed(123L)
count <- sample(1:10, size = 1000, replace = TRUE)
set.seed(123L)
alpha <- sample(letters, size = 1000, replace = TRUE)
set.seed(123L)
flag <- sample(c("Y", "N"), size = 1000, prob = c(0.1, 0.9), replace = TRUE)

data_exam <- data.frame(id, year, count, alpha, flag, stringsAsFactors = FALSE)

# structure of dataset
str(data_exam)

# summary of dataset
summary(data_exam)
cleanse()
cleans up the dataset before fitting the classification model.
The operations performed by cleanse() are described below.
For example, we can cleanse all variables in data_exam:
library(alookr)

# cleansing dataset
newDat <- cleanse(data_exam)

# structure of cleansing dataset
str(newDat)
- remove variables whose unique value is one: The year variable has only one value, "2018". It is not needed when fitting the model, so it was removed.
- remove variables with a high unique rate: If a categorical variable has a very large number of levels, it is not suitable for the classification model; it is highly likely to be an identifier of the data. So categorical (or character) variables are removed when their unique rate, defined as "number of levels / number of observations", is high (see the sketch below).
- converts character variables to factor: The character type flag variable is converted to a factor type.
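The unique rate itself is easy to compute by hand. The following is a minimal sketch in plain R (not alookr's internal code) of the "number of levels / number of observations" calculation for data_exam:

# unique rate of each variable: number of distinct values / number of observations
uniq_rate <- sapply(data_exam, function(x) length(unique(x)) / length(x))
uniq_rate

# id has a unique rate of 1 (1000 levels / 1000 observations), so it is an identifier;
# alpha has a unique rate of 26/1000 = 0.026, which is why it survives the
# threshold of 0.03 in the next example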
For example, we can keep categorical data that would otherwise be removed by changing the threshold of the unique rate:
# cleansing dataset
newDat <- cleanse(data_exam, uniq_thres = 0.03)

# structure of cleansing dataset
str(newDat)
The alpha
variable was not removed.
If you do not want to apply a unique rate, you can set the value of the uniq
argument to FALSE:
# cleansing dataset
newDat <- cleanse(data_exam, uniq = FALSE)

# structure of cleansing dataset
str(newDat)
If you do not want to force type conversion of a character variable to factor, you can set the value of the char
argument to FALSE:
# cleansing dataset
newDat <- cleanse(data_exam, char = FALSE)

# structure of cleansing dataset
str(newDat)
If you want to remove a variable that contains missing values, specify the value of the missing
argument as TRUE. The following example removes the flag variable, which contains a missing value.
data_exam$flag[1] <- NA

# cleansing dataset
newDat <- cleanse(data_exam, missing = TRUE)

# structure of cleansing dataset
str(newDat)
In a linear model, multicollinearity arises when there is a strong correlation between independent variables. So it is better to remove one variable of any pair in which the correlation exists.
Even if the model is not linear, removing one variable from a strongly correlated pair reduces computational overhead and makes the model easier to interpret.
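As a quick illustration of the idea (hypothetical data in plain R, independent of alookr), a correlation matrix shows which pair should lose a member:

# hypothetical example: v is nearly a copy of u, while w is unrelated
set.seed(42)
u <- rnorm(100)
v <- u + rnorm(100, sd = 0.1)
w <- rnorm(100)

# cor(u, v) is close to 1, so one of u or v should be dropped
round(cor(data.frame(u, v, w)), 2)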
treatment_corr()
treatment_corr()
diagnoses pairs of highly correlated variables or removes one of them.
treatment_corr()
calculates the Pearson correlation coefficient for numerical variables and the Spearman correlation coefficient for categorical variables.
For example, we can diagnose and remove highly correlated variables:
# numerical variables
x1 <- 1:100
set.seed(12L)
x2 <- sample(1:3, size = 100, replace = TRUE) * x1 + rnorm(1)
set.seed(1234L)
x3 <- sample(1:2, size = 100, replace = TRUE) * x1 + rnorm(1)

# categorical variables
x4 <- factor(rep(letters[1:20], times = 5))
set.seed(100L)
x5 <- factor(rep(letters[1:20 + sample(1:6, size = 20, replace = TRUE)], times = 5))
set.seed(200L)
x6 <- factor(rep(letters[1:20 + sample(1:3, size = 20, replace = TRUE)], times = 5))
set.seed(300L)
x7 <- factor(sample(letters[1:5], size = 100, replace = TRUE))

exam <- data.frame(x1, x2, x3, x4, x5, x6, x7)
str(exam)
head(exam)

# default case
exam_01 <- treatment_corr(exam)
head(exam_01)

# not removing variables
treatment_corr(exam, treat = FALSE)

# set a threshold to detect variables whose correlation is greater than 0.9
treatment_corr(exam, corr_thres = 0.9, treat = FALSE)

# not verbose mode
exam_02 <- treatment_corr(exam, verbose = FALSE)
head(exam_02)
- remove variables with strong correlation: x1, x4, and x5 are removed.
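Conceptually, this diagnosis can be approximated with base R. The following sketch (not treatment_corr()'s actual implementation) computes Pearson correlations for the numeric columns of exam and Spearman correlations on the integer codes of its factor columns; treating factor codes as ranks is only a rough heuristic for nominal variables:

# Pearson correlation for the numeric variables
round(cor(exam[c("x1", "x2", "x3")], method = "pearson"), 2)

# Spearman correlation on the integer codes of the categorical variables
round(cor(sapply(exam[c("x4", "x5", "x6", "x7")], as.integer),
          method = "spearman"), 2)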
The Default dataset of the ISLR package is a simulated data set containing information on ten thousand customers. The aim here is to predict which customers will default on their credit card debt.
A data frame with 10000 observations on the following 4 variables:

- default : factor. A factor with levels No and Yes indicating whether the customer defaulted on their debt
- student : factor. A factor with levels No and Yes indicating whether the customer is a student
- balance : numeric. The average balance that the customer has remaining on their credit card after making their monthly payment
- income : numeric. Income of customer

# Credit Card Default Data
head(ISLR::Default)

# structure of dataset
str(ISLR::Default)

# summary of dataset
summary(ISLR::Default)
split_by()
splits the data.frame or tbl_df into a training set and a test set.
The split_df class object is created by split_by(); it contains the split information and the criteria used to separate the training and the test set.
library(alookr)
library(dplyr)

# Generate data for the example
sb <- ISLR::Default %>%
  split_by(default, seed = 6534)

sb
The attributes of the split_df
class are as follows:
attr_names <- names(attributes(sb))
attr_names

sb_attr <- attributes(sb)

# The third attribute, row.names, is excluded from the output because it is very long.
sb_attr[!attr_names %in% "row.names"]
summary()
summarizes the information of the two datasets split by split_by().
summary(sb)
Train data and test data should be similar. If the two datasets are not similar, the performance of the predictive model may be reduced.
alookr
provides a function to compare the similarity between train dataset and test dataset.
If the two data sets are not similar, the train dataset and test dataset should be split again from the original data.
compare_target_category()
compares the statistics of the categorical variables of the train set and test set included in the split_df class object.
sb %>%
  compare_target_category()

# compare variables that are character data types.
sb %>%
  compare_target_category(add_character = TRUE)

# display marginal
sb %>%
  compare_target_category(margin = TRUE)

# student variable only
sb %>%
  compare_target_category(student)

sb %>%
  compare_target_category(student, margin = TRUE)
compare_target_category() returns a tbl_df containing the comparison statistics for each level of each categorical variable.
compare_target_numeric()
compares the statistics of the numerical variables of the train set and test set included in the split_df class object.
sb %>%
  compare_target_numeric()

# balance variable only
sb %>%
  compare_target_numeric(balance)
compare_target_numeric() returns a tbl_df containing the comparison statistics for each numerical variable.
compare_plot()
plots the comparison information of the train set and test set included in the split_df class object.
# income variable only
sb %>%
  compare_plot("income")

# all variables
sb %>%
  compare_plot()
compare_diag()
diagnoses the similarity between the train set and test set included in the split_df class object.
defaults <- ISLR::Default
defaults$id <- seq(NROW(defaults))

set.seed(1)
defaults[sample(seq(NROW(defaults)), 3), "student"] <- NA
set.seed(2)
defaults[sample(seq(NROW(defaults)), 10), "balance"] <- NA

sb_2 <- defaults %>%
  split_by(default)

sb_2 %>%
  compare_diag()

sb_2 %>%
  compare_diag(add_character = TRUE)

sb_2 %>%
  compare_diag(uniq_thres = 0.0005)
If you compare the train set with the test set and find that the two datasets are similar, extract the data from the split_df object.
extract_set()
extracts the train set or test set from a split_df class object.
train <- sb %>%
  extract_set(set = "train")

test <- sb %>%
  extract_set(set = "test")

dim(train)
dim(test)
sampling_target()
When the ratio of the minority class to the majority class in the target variable is very small, the data is said to have an imbalanced class.
If the target variable is an imbalanced class, the characteristics of the majority class dominate the model. Such a model tends to mispredict the minority class as the majority class, so we have to make the train dataset a balanced class.
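For example, the target of the Default dataset used above is imbalanced, which a plain frequency table already shows:

# the minority class "Yes" accounts for only about 3% of the observations
table(ISLR::Default$default)
prop.table(table(ISLR::Default$default))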
sampling_target()
performs sampling on the train set of split_df to resolve the imbalanced class.
# under-sampling with random seed
under <- sb %>%
  sampling_target(seed = 1234L)

under %>%
  count(default)

# under-sampling with random seed, and minority class frequency is 40%
under40 <- sb %>%
  sampling_target(seed = 1234L, perc = 40)

under40 %>%
  count(default)

# over-sampling with random seed
over <- sb %>%
  sampling_target(method = "ubOver", seed = 1234L)

over %>%
  count(default)

# over-sampling with random seed, and k = 10
over10 <- sb %>%
  sampling_target(method = "ubOver", seed = 1234L, k = 10)

over10 %>%
  count(default)

# SMOTE with random seed
smote <- sb %>%
  sampling_target(method = "ubSMOTE", seed = 1234L)

smote %>%
  count(default)

# SMOTE with random seed, and perc.under = 250
smote250 <- sb %>%
  sampling_target(method = "ubSMOTE", seed = 1234L, perc.under = 250)

smote250 %>%
  count(default)
The method argument of sampling_target() specifies the sampling method: "ubUnder" is under-sampling, "ubOver" is over-sampling, and "ubSMOTE" is SMOTE (Synthetic Minority Over-sampling TEchnique).
BreastCancer of the mlbench package is a breast cancer dataset. The objective is to classify each observation as benign or malignant.
It is a data frame with 699 observations on 11 variables: one character variable, 9 ordered or nominal variables, and 1 target class:
- Id : character. Sample code number
- Cl.thickness : ordered factor. Clump Thickness
- Cell.size : ordered factor. Uniformity of Cell Size
- Cell.shape : ordered factor. Uniformity of Cell Shape
- Marg.adhesion : ordered factor. Marginal Adhesion
- Epith.c.size : ordered factor. Single Epithelial Cell Size
- Bare.nuclei : factor. Bare Nuclei
- Bl.cromatin : factor. Bland Chromatin
- Normal.nucleoli : factor. Normal Nucleoli
- Mitoses : factor. Mitoses
- Class : factor. Class. The levels are benign and malignant.

library(mlbench)
data(BreastCancer)

# class of each variables
sapply(BreastCancer, function(x) class(x)[1])
Perform data preprocessing as follows:
dlookr::imputate_na()
Find the variables that include missing values, and impute the missing values using imputate_na() in the dlookr package.
library(dlookr)
library(dplyr)

# variables that have a missing value
diagnose(BreastCancer) %>%
  filter(missing_count > 0)

# imputation of missing value
breastCancer <- BreastCancer %>%
  mutate(Bare.nuclei = imputate_na(BreastCancer, Bare.nuclei, Class,
                                   method = "mice", no_attrs = TRUE,
                                   print_flag = FALSE))
split_by()
split_by()
in the alookr package splits the dataset into a train set and a test set.
The ratio argument of the split_by()
function specifies the ratio of the train set.
split_by()
creates a class object named split_df.
library(alookr)

# split the data into a train set and a test set by default arguments
sb <- breastCancer %>%
  split_by(target = Class)

# show the class name
class(sb)

# split the data into a train set and a test set by ratio = 0.6
tmp <- breastCancer %>%
  split_by(Class, ratio = 0.6)
The summary()
function displays the following useful information about the split_df object:
# summary() displays some useful information
summary(sb)

# summary() displays some useful information
summary(tmp)
In the case of categorical variables, when a train set and a test set are separated, a specific level may be missing from the train set.
In this case, there is no problem when fitting the model, but an error occurs when predicting with the model you created. Therefore, preprocessing is performed to avoid levels that are missing from the train set.
In the following example, fortunately, there is no categorical variable that contains the missing levels in the train set.
# list of categorical variables in the train set that contain missing levels
nolevel_in_train <- sb %>%
  compare_target_category() %>%
  filter(is.na(train)) %>%
  select(variable) %>%
  unique() %>%
  pull

nolevel_in_train

# if any of the categorical variables in the train set contain a missing level,
# split again.
while (length(nolevel_in_train) > 0) {
  sb <- breastCancer %>%
    split_by(Class)

  nolevel_in_train <- sb %>%
    compare_target_category() %>%
    filter(is.na(train)) %>%
    select(variable) %>%
    unique() %>%
    pull
}
sampling_target()
Imbalanced class (level) data means that the frequency of one level of the target variable is relatively small. In general, the proportion of the positive class is relatively small. For example, in a model predicting spam, the spam class of interest is rarer than non-spam.
Imbalanced classes data is a common problem in machine learning classification.
table()
and prop.table()
are traditionally useful functions for diagnosing imbalanced classes data. However, alookr's summary()
is simpler and provides more information.
# train set frequency table - imbalanced classes data
table(sb$Class)

# train set relative frequency table - imbalanced classes data
prop.table(table(sb$Class))

# using summary function - imbalanced classes data
summary(sb)
Most machine learning algorithms work best when the number of samples in each class is about equal, and most algorithms are designed to maximize accuracy and reduce error. So we need to handle the imbalanced class problem.
sampling_target() performs sampling to solve an imbalanced classes data problem.
Oversampling can be defined as adding more copies of the minority class.
Oversampling is performed by specifying "ubOver" in the method argument of the sampling_target()
function.
# to balance by over-sampling
train_over <- sb %>%
  sampling_target(method = "ubOver")

# frequency table
table(train_over$Class)
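To make the idea concrete, here is a toy illustration of oversampling in base R (not the ubOver implementation): randomly chosen minority rows are duplicated until the class counts match.

# toy data: 8 majority and 2 minority observations
y <- c(rep("benign", 8), rep("malignant", 2))

# duplicate randomly chosen minority rows until both classes have 8 samples
set.seed(1)
extra <- sample(which(y == "malignant"), size = 6, replace = TRUE)
table(c(y, y[extra]))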
Undersampling can be defined as removing some observations of the majority class.
Undersampling is performed by specifying "ubUnder" in the method argument of the sampling_target()
function.
# to balance by under-sampling
train_under <- sb %>%
  sampling_target(method = "ubUnder")

# frequency table
table(train_under$Class)
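The corresponding toy illustration of undersampling (again base R, not the ubUnder implementation) randomly drops majority rows instead:

# toy data: 8 majority and 2 minority observations
y <- c(rep("benign", 8), rep("malignant", 2))

# keep only 2 randomly chosen majority rows
set.seed(1)
keep <- sample(which(y == "benign"), size = 2)
table(y[c(keep, which(y == "malignant"))])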
SMOTE (Synthetic Minority Oversampling Technique) uses a nearest neighbors algorithm to generate new, synthetic data.
SMOTE is performed by specifying "ubSMOTE" in the method argument of the sampling_target()
function.
# to balance by SMOTE
train_smote <- sb %>%
  sampling_target(seed = 1234L, method = "ubSMOTE")

# frequency table
table(train_smote$Class)
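The core of SMOTE can also be sketched in a few lines (an illustration of the interpolation idea only, not the ubSMOTE implementation): a synthetic observation is a random point on the segment between a minority sample and one of its nearest minority neighbors.

# a minority sample and its nearest minority neighbor (two numeric features)
p1 <- c(1.0, 2.0)
p2 <- c(1.4, 2.6)

# synthetic point: random interpolation between p1 and p2
set.seed(1)
lambda <- runif(1)
synthetic <- p1 + lambda * (p2 - p1)
synthetic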
cleanse()
The cleanse() function cleanses the dataset for classification modeling. It is useful when fitting the classification model and performs the operations described in the cleansing section above.
In this example, the cleanse() function removes the Id variable, which has a high unique rate.
# clean the training set
train <- train_smote %>%
  cleanse
extract_set()
# extract test set
test <- sb %>%
  extract_set(set = "test")
run_models()
run_models()
performs some representative binary classification modeling using the split_df object created by split_by().
run_models()
executes the fitting process in parallel. However, parallel processing is not supported on the MS-Windows operating system or in the RStudio environment.
Currently supported algorithms are as follows:
- logistic regression using the stats package
- recursive partitioning trees using the rpart package
- conditional inference trees using the party package
- random forest using the randomForest package
- a fast implementation of random forest using the ranger package

run_models() returns a model_df class object. Among the variables of the model_df class object is step; because the object is the result of calling run_models(), the value of step is "1.Fitted".

result <- train %>%
  run_models(target = "Class", positive = "malignant")
result
Evaluate the predictive performance of fitted models.
run_predict()
run_predict()
predicts the test set using the model_df class object fitted by run_models().
run_predict() executes predictions in parallel by model. However, parallel processing is not supported on the MS-Windows operating system or in the RStudio environment.
The returned model_df class object again contains the step variable; because the object is the result of calling run_predict(), the value of step is "2.Predicted".

pred <- result %>%
  run_predict(test)
pred
run_performance()
run_performance()
calculates the performance metrics of the model_df class object predicted by run_predict().
run_performance() computes the performance evaluation metrics in parallel. However, parallel processing is not supported on the MS-Windows operating system or in the RStudio environment.
The returned model_df class object again contains the step variable; because the object is the result of calling run_performance(), the value of step is "3.Performanced".

# Calculate performance metrics.
perf <- run_performance(pred)
perf
The performance variable contains a list object, which contains 15 performance metrics:
# Performance by analytics models
performance <- perf$performance
names(performance) <- perf$model_id
performance
If you change the list object to tidy format, you'll see the following at a glance:
# Convert to matrix to compare performance.
sapply(performance, "c")
compare_performance()
returns a list object containing the results of the model performance comparison.
In this example, compare_performance()
recommends the "ranger" model.
# Compare the performance metrics of each model
comp_perf <- compare_performance(pred)
comp_perf
plot_performance()
plots the ROC curve of each model.
# Plot ROC curve
plot_performance(pred)
In general, if the predicted probability is greater than 0.5 in a binary classification model, the observation is predicted as the positive class. In other words, 0.5 is used as the cut-off value.
This applies to most model algorithms. However, in some cases, the performance can be tuned by changing the cut-off value.
plot_cutoff()
visualizes a plot to select the cut-off value, and returns the cut-off value.
pred_best <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>%
  select(predicted) %>%
  pull %>%
  .[[1]] %>%
  attr("pred_prob")

cutoff <- plot_cutoff(pred_best, test$Class, "malignant", type = "mcc")
cutoff

cutoff2 <- plot_cutoff(pred_best, test$Class, "malignant", type = "density")
cutoff2

cutoff3 <- plot_cutoff(pred_best, test$Class, "malignant", type = "prob")
cutoff3
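To see what moving the cut-off changes, the following sketch (plain R, using pred_best from the chunk above) classifies the predicted probabilities manually at the default 0.5 and at the tuned cutoff value, and counts the predictions that flip:

# manual classification at the default and tuned cut-off values
manual_default <- ifelse(pred_best > 0.5, "malignant", "benign")
manual_tuned <- ifelse(pred_best > cutoff, "malignant", "benign")

# number of test observations whose predicted class changes
sum(manual_default != manual_tuned)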
performance_metric()
Compare the performance of the original prediction with that of the tuned cut-off.
Compare the tuned cut-off with the default cut-off for the best-performing model, comp_perf$recommend_model:
comp_perf$recommend_model

# extract predicted probability
idx <- which(pred$model_id == comp_perf$recommend_model)
pred_prob <- attr(pred$predicted[[idx]], "pred_prob")

# or, extract predicted probability using dplyr
pred_prob <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>%
  select(predicted) %>%
  pull %>%
  "[["(1) %>%
  attr("pred_prob")

# predicted probability
pred_prob

# compare Accuracy
performance_metric(pred_prob, test$Class, "malignant", "Accuracy")
performance_metric(pred_prob, test$Class, "malignant", "Accuracy",
                   cutoff = cutoff)

# compare Confusion Matrix
performance_metric(pred_prob, test$Class, "malignant", "ConfusionMatrix")
performance_metric(pred_prob, test$Class, "malignant", "ConfusionMatrix",
                   cutoff = cutoff)

# compare F1 Score
performance_metric(pred_prob, test$Class, "malignant", "F1_Score")
performance_metric(pred_prob, test$Class, "malignant", "F1_Score",
                   cutoff = cutoff)
performance_metric(pred_prob, test$Class, "malignant", "F1_Score",
                   cutoff = cutoff2)
If the performance of the tuned cut-off is good, use it as a cut-off to predict positives.
If you have selected a good model from several models, then perform the prediction with that model.
Create sample data for prediction by extracting 50 samples from the data set used in the previous under-sampling example.
data_pred <- train_under %>%
  cleanse

set.seed(1234L)
data_pred <- data_pred %>%
  nrow %>%
  seq %>%
  sample(size = 50) %>%
  data_pred[., ]
Do a prediction using the dplyr
package. The last factor()
function eliminates unnecessary information.
pred_actual <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>%
  run_predict(data_pred) %>%
  select(predicted) %>%
  pull %>%
  "[["(1) %>%
  factor()

pred_actual
If you want to predict by cut-off, specify the cutoff
argument in the run_predict()
function as follows:
In this example, there is no difference between the results with and without the cut-off.
pred_actual2 <- pred %>%
  filter(model_id == comp_perf$recommend_model) %>%
  run_predict(data_pred, cutoff) %>%
  select(predicted) %>%
  pull %>%
  "[["(1) %>%
  factor()

pred_actual2

sum(pred_actual != pred_actual2)