knitr::opts_chunk $ set( collapse = TRUE, comment = " ", fig.width = 7, fig.height = 7, fig.align = "center" )
library( liver )
library( pROC )
library( ggplot2 )
The liver package provides a collection of helper functions that make various data science techniques more accessible to non-experts.
This example demonstrates the package's functionality using the churn dataset, which is included in the package.
data( churn )
str( churn )
The output shows that the churn dataset is a data frame with `r ncol( churn )` variables and `r nrow( churn )` observations.
We randomly partition the churn dataset into two groups, a training set (80%) and a test set (20%), using the partition function from the liver package:
set.seed( 5 )

data_sets = partition( data = churn, prob = c( 0.8, 0.2 ) )

train_set = data_sets $ part1
test_set  = data_sets $ part2

actual_test = test_set $ churn
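To make the idea of a random 80/20 split concrete, here is a minimal base-R sketch. It is not the implementation of partition; the helper name split_80_20 and the toy data frame are assumptions introduced only for illustration.

```r
# Hypothetical helper (not part of liver): random train/test row split in base R.
split_80_20 <- function(data, prob_train = 0.8) {
  n         <- nrow(data)
  n_train   <- round(prob_train * n)
  train_idx <- sample(seq_len(n), size = n_train)   # rows drawn without replacement
  list(train = data[train_idx, , drop = FALSE],
       test  = data[-train_idx, , drop = FALSE])
}

set.seed(5)
toy   <- data.frame(id = 1:100, x = rnorm(100))     # toy data, not the churn dataset
parts <- split_80_20(toy)

nrow(parts$train)   # 80 rows
nrow(parts$test)    # 20 rows
```

Because the test rows are selected by negative indexing, the two parts never share a row.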
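The churn dataset has `r ncol( churn ) - 1` predictors along with the target variable churn. Here we use the following predictors: account.length, voice.plan, voice.messages, intl.plan, intl.mins, day.mins, eve.mins, night.mins, and customer.calls.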
First, using the above predictors, we apply the k-nearest neighbor algorithm to classify the test set based on the training set, with k = 8:
formula = churn ~ account.length + voice.plan + voice.messages + intl.plan + intl.mins +
                  day.mins + eve.mins + night.mins + customer.calls

predict_knn = kNN( formula, train = train_set, test = test_set, k = 8 )
To report the confusion matrix:
conf.mat( predict_knn, actual_test )

conf.mat.plot( predict_knn, actual_test )
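For readers unfamiliar with confusion matrices, the following base-R sketch builds one with table() and derives the accuracy from it. The labels are toy values chosen for illustration, not results from the churn data, and the row/column layout shown here is an assumption of this sketch rather than a statement about how conf.mat arranges its output.

```r
# Toy labels for illustration only; these are not results from the churn data.
actual    <- factor(c("yes", "no", "no",  "yes", "no", "no"), levels = c("yes", "no"))
predicted <- factor(c("yes", "no", "yes", "no",  "no", "no"), levels = c("yes", "no"))

# In this sketch, rows are predicted classes and columns are actual classes.
cm <- table(Predicted = predicted, Actual = actual)
print(cm)

# Accuracy is the share of cases on the diagonal (correct predictions).
accuracy <- sum(diag(cm)) / sum(cm)
accuracy
```

Here four of the six toy cases fall on the diagonal, so the accuracy is 4/6.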
To report the mean squared error (MSE):
mse( predict_knn, actual_test )
The predictors used in the previous part do not share the same scale. For example, the variable day.mins ranges between `r min( churn $ day.mins )` and `r max( churn $ day.mins )`, whereas the variable voice.plan is binary. In this case, the values of day.mins will overwhelm the contribution of voice.plan when distances between observations are computed. To avoid this, we apply min-max normalization to the predictors as follows:
predict_knn_trans = kNN( formula, train = train_set, test = test_set, k = 8, transform = "minmax" )
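The effect of scale on distance can be seen in a small base-R sketch. The helper minmax and the toy values below are assumptions made for illustration; they are not taken from the churn dataset, and this is not necessarily how the transform = "minmax" option is implemented internally.

```r
# Hypothetical helper: rescale a numeric vector to the interval [0, 1].
# Note: a constant vector would lead to division by zero.
minmax <- function(x) (x - min(x)) / (max(x) - min(x))

# Toy values for illustration (not taken from the churn dataset).
day_mins   <- c(0, 120, 180, 350)
voice_plan <- c(0, 1, 0, 1)          # binary indicator

# Euclidean distance between customers 2 and 3 before scaling:
raw_dist <- sqrt((day_mins[2] - day_mins[3])^2 + (voice_plan[2] - voice_plan[3])^2)

# ... and after min-max scaling both predictors:
s_day   <- minmax(day_mins)
s_voice <- minmax(voice_plan)
scaled_dist <- sqrt((s_day[2] - s_day[3])^2 + (s_voice[2] - s_voice[3])^2)

raw_dist     # dominated by the 60-minute gap in day_mins
scaled_dist  # now driven mostly by the difference in voice_plan
```

Before scaling, the binary predictor contributes almost nothing to the distance; after scaling, both predictors lie in [0, 1] and contribute on comparable terms.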
To compare the confusion matrices of the models with and without the transformation:
conf.mat.plot( predict_knn_trans, actual_test )

conf.mat.plot( predict_knn, actual_test )
To report the ROC curve, we need the predicted class probabilities rather than the predicted classes. We obtain them by setting type = "prob":
prob_knn = kNN( formula, train = train_set, test = test_set, k = 8, type = "prob" )[ , 1 ]

prob_knn_trans = kNN( formula, train = train_set, test = test_set, transform = "minmax", k = 8, type = "prob" )[ , 1 ]
To compare model performance on the raw data and the transformed data, we plot the ROC curves and report the AUC (Area Under the Curve) using the roc, ggroc, and auc functions from the pROC package:
roc_knn = roc( actual_test, prob_knn )
roc_knn_trans = roc( actual_test, prob_knn_trans )

ggroc( list( roc_knn, roc_knn_trans ), size = 0.8 ) +
    theme_minimal() + ggtitle( "ROC plots with AUC" ) +
    scale_color_manual( values = c( "red", "blue" ),
        labels = c( paste( "AUC=", round( auc( roc_knn ), 3 ), "; Raw data; " ),
                    paste( "AUC=", round( auc( roc_knn_trans ), 3 ), "; Transformed data" ) ) ) +
    theme( legend.title = element_blank() ) +
    theme( legend.position = c( .7, .3 ), text = element_text( size = 17 ) ) +
    geom_segment( aes( x = 1, xend = 0, y = 0, yend = 1 ), color = "grey", linetype = "dashed" )
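As a way to build intuition for the AUC values in the legend, note that the AUC equals the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The base-R sketch below computes it from ranks (the Mann-Whitney statistic). The helper auc_rank and the toy scores are assumptions for illustration; this is not how the pROC package is implemented, although the value should agree with the area under the empirical ROC curve for the same data.

```r
# Hypothetical helper: AUC via the Mann-Whitney rank statistic.
# 'scores' are predicted probabilities of the positive class;
# 'is_positive' is a logical vector marking the actual positives.
auc_rank <- function(scores, is_positive) {
  r     <- rank(scores)                 # average ranks handle ties
  n_pos <- sum(is_positive)
  n_neg <- sum(!is_positive)
  (sum(r[is_positive]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy example, not churn results.
scores      <- c(0.9, 0.8, 0.35, 0.3, 0.2, 0.1)
is_positive <- c(TRUE, TRUE, FALSE, TRUE, FALSE, FALSE)

auc_rank(scores, is_positive)   # 8 of the 9 positive/negative pairs are ordered correctly
```

In the toy data only one pair is misordered (the positive scored 0.3 against the negative scored 0.35), so the result is 8/9.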
To find the optimal value of k based on the error rate, we run the k-nearest neighbor algorithm on the test set for each value of k from 1 to 30 and compute the error rate of each model:
kNN.plot( formula, train = train_set, test = test_set, transform = "minmax", k.max = 30, set.seed = 3 )
The plot shows that the error rate reaches its minimum at k = 13; smaller error rates indicate better predictions.
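The search over k can be sketched in base R as a simple loop. The example below is only an illustration under stated assumptions: it uses the built-in iris data instead of churn, and knn_predict is a hypothetical majority-vote classifier written for this sketch, not the kNN or kNN.plot functions from liver.

```r
# Hypothetical base-R kNN classifier (majority vote, Euclidean distance).
knn_predict <- function(train_x, train_y, test_x, k) {
  apply(test_x, 1, function(row) {
    d       <- sqrt(colSums((t(train_x) - row)^2))   # distance to every training row
    nearest <- order(d)[seq_len(k)]                   # indices of the k closest rows
    names(which.max(table(train_y[nearest])))         # most frequent class among them
  })
}

# Built-in iris data stands in for churn here (an assumption of this sketch).
set.seed(3)
idx     <- sample(nrow(iris), 120)
train_x <- as.matrix(iris[idx, 1:4])
test_x  <- as.matrix(iris[-idx, 1:4])
train_y <- iris$Species[idx]
test_y  <- as.character(iris$Species[-idx])

# Error rate on the test set for each k from 1 to 30.
k_values   <- 1:30
error_rate <- sapply(k_values, function(k) {
  mean(knn_predict(train_x, train_y, test_x, k) != test_y)
})

best_k <- k_values[which.min(error_rate)]   # first k with the lowest error rate
best_k
```

When several values of k share the lowest error rate, which.min returns the smallest of them; ties in the vote are resolved in favor of the first class level.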