Example for Data Analysis

knitr::opts_chunk$set(echo = TRUE)
library( liver )
library( pROC )     
library( ggplot2 )  

The liver package contains a collection of helper functions that make various techniques from data science more user-friendly for non-experts.

Here is an example showing how to use the functionality of the package with the churn dataset, which is included in the package.

data( churn )        # load the 'churn' dataset

str( churn )

The output of str() shows that churn is a data.frame; ncol( churn ) gives its number of variables and nrow( churn ) its number of observations.

Partitioning the dataset

We randomly partition the churn dataset into two groups: a training set (80%) and a test set (20%). Here, we use the partition function from the liver package:

set.seed( 5 )

data_sets = partition( data = churn, prob = c( 0.8, 0.2 ) )

train_set = data_sets $ part1
test_set  = data_sets $ part2

actual_test  = test_set $ churn
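
As a rough illustration of what such a random split does, the base R sketch below draws 80% of the row indices with sample() and keeps the remaining rows as the test set. It runs on a toy data frame (toy_data and its columns are made up for illustration) and is not necessarily how partition is implemented in liver.

```r
# Toy data frame standing in for a real dataset (hypothetical values)
set.seed( 5 )
toy_data <- data.frame( x = rnorm( 50 ), y = rep( c( "yes", "no" ), 25 ) )

# Randomly pick 80% of the row indices for the training set
n         <- nrow( toy_data )
train_idx <- sample( n, size = round( 0.8 * n ) )

toy_train <- toy_data[  train_idx, ]   # 80% of the rows
toy_test  <- toy_data[ -train_idx, ]   # the remaining 20%

# Number of rows in each part
c( train = nrow( toy_train ), test = nrow( toy_test ) )
```

Fixing the seed with set.seed() before sampling makes the split reproducible, which is why the example above also sets a seed before calling partition.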

Classification by kNN algorithm

Besides the target variable churn, the dataset has ncol( churn ) - 1 predictors. Here we use the following subset of them:

account.length, voice.plan, voice.messages, intl.plan, intl.mins, day.mins, eve.mins, night.mins, and customer.calls.

First, using the predictors above, we classify the observations in the test set by their k nearest neighbors in the training set, with k = 8, as follows:

formula = churn ~ account.length + voice.plan + voice.messages + intl.plan + intl.mins + 
                  day.mins + eve.mins + night.mins + customer.calls

predict_knn = kNN( formula, train = train_set, test = test_set, k = 8 )

To report the confusion matrix:

conf.mat( predict_knn, actual_test )

conf.mat.plot( predict_knn, actual_test )
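
For readers who want to see what a confusion matrix contains, here is a minimal base R sketch that cross-tabulates predicted and actual labels with table(). The label vectors are hypothetical, and the row/column layout chosen here is an assumption that may differ from the layout conf.mat prints.

```r
# Hypothetical predicted and actual labels for six observations
predicted <- factor( c( "yes", "no", "no", "yes", "no", "no" ), levels = c( "yes", "no" ) )
actual    <- factor( c( "yes", "no", "yes", "yes", "no", "no" ), levels = c( "yes", "no" ) )

# Rows are predictions, columns are actual classes
conf_table <- table( Predict = predicted, Actual = actual )
conf_table

# Accuracy is the share of observations on the diagonal
accuracy <- sum( diag( conf_table ) ) / sum( conf_table )
accuracy
```

The off-diagonal cells count the two kinds of misclassification separately, which a single accuracy figure hides.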

To report the Mean Squared Error (MSE):

mse( predict_knn, actual_test )

Classification by kNN algorithm with data transformation

The predictors used in the previous part are not on the same scale. For example, the variable day.mins ranges between min( churn $ day.mins ) and max( churn $ day.mins ), whereas the variable voice.plan is binary. In this case, the values of day.mins will overwhelm the contribution of voice.plan in the distance computation. To avoid this, we apply min-max normalization to the predictors as follows:

predict_knn_trans = kNN( formula, train = train_set, test = test_set, k = 8, transform = "minmax" )
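
To make the transformation concrete, the sketch below applies min-max normalization to a small numeric vector in base R. The helper minmax and the example values are hypothetical, and the sketch does not claim to reproduce how kNN applies transform = "minmax" internally (for instance, how the test set is rescaled).

```r
# Min-max normalization: rescale a numeric vector to the range [0, 1]
minmax <- function( x ) {
  rng <- range( x, na.rm = TRUE )
  ( x - rng[ 1 ] ) / ( rng[ 2 ] - rng[ 1 ] )
}

minutes <- c( 0, 120, 180, 360 )   # hypothetical call minutes
minmax( minutes )
```

After rescaling, every predictor lies between 0 and 1, so no single variable dominates the distance purely because of its units. Note that this simple helper returns NaN for a constant vector, because the range is then zero.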

To compare the confusion matrices of the model with transformed data and the model with raw data:

conf.mat.plot( predict_knn_trans, actual_test )
conf.mat.plot( predict_knn, actual_test )

To report the ROC curve, we need the predicted class probabilities rather than the predicted classes. We obtain them by setting type = "prob":

prob_knn = kNN( formula, train = train_set, test = test_set, k = 8, type = "prob" )[ , 1 ]

prob_knn_trans = kNN( formula, train = train_set, test = test_set, transform = "minmax", k = 8, type = "prob" )[ , 1 ]

To compare the model performance on the raw data and on the transformed data, we plot the ROC curves and report the AUC (Area Under the Curve) by using the roc, auc, and ggroc functions from the pROC package:

roc_knn = roc( actual_test, prob_knn )
roc_knn_trans = roc( actual_test, prob_knn_trans )

ggroc( list( roc_knn, roc_knn_trans ), size = 0.8 ) +
  theme_minimal() + ggtitle( "ROC plots with AUC" ) +
  scale_color_manual( values = c( "red", "blue" ),
    labels = c( paste( "AUC=", round( auc( roc_knn ), 3 ), "; Raw data; " ),
                paste( "AUC=", round( auc( roc_knn_trans ), 3 ), "; Transformed data" ) ) ) +
  theme( legend.title = element_blank() ) +
  theme( legend.position = c( .7, .3 ), text = element_text( size = 17 ) ) +
  geom_segment( aes( x = 1, xend = 0, y = 0, yend = 1 ), color = "grey", linetype = "dashed" )
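
The AUC can be read as the probability that a randomly chosen positive case receives a higher score than a randomly chosen negative case. The base R sketch below computes it with the rank-sum formula on hypothetical labels and scores. The helper auc_rank is made up for illustration and assumes that higher scores point to the positive class, whereas pROC by default chooses the direction of the comparison automatically.

```r
# AUC via the rank-sum formula
# is_positive: logical vector of true classes; score: numeric predictions
auc_rank <- function( is_positive, score ) {
  r     <- rank( score )                 # tied scores get average ranks
  n_pos <- sum( is_positive )
  n_neg <- sum( !is_positive )
  ( sum( r[ is_positive ] ) - n_pos * ( n_pos + 1 ) / 2 ) / ( n_pos * n_neg )
}

# Hypothetical labels and scores for four observations
is_positive <- c( FALSE, FALSE, TRUE, TRUE )
score       <- c( 0.10, 0.40, 0.35, 0.80 )
auc_rank( is_positive, score )
```

In this toy case three of the four positive-negative pairs are ordered correctly, so the result is 0.75; a value of 0.5 corresponds to the dashed diagonal in the ROC plot.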

Optimal value of k for the kNN algorithm

To find the optimal value of k based on the error rate, we run the k-nearest neighbor algorithm for each value of k from 1 to 30 and compute the error rate on the test set for each model, using the kNN.plot() command:

kNN.plot( formula, train = train_set, test = test_set, transform = "minmax", 
          k.max = 30, set.seed = 3 )

The plot shows that the error rate is lowest for k = 13; smaller values of the error rate indicate better predictions.
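
The same kind of search can be sketched by hand: fit one model per value of k, compute the share of misclassified test observations, and pick the k with the smallest error. The example below is only an illustration under stated assumptions: it uses knn() from the class package (assumed to be installed) instead of liver, and synthetic data instead of churn, so its result says nothing about the churn dataset.

```r
library( class )   # assumption: the 'class' package is installed

# Synthetic two-predictor classification data (not the churn dataset)
set.seed( 3 )
n <- 200
x <- matrix( rnorm( 2 * n ), ncol = 2 )
y <- factor( ifelse( x[ , 1 ] + x[ , 2 ] + rnorm( n, sd = 0.5 ) > 0, "yes", "no" ) )

# 80% of the rows for training, the rest for testing
train_idx <- sample( n, size = 0.8 * n )

# Error rate on the test set for k = 1, ..., 30
error_rate <- sapply( 1:30, function( k ) {
  pred <- knn( train = x[ train_idx, ], test = x[ -train_idx, ],
               cl = y[ train_idx ], k = k )
  mean( pred != y[ -train_idx ] )   # share of misclassified observations
} )

best_k <- which.min( error_rate )   # first k with the smallest error rate
best_k
```

Because the error rate is estimated on a single random split, the selected k can change with the seed, which is one reason kNN.plot exposes a set.seed argument.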



liver documentation built on Oct. 27, 2021, 5:06 p.m.