# <center> Example for Data Analysis <center> In liver: "Eating the Liver of Data Science"

```knitr::opts_chunk \$ set( collapse = TRUE, comment = " ", fig.width = 7, fig.height = 7, fig.align = "center" )
```
```library( liver )

library( pROC )

library( ggplot2 )
```

The `liver` package contains a collection of helper functions that make various techniques from data science more user-friendly for non-experts.

Here is an example to show how to use the functionality of the package by using the churn dataset which is available in the package.

```data( churn )

str( churn )
```

It shows that the 'churn' dataset as a `data.frame` has `r ncol( churn )` variables and `r nrow( churn )` observations.

# Partitioning the dataset

We partition the churn dataset randomly into two groups: train set (80%) and test set (20%). Here, we use the `partition` function from the liver package:

```set.seed( 5 )

data_sets = partition( data = churn, prob = c( 0.8, 0.2 ) )

train_set = data_sets \$ part1
test_set  = data_sets \$ part2

actual_test  = test_set \$ churn
```

# Classification by kNN algorithm

The churn dataset has `r ncol( churn ) - 1` predictors along with the target variable `churn`. Here we use the following predictors:

`account.length`, `voice.plan`, `voice.messages`, `intl.plan`, `intl.mins`, `day.mins`, `eve.mins`, `night.mins`, and `customer.calls`.

First, based on the above predictors, find the k-nearest neighbor for the test set, based on the training dataset, for the k = 8 as follows

```formula = churn ~ account.length + voice.plan + voice.messages + intl.plan + intl.mins +
day.mins + eve.mins + night.mins + customer.calls

predict_knn = kNN( formula, train = train_set, test = test_set, k = 8 )
```

To report Confusion Matrix:

```conf.mat( predict_knn, actual_test )

conf.mat.plot( predict_knn, actual_test )
```

To report Mean Squared Error (MSE):

```mse( predict_knn, actual_test )
```

# Classification by kNN algorithm with data transformation

The predictors that we used in the previous part, do not have the same scale. For example, variable `day.mins` change between `r min( churn \$ day.mins )` and `r max( churn \$ day.mins )`, whereas variable `voice.plan` is binary. In this case, the values of variable `day.mins` will overwhelm the contribution of `voice.plan`. To avoid this situation we use normalization. So, we use min-max normalization and transfer the predictors as follows:

```predict_knn_trans = kNN( formula, train = train_set, test = test_set, k = 8, transform = "minmax" )
```

To report Confusion Matrix:

```conf.mat.plot( predict_knn_trans, actual_test )

conf.mat.plot( predict_knn, actual_test )
```

To report the ROC curve, we need the probability of our classification prediction. We can have it by using:

```prob_knn = kNN( formula, train = train_set, test = test_set, k = 8, type = "prob" )[ , 1 ]

prob_knn_trans = kNN( formula, train = train_set, test = test_set, transform = "minmax", k = 8, type = "prob" )[ , 1 ]
```

To visualize the model performance between the raw data and the transformed data, we could report the ROC curve plot as well as AUC (Area Under the Curve) by using the `plot.roc` function from the pROC package:

```roc_knn = roc( actual_test, prob_knn )
roc_knn_trans = roc( actual_test, prob_knn_trans )

ggroc( list( roc_knn, roc_knn_trans ), size = 0.8 ) +
theme_minimal() + ggtitle( "ROC plots with AUC") +
scale_color_manual( values = c( "red", "blue" ),
labels = c( paste( "AUC=", round( auc( roc_knn ), 3 ), "; Raw data; " ),
paste( "AUC=", round( auc( roc_knn_trans ), 3 ), "; Transformed data" ) ) ) +
theme( legend.title = element_blank() ) +
theme( legend.position = c( .7, .3 ), text = element_text( size = 17 ) ) +
geom_segment( aes( x = 1, xend = 0, y = 0, yend = 1 ), color = "grey", linetype = "dashed" )
```

# Optimal value of k for the kNN algorithm

To find out the optimal value of `k` based on Error Rate, for the different values of k from 1 to 30, we run the k-nearest neighbor for the test set and compute the Error Rate for these models, by running `kNN.plot()` command

```kNN.plot( formula, train = train_set, test = test_set, transform = "minmax",
k.max = 30, set.seed = 3 )
```

The plot shows that the minimum value of Error Rate is for the case that k is 13; the smaller values of Error Rate indicates better predictions.

## Try the liver package in your browser

Any scripts or data that you put into this service are public.

liver documentation built on April 11, 2023, 6:04 p.m.