library(devtools)
# devtools::install_github("rishi1226/classrish")
library(classrish)
library(ggplot2)
library(randomForest)

This vignette closely follows the process provided in the competition.

First, we load the data using the \code{classrish::PrepareData} function. Note that we only load the first 100 rows for faster execution.

path <- "/home/rishabh/mres/ml_comp/data/"
data1 <- classrish::PrepareData(path = path, mode = 2, sample = TRUE, size = 100)

I use kNN as a baseline because kNN classifiers are susceptible to both outliers (as they use \emph{distance} to predict) and \emph{the curse of dimensionality}. Since the features are a mix of categorical and continuous, it is recommended that they be rescaled to a range between 0 and 1. First, I identified the best k by training a 10-fold cross-validated kNN classifier for the values $1,3,...,15$. The errors returned were as follows:

k.seq <- seq(1, 15, 2)
knn.result <- classrish::KNN1(data1, k.seq)
(knn.result$error)
(ggplot2::ggplot(knn.result$error, aes(x = k, y = error)) +
  geom_line() +
  geom_point())
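The rescaling and cross-validated search over k described above can be sketched in a few lines. This is only an illustration under stated assumptions, not the implementation inside \code{classrish::KNN1}: it uses the built-in \code{iris} data instead of the competition data, \code{class::knn} as the classifier, and two hypothetical helpers, \code{rescale01} and \code{cv_knn_error}, that are not part of the package.

```r
library(class)  # provides knn(); ships with standard R installations

# Hypothetical helper: rescale a numeric vector to [0, 1] (min-max).
rescale01 <- function(x) {
  rng <- range(x)
  if (rng[1] == rng[2]) return(rep(0, length(x)))  # constant column
  (x - rng[1]) / (rng[2] - rng[1])
}

# Hypothetical helper: cross-validated misclassification rate for one k.
cv_knn_error <- function(x, y, k, folds = 10) {
  # Randomly assign every row to one of `folds` roughly equal groups.
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(x)))
  errs <- vapply(seq_len(folds), function(f) {
    test <- fold_id == f
    pred <- knn(train = x[!test, , drop = FALSE],
                test  = x[test, , drop = FALSE],
                cl    = y[!test], k = k)
    mean(as.character(pred) != as.character(y[test]))
  }, numeric(1))
  mean(errs)
}

set.seed(1)
x <- as.data.frame(lapply(iris[, 1:4], rescale01))  # stand-in features
y <- iris$Species                                   # stand-in labels
k_seq <- seq(1, 15, 2)
cv_errors <- data.frame(
  k = k_seq,
  error = vapply(k_seq, function(k) cv_knn_error(x, y, k), numeric(1))
)
cv_errors
```

For brevity the sketch rescales the full data set once before splitting; a stricter version would compute the minimum and maximum on each training fold only.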

The next classifier I chose to test is Random Forest. I chose it because it is essentially an average over many decision trees. Since each tree splits on feature thresholds, the classifier is invariant to monotonic transformations of the features. Nevertheless, I fit the classifier on both raw and normalised data and saw very similar results.
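The invariance claim can be checked on a single split. The sketch below is a simplified illustration, not code from \code{randomForest}: the hypothetical helper \code{best_split_left} picks the threshold with the lowest weighted Gini impurity on one feature, and the check compares the resulting partition for a raw feature and for its logarithm (a strictly increasing transformation). The \code{iris} data is a stand-in.

```r
# Gini impurity of a vector of class labels.
gini <- function(y) {
  p <- table(y) / length(y)
  1 - sum(p^2)
}

# Hypothetical helper: find the best threshold on one feature and
# return a logical vector marking the rows sent to the left child.
best_split_left <- function(x, y) {
  vals <- sort(unique(x))
  cuts <- (head(vals, -1) + tail(vals, -1)) / 2  # midpoints between values
  score <- vapply(cuts, function(cut) {
    left <- x <= cut
    (sum(left) * gini(y[left]) + sum(!left) * gini(y[!left])) / length(y)
  }, numeric(1))
  x <= cuts[which.min(score)]
}

x <- iris$Petal.Length
y <- iris$Species
left_raw <- best_split_left(x, y)
left_log <- best_split_left(log(x), y)  # same ordering, different scale

# The chosen threshold differs, but the rows on each side do not.
identical(left_raw, left_log)
```

Because only the ordering of the values matters, the two partitions coincide, which is why normalising the features is not expected to change the fitted trees much.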

To train the classifier I ran a grid search over the parameters \emph{mtry} (a function of the number of remaining predictor variables, used as the mtry parameter in the randomForest call) and \emph{ntree} (the number of trees), and selected the combination with the lowest error.

data2 <- classrish::PrepareData(path = path, mode = 0, sample = TRUE, size = 100)
ntree.vec <- seq(50, 100, 10)
rf1.result <- classrish::RF1(data2, ntree.vec)

```
(rf1.result$error)
```
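A grid search over \emph{mtry} and \emph{ntree} might look like the following minimal sketch. It is not the code inside \code{classrish::RF1}: it assumes the \code{iris} data as a stand-in, tries every \emph{mtry} from 1 to 4, and scores each combination by the forest's out-of-bag error rather than by a separate cross-validation.

```r
library(randomForest)

set.seed(1)
# Every combination of mtry (variables tried per split) and ntree.
grid <- expand.grid(mtry = 1:4, ntree = seq(50, 100, 10))

# Fit one forest per combination and record its final out-of-bag error.
grid$error <- mapply(function(m, n) {
  fit <- randomForest(x = iris[, 1:4], y = iris$Species, mtry = m, ntree = n)
  fit$err.rate[n, "OOB"]  # row n holds the error after all n trees
}, grid$mtry, grid$ntree)

# Combination with the lowest error (first one in case of ties).
best <- grid[which.min(grid$error), ]
best
```

The out-of-bag estimate is used here only because it comes for free with each fit; swapping in a k-fold loop would change the scoring step but not the structure of the search.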

I trained the classifier with normalised data and got similar results as follows:

data3 <- classrish::PrepareData(path = path, mode = 1, sample = TRUE, size = 100)
ntree.vec <- seq(50, 100, 10)
rf2.result <- classrish::RF1(data3, ntree.vec)

```
(rf2.result$error)
```

For bagging, I used a classification tree as the weak classifier. I trained the classifier by computing the cross-validated error across different values of \emph{mfinal}, the number of iterations. Below is a figure of the errors.

I will limit the data to 50 points, as cross-validation and grid search take about 4 hours on a powerful AWS server with the full data.

data4 <- classrish::PrepareData(path = path, mode = 1, sample = TRUE, size = 50)
mfinal.seq <- seq(20, 30, 2)
# Use the 50-row sample loaded above (the original call passed data1).
bagging.result <- classrish::Adabag(data4, mfinal.seq)
(bagging.result$error)
(ggplot2::ggplot(bagging.result$error, aes(x = iter, y = error)) +
  geom_line() +
  geom_point())
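To make the bagging procedure concrete, here is a hand-rolled sketch that fits classification trees on bootstrap resamples, predicts by majority vote, and estimates the cross-validated error for each value of \emph{mfinal}. It is an illustration only and not the code behind \code{classrish::Adabag}: it assumes \code{rpart} for the trees, the \code{iris} data as a stand-in, 5 folds, and two hypothetical helpers, \code{bag_predict} and \code{cv_bag_error}.

```r
library(rpart)  # classification trees; ships with standard R installations

# Hypothetical helper: fit `mfinal` trees on bootstrap resamples of
# `train` and predict `test` by majority vote.
bag_predict <- function(train, test, mfinal) {
  votes <- vapply(seq_len(mfinal), function(i) {
    boot <- train[sample(nrow(train), replace = TRUE), ]
    fit <- rpart(Species ~ ., data = boot, method = "class")
    as.character(predict(fit, newdata = test, type = "class"))
  }, character(nrow(test)))
  votes <- matrix(votes, nrow = nrow(test))  # rows = cases, columns = trees
  apply(votes, 1, function(v) names(which.max(table(v))))
}

# Hypothetical helper: cross-validated error for one value of mfinal.
cv_bag_error <- function(data, mfinal, folds = 5) {
  fold_id <- sample(rep(seq_len(folds), length.out = nrow(data)))
  mean(vapply(seq_len(folds), function(f) {
    test <- data[fold_id == f, ]
    pred <- bag_predict(data[fold_id != f, ], test, mfinal)
    mean(pred != as.character(test$Species))
  }, numeric(1)))
}

set.seed(1)
mfinal_seq <- seq(20, 30, 2)
bag_errors <- data.frame(
  iter = mfinal_seq,
  error = vapply(mfinal_seq, function(m) cv_bag_error(iris, m), numeric(1))
)
bag_errors
```

The resulting data frame has the same two columns, \code{iter} and \code{error}, that the plotting call above expects, so it could be passed to the same \code{ggplot} code.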
