In rishi1226/classrish: Functions for reproducing analysis of classification

library(devtools)
library(classrish)
library(ggplot2)
library(randomForest)
#devtools::install_github("rishi1226/classrish")
library(classrish)

This Vignette closely follows the process provided in the competition.

KNN

First we load the data using the \code{classrish::PrepareData} function. Note that we only load the first 100 rows for faster execution

path <- "/home/rishabh/mres/ml_comp/data/"
data1 <- classrish::PrepareData(path = path, mode = 2, sample = TRUE, size = 100)

I use Knn as a baseline because Knn classifiers are susceptible to both outliers (as they use \emph{distance} to predict) and the \emph{the curse of dimensionality}. Since, the features are a mix of categorical and continuous, it is recommended that they be resized to a range between 0 and 1. First I identified the best k by training a 10 cross validated knn for values $1,3,...,15$. The error returned were as follows:

k.seq <- seq(1, 15, 2)
knn.result <- classrish::KNN1(data1, k.seq)
(knn.result$error)
(ggplot2::ggplot( knn.result$error, aes(x = k, y = error)) + geom_line() + geom_point())

Random Forest

The next classifier I chose to test is Random Forest. I choose this because it is essentially an averaging over many decision trees. This means that the features are invariant to monotonic transformations. Nevertheless, I fit the classifier on both raw and normalised data and see very similar results.

To train the classifier I did a grid search between the parameters \emph{mtry} {a function of number of remaining predictor variables to use as the mtry parameter in the randomForest call} and \emph{ntree} {the number of trees and selected the combination with the least amount of error}.

data2 <- classrish::PrepareData(path = path, mode = 0, sample = TRUE, size = 100)
ntree.vec <- seq(50, 100, 10)
rf1.result <- classrish::RF1(data2, ntree.vec)

(rf1.result$error)

I trained the classifier with normalised data and got similar results as follows:

data3 <- classrish::PrepareData(path = path, mode = 1, sample = TRUE, size = 100)
ntree.vec <- seq(50, 100, 10)
rf2.result <- classrish::RF1(data3, ntree.vec)

(rf2.result$error)

Bagging

For bagging, I used a classification tree as the weak classifier. I trained the classifier by computing cross validated error across different values of \emph{mfinal}- the number of iterations. Below is a figure of errors.

I will limit the data to 50 points as crossvalidation and grid search takes about 4 hours on an powerful AWS server with the full data.

data4 <- classrish::PrepareData(path = path, mode = 1, sample = TRUE, size = 50)
mfinal.seq <- seq(20, 30, 2)
bagging.result <- classrish::Adabag(data1, mfinal.seq)

(bagging.result$error)
(ggplot2::ggplot( bagging.result$error, aes(x = iter, y = error)) + geom_line() + geom_point())