library(devtools) library(classrish) library(ggplot2) library(randomForest) #devtools::install_github("rishi1226/classrish") library(classrish)
This Vignette closely follows the process provided in the competition.
First we load the data using the \code{classrish::PrepareData} function. Note that we only load the first 100 rows for faster execution
path <- "/home/rishabh/mres/ml_comp/data/" data1 <- classrish::PrepareData(path = path, mode = 2, sample = TRUE, size = 100)
I use Knn as a baseline because Knn classifiers are susceptible to both outliers (as they use \emph{distance} to predict) and the \emph{the curse of dimensionality}. Since, the features are a mix of categorical and continuous, it is recommended that they be resized to a range between 0 and 1. First I identified the best k by training a 10 cross validated knn for values $1,3,...,15$. The error returned were as follows:
k.seq <- seq(1, 15, 2) knn.result <- classrish::KNN1(data1, k.seq) (knn.result$error) (ggplot2::ggplot( knn.result$error, aes(x = k, y = error)) + geom_line() + geom_point())
The next classifier I chose to test is Random Forest. I choose this because it is essentially an averaging over many decision trees. This means that the features are invariant to monotonic transformations. Nevertheless, I fit the classifier on both raw and normalised data and see very similar results.
To train the classifier I did a grid search between the parameters \emph{mtry} {a function of number of remaining predictor variables to use as the mtry parameter in the randomForest call} and \emph{ntree} {the number of trees and selected the combination with the least amount of error}.
data2 <- classrish::PrepareData(path = path, mode = 0, sample = TRUE, size = 100) ntree.vec <- seq(50, 100, 10) rf1.result <- classrish::RF1(data2, ntree.vec)
(rf1.result$error)
I trained the classifier with normalised data and got similar results as follows:
data3 <- classrish::PrepareData(path = path, mode = 1, sample = TRUE, size = 100) ntree.vec <- seq(50, 100, 10) rf2.result <- classrish::RF1(data3, ntree.vec)
(rf2.result$error)
For bagging, I used a classification tree as the weak classifier. I trained the classifier by computing cross validated error across different values of \emph{mfinal}- the number of iterations. Below is a figure of errors.
I will limit the data to 50 points as crossvalidation and grid search takes about 4 hours on an powerful AWS server with the full data.
data4 <- classrish::PrepareData(path = path, mode = 1, sample = TRUE, size = 50) mfinal.seq <- seq(20, 30, 2) bagging.result <- classrish::Adabag(data1, mfinal.seq) (bagging.result$error) (ggplot2::ggplot( bagging.result$error, aes(x = iter, y = error)) + geom_line() + geom_point())
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.