knitr::opts_chunk$set(echo=TRUE, fig.align = "center", cache=TRUE)
library(ggplot2)
library(dplyr)
library(fs)
library(purrr)
library(lhc)

Analyse Experiments

As mentioned previously, we ran a series of experiments fitting a logistic regression model with varying model parameters. The experiments were run in a separate script, RunModels.R, and their output is saved as a .csv file in the output folder of the git repository.

Import results

First, import the results of our experiments for analysis.

filepath <- path_join(c(dirname(getwd()), "output", "results_6.csv"))
exp_data <- read.csv(filepath) %>% select(-X, -C, -K)
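
An optional quick look at the imported results helps confirm that the expected columns are present; glimpse() comes from dplyr, which is already loaded.

# optional: inspect column names and types of the imported results
glimpse(exp_data)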

Experiments were run over all combinations of the following parameters:

#the number of rbf features
(n_rbfs <- unique(exp_data$n_rbf))

#the l2 regularisation parameter
(lambdas <- unique(exp_data$lambda))

#and the degree of polynomial transform (with no interaction terms)
(poly_orders <- unique(exp_data$poly))

Since we want a model with high performance but low variance, we measured the median absolute deviation (MAD) of the AMS and AUC metrics over the k folds. We can then consider maximising the ratio mean(AMS)/MAD(AMS).

# scaling metrics by MAD(metric) to get a handle on variation
exp_data$scaled.auc <- exp_data[,"auc"]/exp_data[,"mad.auc."]
exp_data$scaled.ams <- exp_data[,"ams"]/exp_data[,"mad.ams."]
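
The mad.ams. and mad.auc. columns themselves were produced in RunModels.R. As a rough sketch of how that per-fold aggregation could be done (fold_results here is a hypothetical data frame with one row per fold, holding G, n_rbf, lambda, poly, ams and auc; the actual computation lives in RunModels.R):

# sketch only: collapse hypothetical per-fold metrics to fold means and MADs
fold_summary <- fold_results %>%
  group_by(G, n_rbf, lambda, poly) %>%
  summarise(mad.ams. = mad(ams),
            mad.auc. = mad(auc),
            ams = mean(ams),
            auc = mean(auc))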

Plot metrics

We want to analyse our results in order to choose the best model from the set of models generated by our grid search over the parameter space. To understand how the performance of these models varies across this space, we visualise it in a series of box plots, which show our primary performance metric (AMS) against the parameters we searched over ($\lambda, b, n_{rbf}$).

#convert columns to factors for boxplots
exp_data$lambda <- signif(exp_data$lambda,3)
cols <- c("G", "lambda", "poly", "n_rbf")
exp_data[cols] <- lapply(exp_data[,cols], factor) 

(p1 <- exp_data %>%
  ggplot(aes(x=n_rbf, y=ams, colour=G)) +
    geom_boxplot() +
    theme_minimal() +
    labs(x="Number of RBF features", y="AMS",
         title="AMS of experiments with varying RBF features",
         colour="Jet Group"))

(p2 <- exp_data %>%
  ggplot(aes(x=poly, y=ams, colour=G)) +
    geom_boxplot() + 
    theme_minimal() +
    labs(x="Order of polynomial transform", y="AMS",
         title="AMS of experiments with varying polynomal transform",
         colour="Jet Group"))

(p3 <- exp_data %>%
  ggplot(aes(x=lambda, y=ams, colour=G)) +
    geom_boxplot() + 
    theme_minimal() +
    labs(x="Strength of l2 regularisation", y="AMS",
         title="AMS of experiments with varying l2 regularisation",
         colour="Jet Group"))

Record results

pdf("../doc/figs/Experiments_nrbf.pdf", width=5, height=4)
p1
dev.off()
pdf("../doc/figs/Experiments_poly.pdf", width=5, height=4)
p2
dev.off()
pdf("../doc/figs/Experiments_lambda.pdf", width=5, height=4)
p3
dev.off()

Print the results as a LaTeX table that can be imported into a report easily. Not too surprisingly, different models appear at the top of our list depending on how we decide to choose the best model. Since our goal is ultimately to maximise AMS, it makes sense that this (rather than AUC) should be the metric we choose, despite the higher variance it displays. To mitigate this variance, we could instead sort by the scaled version of AMS, dividing the mean over the folds by the median absolute deviation, to optimise for the best result with the least variation and therefore, hopefully, the best generalisation. However, since the median absolute deviation is relatively small compared to the AMS measured, we decided to stick with the pure metric and choose our best model by ranking the results by cross-validated out-of-sample (CV OOS) AMS.
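
As an illustration of how the ranking shifts, the two orderings can be compared per jet group with dplyr (a sketch using the columns defined above; the selection we actually use follows below):

# sketch: top experiment per jet group under the raw and scaled AMS rankings
top_by_ams <- exp_data %>%
  group_by(G) %>%
  arrange(desc(ams)) %>%
  slice(1) %>%
  ungroup()

top_by_scaled_ams <- exp_data %>%
  group_by(G) %>%
  arrange(desc(scaled.ams)) %>%
  slice(1) %>%
  ungroup()

If the two data frames disagree for any group, that group is where the variance penalty would change our choice of model.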

library(Hmisc)
#For each group find the experiment with the highest ams
output <- NULL
for(i in 1:3){
  exp_sub <- filter(exp_data, G==i) %>%
    arrange(desc(ams))

  output <- rbind(output, exp_sub[1,])
}

#filter columns
output <- select(output, 
                 G, n_rbf, lambda, poly, auc, ams, mad.ams., scaled.ams)
output[,5:8] <- round(output[,5:8], 3)

# make column names latex friendly
colnames(output) <- sub("(\\w+)\\.(\\w+)\\.?", "\\$\\1_\\{\\2\\}\\$", colnames(output))
colnames(output) <- sub("n\\_rbf", "\\$n_{rbf}\\$", colnames(output))

# Generate LaTeX table
latex(output, file=path_join(c(dirname(getwd()), "doc/results_table_groups.tex")), caption="Results table of top experiment for each of the groups", label="table:results", rowname = NULL)

We can see the final results for the best model in each group in Table \ref{table:results}. These results are used in the following section to generate our final model and, subsequently, our final performance on the hold-out test dataset. \input{../doc/results_table_groups}


