knitr::opts_chunk$set(
    message = FALSE,
    warning = FALSE,
    collapse = TRUE,
    comment = "#>"
)

Introduction

Supervised learning is the branch of statistical learning in which we construct a predictive model for regression or classification. The splendid package focuses on classification with an ensemble framework: many classification algorithms are fit, and predictions are made across bootstrap replicates of the data. An ensemble classifier is then built from the best-performing algorithms according to a set of evaluation measures. Because no single classifier performs best on every data set, ensemble classifiers are increasingly useful. The objective is to use this classifier to obtain highly accurate predictions in independent data sets for purposes such as diagnostic identification. In genomic studies, for example, one might use the class labels of a cancer subtype in one cohort to predict the subtypes in an independent cohort.

splendid is currently only available on GitHub (the second line below will be uncommented once the repository becomes public).

# install.packages("devtools")
# devtools::install_github("AlineTalhouk/splendid")
library(splendid)
library(knitr)
data(hgsc)

Overview

The main function of splendid is splendid(), and the usage is intuitive:

class <- attr(hgsc, "class.true")
table(class)
sl_result <- splendid(data = hgsc, class = class, n = 2,
                      algorithms = c("slda", "knn", "svm"), rfe = FALSE)

The resulting object is a list with the following elements:

str(sl_result, max.level = 2)

Classification

The first step in the splendid pipeline is classification. Given explanatory data and a reference response of classes, we wish to build a classifier that can accurately predict class membership in a separate validation data set. To avoid overfitting, we split the given data into a training set and a test set. The training set is constructed by taking a random sample, with replacement, of all samples. The test set comprises all samples not included in the corresponding training set, also known as the out-of-bag (OOB) sample. This is a simple bootstrap resampling scheme, which we replicate enough times to capture sampling variability.
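
For illustration, here is a minimal sketch of this resampling scheme in base R (the number of replicates and the seed are arbitrary; splendid performs this splitting internally):

# Draw bootstrap training indices and their out-of-bag (OOB) complements
set.seed(123)
n_boot <- 2
boot_idx <- replicate(n_boot, sample(nrow(hgsc), replace = TRUE), simplify = FALSE)
oob_idx <- lapply(boot_idx, function(b) setdiff(seq_len(nrow(hgsc)), b))
lengths(oob_idx)  # OOB test set sizes vary across replicates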

Algorithms

There exist a vast number of classification algorithms. Those currently in splendid are:

These algorithms are implemented in classification().

Hyperparameters and other Details

Certain algorithms have hyperparameters that must be tuned in order to select the best model before prediction. We perform a grid search over a pre-specified range of hyperparameter values and choose the optimal values using caret::train().
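
As an illustration of this kind of grid search (a hedged sketch, not splendid's internal code; the method and candidate values here are arbitrary choices for demonstration):

library(caret)

# Tune k for a k-nearest neighbours model over a pre-specified grid,
# keeping the value with the best cross-validated accuracy
knn_fit <- train(
  x = hgsc, y = factor(class),
  method = "knn",
  tuneGrid = data.frame(k = seq(3, 15, by = 2)),
  trControl = trainControl(method = "cv", number = 5)
)
knn_fit$bestTune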

The ranges for the tuning parameters are:

We use e1071::tune() for neural network parameters. The ranges are:
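
A hedged sketch of how such a tuning call might look (the size and decay values below are placeholders for illustration, and only a subset of features is used to keep the example fast):

library(e1071)
library(nnet)

# Grid search over hidden-layer size and weight decay, evaluated by
# e1071::tune()'s internal cross-validation
nn_df <- data.frame(class = factor(class), hgsc[, 1:50])
nn_tuned <- tune(
  nnet, class ~ ., data = nn_df,
  ranges = list(size = c(2, 5), decay = c(0, 0.01, 0.1)),
  trace = FALSE
)
nn_tuned$best.parameters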

Some algorithms have properties that require additional data manipulations before classification.

Feature Selection

We use Recursive Feature Elimination (RFE)^3^ on lda, rf, svm, and adaboost_m1 to reduce the dimensionality before classification. For svm, we perform feature selection before tuning because of the computational complexity; for the other three algorithms, feature selection is embedded within the tuning step. Set rfe = TRUE in splendid() to use feature selection.

One limitation of RFE is that an a priori set of feature subset sizes needs to be specified, which determines the search space for the algorithm. By default, we set the sizes parameter to every 5th integer from 0 up to one-half of the smallest class size. Recall the class sizes:

table(class)

We thus tell RFE to search for the best models with 5, 10, ..., 50 features. Cross-validation with 2 folds is used within the algorithm.
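
Below is a hedged sketch of how RFE with these settings could be run directly through caret (the random-forest ranking functions rfFuncs are used purely for illustration; splendid wires RFE into its own pipeline):

library(caret)

# Recursive feature elimination with 2-fold CV over subset sizes 5, 10, ..., 50
rfe_ctrl <- rfeControl(functions = rfFuncs, method = "cv", number = 2)
rfe_fit <- rfe(
  x = hgsc, y = factor(class),
  sizes = seq(5, 50, by = 5),
  rfeControl = rfe_ctrl
)
predictors(rfe_fit)  # the features retained by the best subset size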

Prediction

There is a different prediction method for each classifier, since they are all imported from external packages. The prediction() function calls the appropriate method based on the class of the model output from classification(). prediction() also performs some manipulations so that the results all share the same data structure: unnamed factors, with labels given in the same order as the true class labels.

The method for "pam" is an exception: the output from prediction() contains not only the predicted class labels, but also the cross-validated threshold value, named delta, calculated from the training set and used for prediction on the test set.

str(sl_result$preds, max.level = 2, list.len = 2)

Attributes

Note that each predicted class object has an attribute holding the class probabilities in a matrix. We ensure that the class probabilities for every sample sum to one by making a small adjustment to one of the classes, depending on whether the total was over or under one. These matrices are used to compute evaluation measures and to generate the discriminating graphs detailed below, and are stored in attr(*, "prob").

To obtain better performance under the evaluation metrics, we may want to exclude samples whose maximum class probability falls below a threshold. For example, we may only compare the true test labels with the corresponding OOB samples for which the winning class has a probability of at least 50%. Samples below this threshold are labelled "unclassified". If the threshold results in all samples being unclassified, we fall back to the unfiltered, original predicted labels for evaluation. This object is stored in attr(*, "class.thres").

The final attribute of a prediction object, stored in attr(*, "class.prop"), gives the proportion of classified cases; this proportion decreases as the threshold increases.
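
For example, these attributes could be inspected as follows (a sketch that assumes the nesting shown by str(sl_result$preds) above, i.e. one list per algorithm containing one element per bootstrap replicate):

svm_pred <- sl_result$preds[["svm"]][[1]]  # svm predictions, first bootstrap replicate
prob <- attr(svm_pred, "prob")             # per-sample class probability matrix
summary(rowSums(prob))                     # each row sums to one after the adjustment
head(attr(svm_pred, "class.thres"))        # thresholded labels, possibly "unclassified"
attr(svm_pred, "class.prop")               # proportion of samples classified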

Evaluation

Metrics

Evaluation measures are important because they quantify the prediction performance of a classifier. The table below shows all the measures for svm across two bootstrap replicates. The logloss, auc, and pdi measures make use of the predicted class probabilities, whereas the rest are computed from the multiclass confusion matrix of reference and predicted class labels.

kable(sl_result$evals$svm, caption = "SVM Evaluation Metrics")

Note that in the multiclass case, we have variants of some of the measures listed above, depending on how the confusion matrix is summarized. A macro-averaged metric computes the metric on each of the one-vs-all confusion matrices and then takes the mean across classes. A micro-averaged metric is computed on the element-wise sum of all one-vs-all confusion matrices.

Hence, there are macro_ppv, macro_npv, macro_sensitivity, macro_specificity, and macro_f1 macro-averaged metrics, and micro_mcc as the "only" micro-averaged metric. Accuracy is mathematically equivalent to micro-averaged PPV, sensitivity, and F1-score, so we do not redundantly add those to the list. Furthermore, calculating mcc in a macro-averaged way is not recommended, so we compute it on the full multiclass confusion matrix.
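
The distinction can be made concrete with a small base-R sketch using toy labels (illustrative only, not splendid's implementation):

# Toy 3-class example: macro- vs micro-averaged sensitivity
truth      <- factor(c("A", "A", "A", "B", "B", "C", "C", "C", "C", "C"))
pred_class <- factor(c("A", "A", "B", "B", "C", "C", "C", "C", "A", "C"),
                     levels = levels(truth))
cm <- table(pred_class, truth)

# One-vs-all sensitivity (recall) per class, then the mean across classes
macro_sensitivity <- mean(diag(cm) / colSums(cm))

# Micro-averaging pools the one-vs-all confusion matrices; for sensitivity
# this collapses to overall accuracy
micro_sensitivity <- sum(diag(cm)) / sum(cm)
c(macro = macro_sensitivity, micro = micro_sensitivity)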

Class-specific measures fill in the rest of the evaluation table, labelled using the naming scheme metric.class.

The .632 estimator is implemented for the multiclass log-loss function. This error estimate compromises between the upwardly biased leave-one-out bootstrap error and the downwardly biased training (or apparent) error. If we set plus = TRUE in splendid() (the default), the .632+ estimator is calculated instead, an improvement over the .632 estimator that takes into account the amount of overfitting.^5^ The value is stored as an attribute in the evaluation object:

vapply(sl_result$evals, attr, "err_632plus", FUN.VALUE = double(1))
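
For reference, the two estimators combine the apparent (training) error and the bootstrap OOB error as follows (a sketch of the formulas from Efron and Tibshirani^5^ with hypothetical error values; splendid computes these internally):

err_train <- 0.20  # apparent (training) log loss, hypothetical
err_oob   <- 0.45  # leave-one-out bootstrap (OOB) log loss, hypothetical
gamma     <- 0.60  # no-information error rate, hypothetical

# .632 estimator: fixed weights on the two errors
err_632 <- 0.368 * err_train + 0.632 * err_oob

# .632+ estimator: the weight grows with the relative overfitting rate R
R <- (err_oob - err_train) / (gamma - err_train)
w <- 0.632 / (1 - 0.368 * R)
err_632plus <- (1 - w) * err_train + w * err_oob
c(err_632 = err_632, err_632plus = err_632plus)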

Plots

To assess the performance of a classifier, we can look at a discriminating plot and a reliability plot.

data(hgsc)
class <- factor(attr(hgsc, "class.true"))
set.seed(1)
training.id <- sample(seq_along(class), replace = TRUE)
test.id <- which(!seq_along(class) %in% training.id)
mod <- classification(hgsc[training.id, ], class[training.id], "adaboost")
pred <- prediction(mod, hgsc, test.id, class = class)
evals <- evaluation(class[test.id], pred, plot = TRUE)

Ensemble Construction

For each bootstrap replicate training set, we want to find the top-performing classifier. We use Rank Aggregation with the Genetic Algorithm to choose the top algorithm by comparing across evaluation metrics.^4^ In the case of log loss, we first invert its value, since its objective is minimization (lower values are better). Class-specific measures are not included in the rank aggregation because they are interdependent. For example, a sample with a true class of "A" should not be treated differently depending on whether it was misclassified into "B" or "C".
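
A hedged sketch of this aggregation step using the RankAggreg package^4^ (the rankings below are made up for illustration; splendid builds the ordered lists from its own evaluation output):

library(RankAggreg)

# Each row orders the candidate algorithms from best to worst under one metric
ranks <- rbind(
  accuracy = c("svm", "knn", "slda"),
  macro_f1 = c("svm", "slda", "knn"),
  logloss  = c("knn", "svm", "slda")  # already inverted so that larger is better
)
consensus <- RankAggreg(ranks, k = 3, method = "GA", verbose = FALSE)
consensus$top.list  # consensus ordering; the first entry is the top classifier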

After obtaining the list of top classifiers for each bootstrap replicate, we sort them by decreasing frequency and keep the top 3 to use in the ensemble. The ensemble_mods output of splendid is a list of models fit on the full data for the top classifiers.

References

  1. Ahdesmäki, Miika, and Korbinian Strimmer. "Feature selection in omics prediction problems using cat scores and false nondiscovery rate control." The Annals of Applied Statistics 4.1 (2010): 503-519.
  2. https://web.stanford.edu/~hastie/glmnet/glmnet_beta.html
  3. https://topepo.github.io/caret/recursive-feature-elimination.html
  4. Pihur, Vasyl, Susmita Datta, and Somnath Datta. "RankAggreg, an R package for weighted rank aggregation." BMC bioinformatics 10.1 (2009): 62.
  5. Efron, Bradley, and Robert Tibshirani. "Improvements on Cross-Validation: The .632+ Bootstrap Method." Journal of the American Statistical Association 92.438 (1997): 548-560.

