In this tutorial, we present the exprso package for R, a library built to tackle a wide variety of supervised machine learning tasks, including the construction of ensemble classifiers. We designed exprso using a modular framework, whereby each function acts as a self-contained, yet interchangeable, part of the whole. With these modules, the investigator has access to multiple tools that they can combine in almost any sequence to build their own personalized machine learning pipelines on the fly. In this way, we balance the simplicity of automation with endless customization, all while maintaining software extensibility.
We can install the most recent version of exprso directly from CRAN.
install.packages("exprso")
library(exprso)
set.seed(1)
This package contains five object types that handle the machine learning procedures:
- ExprsArray objects store the data: the feature table and the sample annotations. ExprsBinary and ExprsMulti objects handle dichotomous and multi-class data, respectively.
- ExprsModel objects store a fitted classifier along with the feature selection history that produced it. ExprsMachine and ExprsModule objects handle dichotomous and multi-class classifiers, respectively.
- ExprsPipeline objects store the results of high-throughput model building (see the pl functions below).
- ExprsEnsemble objects store an ensemble built from multiple ExprsModel objects.
- ExprsPredict objects store the predictions made when deploying a model on an ExprsArray object.

Functions included in this package rely on the objects listed above. Some of these functions return an updated version of the same object type provided, while others return a new object type. We have adopted a nomenclature to help organize the functions available in this package. In this scheme, most functions have a few letters at the beginning of their name which designate their use:

- mod functions modify ExprsArray objects. Returns an updated ExprsArray object. See ?mod.
- split functions split ExprsArray objects into training and test sets. Returns an ExprsArray list. See ?split.
- fs functions perform feature selection on an ExprsArray object. See ?fs.
- build functions build a classifier. Returns an ExprsModel or ExprsEnsemble object. See ?build.
- pl functions perform high-throughput learning. Returns an ExprsPipeline object. See ?pl.
- pipe functions process ExprsPipeline objects. Returns an updated ExprsPipeline object. See ?pipe.

(Figure: exprso-diagram.jpg)
We recommend importing data using the exprso
function. This function has two arguments. The first expects the data with samples as rows and features as columns. The second expects the annotations with samples as rows where the first column contains the outcome to predict.
data(iris)
array <- exprso(iris[1:80, 1:4], iris[1:80, 5])
To subset an ExprsArray
object, we provide methods for the [
and $
operators that access the @annot
annotations slot directly. Alternatively, one could use the modSubset
or subset
functions. Note that the "defineCase" column always contains the outcome to predict. For binary classification, this is always coded as "Case" and "Control".
sub <- array[array$defineCase == "Case", ]
sub <- modSubset(array, colBy = "defineCase", include = "Case")
sub <- subset(array, subset = array$defineCase == "Case")
When performing classification, an investigator will typically withhold some percentage of the data to use later when assessing classifier performance, effectively splitting the data into two. The first dataset, called the training set, gets used to build the model, while the other, called the external validation or test set, gets used to evaluate the model. This package offers two convenience functions for splitting the data, splitSample
and splitStratify
. The former builds the training set based on simple random sampling (with or without replacement), assigning the remaining subjects to the test set. The latter builds the training set using stratified random sampling. These functions both return a list of two ExprsArray
objects corresponding to the training set and test set respectively. Below, we use the splitStratify
function to build the training and test sets through a stratified random sample across the dichotomous (binary) classification annotation.
arrays <- splitStratify(array, percent.include = 67, colBy = NULL)
array.train <- arrays[[1]]
All subjects not included in the training set (based on the percent.include
argument) will automatically get assigned to the test set. Sometimes, when using splitStratify
on a dataset with an unequal number of annotated subjects, the resultant test set may contain relative class frequencies that differ from the training set. If needed, we can fix this so-called "imbalance" at the cost of reducing sample size by performing splitStratify
a second time. Now, we will use the test set as the input and let percent.include = 100
(keeping the other parameters the same). This will split the test set such that the new "training set" (i.e., slot 1) now contains the balanced test set and the new "test set" (i.e., slot 2) now contains the "spillover".
balance <- splitStratify(arrays[[2]], percent.include = 100, colBy = NULL)
array.test <- balance[[1]]
Considering the high-dimensionality of many datasets, it is prudent and often necessary to prioritize which features to include during classifier construction. This package provides functions for some of the most frequently used feature selection methods. Each function works as a self-contained wrapper that (1) pre-processes the ExprsArray
input, (2) performs the feature selection, and (3) returns an ExprsArray
output with an updated feature selection history. These histories get passed along at every step of the way until they eventually get used to pre-process an unlabeled dataset during classifier deployment (i.e., prediction).
One feature selection function is fsStats
. This performs basic feature selection based on simple statistical tests. Specifically, this function will rank features using either the Student's $t$-test or the Kolmogorov-Smirnov test. Below, the argument top = 0
tells the program to rank all features.
array.train <- fsStats(array.train, top = 0, how = "t.test")
The argument top
specifies either the names or the number of features to supply to the feature selection method, not what the user intends to retrieve from the feature selection method. When calling the first feature selection method (or the first build method, if skipping feature selection), a numeric top
argument will select a "top ranked" feature set according to their default order in the ExprsArray
input. Then, because each feature selection method returns an ExprsArray
object with the features implicitly (re-)ranked, all subsequent numeric top
arguments will select a "top ranked" feature set according to the results of the previous feature selection method. For example, the third feature selection call draws the top features from the second feature ranking. The user may deploy, in tandem, any number of these functions in whatever order they choose.
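For illustration, consider the sketch below, which reuses the training set from above (the variable names are only for illustration). The top argument can name features explicitly, and because each feature selection call re-ranks the features it returns, a numeric top in the next call draws from that new ranking rather than from the original column order.

# top can name features explicitly (these are the original iris columns)
ranked <- fsStats(array.train, top = c("Sepal.Length", "Petal.Width"), how = "t.test")
# a numeric top now refers to the ranking produced by the call above
reduced <- fsPrcomp(ranked, top = 2)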
Another feature selection function is fsPrcomp
. This performs dimension reduction by way of principal components analysis (PCA). Like the feature selection steps, all dimension reduction models get saved in the ExprsArray
history to deploy later on a test set. Below, we use the top 50 features for PCA.
array.train <- fsPrcomp(array.train, top = 50)
The other feature selection methods included in this package all follow the same use pattern. Below, we plot the first three components of the training set in 3-dimensional space.
plot(array.train)
This package provides functions for several supervised machine learning methods, including support vector machines, artificial neural networks, random forests, and more. These functions require an ExprsArray
object as input and return an ExprsModel
object as output. This ExprsModel
object contains the feature selection history that led up to classifier construction as well as the classifier itself. Below, we build an artificial neural network with five intermediate nodes in the hidden layer using the top 10 components from the training set above.
mach <- buildANN(array.train, top = 10, size = 5)
We deploy an ExprsModel
object using predict
. This function returns an ExprsPredict
object containing the prediction results in three forms: class label predictions, probabilities, and decision boundary values. The probabilities and decision boundary values relate to one another by a logistic transformation. The prediction (@pred
) slot converts these metrics into a single "all-or-nothing" class label assignment.
Another function, calcStats
, allows us to compare the prediction results against the actual class labels. The aucSkip
argument specifies whether to skip calculating the area under the receiver operating characteristic (ROC) curve. Note, however, that performance metrics calculated using the ROC curve may differ from those calculated using a confusion matrix because the former may adjust the discrimination threshold to optimize sensitivity and specificity. The discrimination threshold is automatically chosen as the point along the ROC curve which minimizes the Euclidean distance from (0, 1). Below, we deploy a classifier on the test set, then use the result to calculate classifier performance.
pred <- predict(mach, array.test)
calcStats(pred)
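As a quick sanity check, one might also inspect the class label assignments directly and recompute the metrics from the confusion matrix alone. This is a sketch; we assume here that setting aucSkip = TRUE skips the ROC-based calculation, per the argument name.

head(pred@pred)                  # "all-or-nothing" class label assignments
calcStats(pred, aucSkip = TRUE)  # metrics from the confusion matrix only (assumed)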
This package includes several functions named with the prefix pl
. These "pipeline" functions exist to help with high-throughput learning. In other words, they wrap repetitive tasks into a single call. This includes extensive parameter searches as well as some elaborate cross-validation. Some of these pl
functions can even have other pl
functions embedded within them. For example, the function plGrid
contains the function plCV
for managing simple $v$-fold and leave-one-out cross-validation.
When constructing a classifier using a build method, we can only specify one set of parameters at a time. However, we often want to test models across a vast range of parameters. For this task, we provide the plGrid
function. This function builds and deploys a model for each combination of all provided arguments. For example, calling plGrid
with the arguments how = "buildSVM"
, top = c(3, 5, 10)
, cost = 10^(-3:3)
, and kernel = c("linear", "radial")
will yield 42 classifiers.
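Written out, that parameter search might look like the sketch below, reusing the training and test sets from above (treat it as illustrative for this small iris example; fold = NULL should skip the embedded cross-validation discussed below).

# 3 values of top x 7 values of cost x 2 kernels = 42 classifiers
gs.many <- plGrid(array.train = array.train,
                  array.valid = array.test,
                  how = "buildSVM",
                  top = c(3, 5, 10),
                  kernel = c("linear", "radial"),
                  cost = 10^(-3:3),
                  fold = NULL)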
We note here that this function only accepts one how
per run. To analyze the results of multiple build
parameter searches jointly, combine the results of multiple plGrid
function calls using ?conjoin
. We will also note that plGrid
does not execute any data splitting or feature selection, both of which the user may perform beforehand. However, plGrid
does allow the user to specify multiple classifier sizes by providing a numeric vector as the top
argument.
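For example, two separate searches can be merged into a single ExprsPipeline object as in the sketch below. Here, buildRF is assumed to be the random forest build method (following the build naming convention and the random forests mentioned earlier) and gets called with its default parameters.

gs.svm <- plGrid(array.train = array.train, array.valid = array.test,
                 how = "buildSVM", top = c(2, 4), fold = NULL,
                 kernel = "linear", cost = 10^(-3:3))
gs.rf <- plGrid(array.train = array.train, array.valid = array.test,
                how = "buildRF", top = c(2, 4), fold = NULL)
gs.joined <- conjoin(gs.svm, gs.rf)  # one performance summary for both searches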
The plGrid
function can also calculate $v$-fold cross-validation accuracy at each step of the parameter search (toggled by supplying a non-NULL
argument to fold
). We emphasize, however, that the cross-validation method embedded within plGrid
(i.e., plCV
) does not re-select features with each fold, which may lead to overly-optimistic measures of classifier performance in the setting of prior feature selection.
Below, we run through a few different support vector machine builds, calculating leave-one-out cross-validation accuracy (i.e., via fold = 0
) at each step.
gs <- plGrid(array.train = array.train,
             array.valid = array.test,
             top = c(2, 4),
             how = "buildSVM",
             fold = 0,
             kernel = "linear",
             cost = 10^(-3:3))
The returned object contains two slots, @summary
and @machs
, which store the performance summary and corresponding ExprsModel
objects, respectively. The performance summary contains columns detailing the parameters used to build each machine along with performance metrics for the training set (and test set, if provided). Columns named with "train" describe training set performances. Columns named with "valid" describe test set performances. The column, "train.plCV"
, contains the cross-validation accuracy, if performed. The returned ExprsPipeline
object also contains an ExprsModel
object for each entry in the performance summary.
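For example, one could inspect the results directly (a quick sketch):

head(gs@summary)  # one row of parameters and performance metrics per classifier
gs@machs[[1]]     # (assuming @machs is a list) the ExprsModel for the first row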
To subset an ExprsPipeline
object, we provide methods for the [
and $
operators that access the @summary
performance summary slot directly. Alternatively, one could use the pipeSubset
or subset
functions.
sub <- gs[gs$cost == 1, ]
sub <- pipeSubset(gs, colBy = "cost", include = 1)
sub <- subset(gs, subset = gs$cost == 1)
The exprso package also provides a means by which to perform elaborate cross-validation, including Monte Carlo style and 2-layer "nested" cross-validation. Analogous to how plGrid
manages multiple build and predict tasks, these pipelines (i.e., plMonteCarlo
and plNested
) effectively manage multiple plGrid
tasks. In order to organize the sheer number of arguments necessary to execute these functions, we have implemented argument handler functions (i.e., ctrlSplitSet
, ctrlFeatureSelect
, and ctrlGridSearch
) that handle data splitting, feature selection, and grid searching, respectively.
In simplest terms, plMonteCarlo
and plNested
use a single training set to calculate classifier performances on a withheld internal validation set. This internal validation set serves as a kind of proxy for a statistically independent test set. The main difference between plMonteCarlo
and plNested
stems from how the internal validation set gets constructed. On one hand, the plMonteCarlo
method uses the ctrlSplitSet
argument handler to split the training set into a training subset and an internal validation set with each bootstrap. On the other hand, the plNested
method splits the training set into $v$-folds, treating each fold as an internal validation set while treating those outside that fold as the training subset.
For clarity, we call any performance measured on an internal validation set the outer-loop cross-validation performance, and any cross-validation accuracy measured using the training subset (i.e., via plGrid
) the inner-loop cross-validation performance. In the performance summaries of the ExprsPipeline
objects returned by plMonteCarlo
and plNested
, columns named with "train" describe training subset performances while columns named with "valid" describe internal validation set performances. Although the inner-loop cross-validation performances (i.e., via plCV
) can still over-estimate classifier performance owing to prior feature selection, the outer-loop cross-validation performances derive from classifiers that have undergone feature selection anew with each bootstrap or fold. However, we emphasize here that performing feature selection on a training set prior to the use of plMonteCarlo
or plNested
can still result in overly optimistic outer-loop cross-validation performances.
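A minimal plNested sketch might look as follows. We assume here that plNested accepts a fold argument together with the ctrlFeatureSelect and ctrlGridSearch handlers (no ctrlSplitSet is needed, since the $v$-fold split replaces the bootstrap split); check ?plNested for the exact signature.

nest.fs <- ctrlFeatureSelect(func = "fsStats", top = 0, how = "t.test")
nest.gs <- ctrlGridSearch(func = "plGrid", how = "buildSVM", top = c(2, 4),
                          kernel = "linear", cost = 10^(-3:3), fold = 10)
# assumed signature: the outer v-fold split stands in for the bootstrap split
nest <- plNested(arrays[[1]], fold = 10, ctrlFS = nest.fs, ctrlGS = nest.gs)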
In the example below, we perform five iterations of plMonteCarlo
using the original training set as it existed before it underwent any feature selection (i.e., the first slot of the object arrays
). With each iteration, we (1) sample the subjects randomly through bagging (i.e., random sampling with replacement), (2) perform feature selection using the Student's t-test, and then (3) execute a grid-search across multiple support vector machine parameters and classifier sizes. In this framework, the user could instead perform any number of feature selection tasks simply by supplying a list of multiple ctrlFeatureSelect
argument handlers to the ctrlFS
argument below.
ss <- ctrlSplitSet(func = "splitSample", percent.include = 67, replace = TRUE)
fs <- ctrlFeatureSelect(func = "fsStats", top = 0, how = "t.test")
gs <- ctrlGridSearch(func = "plGrid", how = "buildSVM", top = c(2, 4),
                     kernel = "linear", cost = 10^(-3:3), fold = 10)
boot <- plMonteCarlo(arrays[[1]], B = 5, ctrlSS = ss, ctrlFS = fs, ctrlGS = gs)
Next, we reduce the results of plMonteCarlo
to a single performance metric by feeding the returned ExprsPipeline
object through calcMonteCarlo
. Note that this helper function will fail unless plGrid
has called plCV
during the parameter grid-search.
calcMonteCarlo(boot, colBy = "valid.auc")
This package provides two ways to build ensemble classifiers. The first involves manually combining multiple ExprsModel
objects together through the function buildEnsemble
(or ?conjoin
). The second involves an orchestrated manipulation of an ExprsPipeline
object through the pipeFilter
function.
This latter approach filters an ExprsPipeline
object in (up to) three steps. First, a threshold filter gets imposed, whereby any model with a performance less than the threshold filter, how
, gets excluded. Second, a ceiling filter gets imposed, whereby any model with a performance greater than the ceiling filter, gate
, gets excluded. Third, an arbitrary subset occurs, whereby the top N models in the ExprsPipeline
object get selected based on the argument top
. In the case that the @summary
slot contains the column "boot" (e.g., in the results of plMonteCarlo
), pipeFilter
selects the top N models for each unique bootstrap. The user may skip any one of these three filter steps by setting the respective argument to 0.
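For instance, a pipeFilter call might look like the sketch below. The cutoff values are arbitrary, and the colBy argument is assumed to work here as it does with buildEnsemble further down.

# keep models with valid.auc of at least 0.8, drop any above 0.99,
# then take the single best model (per bootstrap, if a "boot" column exists)
filtered <- pipeFilter(boot, colBy = "valid.auc", how = 0.8, gate = 0.99, top = 1)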
When calling the buildEnsemble
method for an ExprsPipeline
object, any classifiers remaining after the pipeFilter
filter will get assembled into a single ensemble classifier. Ensemble classifiers get stored as an ExprsEnsemble
object which is simply a container for a list of multiple ExprsModel
objects.
In the example below, we will build an ensemble using the single best classifier from each plMonteCarlo
bootstrap, and then deploy that ensemble on the withheld test set from above.
ens <- buildEnsemble(boot, top = 1, colBy = "valid.auc")
pred <- predict(ens, array.test, how = "majority")
calcStats(pred)
Owing to how the pipeFilter
function handles ExprsPipeline
objects that contain a "boot" column in the performance summary (i.e., @summary
), we include the pipeUnboot
function to rename this "boot" column to "unboot". To learn more about how ExprsEnsemble
predicts class labels, we refer the user to the documentation, ?'exprso-predict'
. In addition, we encourage the user to visit the documentation, ?'ExprsPipeline-class'
and ?'ExprsEnsemble-class'
.
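For example, to pool all bootstraps into a single candidate set before filtering (a sketch, assuming pipeUnboot takes the pipeline object as its only required argument):

boot.pooled <- pipeUnboot(boot)  # "boot" column renamed, so no per-bootstrap selection
ens.pooled <- buildEnsemble(boot.pooled, top = 3, colBy = "valid.auc")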
We conclude this vignette by alerting the user that the exprso package also includes a framework for performing multi-class classification in an automated manner. These methods use the "1-vs-all" approach to multi-class classification, whereby each individual class label has a turn getting treated as the positive class label in a dichotomous (binary) scheme. Then, the results of each iteration get integrated into a single construct. To learn more about multi-class classification, we refer the user to the documentation for ?doMulti
and the companion vignette, "Advanced Topics for the exprso package".
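For instance, importing all three iris species should yield a multi-class dataset (a sketch, assuming exprso() detects more than two outcome levels and returns an ExprsMulti object):

# all 150 samples span three species, so the outcome is multi-class
array.multi <- exprso(iris[, 1:4], iris[, 5])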
The exprso package framework also extends to building and deploying regression models to predict continuous outcomes. All pipeline and ensemble methods discussed here also apply to regression, although some fs
and build
methods work only for classification.
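As a rough sketch (assuming exprso() treats a numeric outcome as continuous), a regression dataset could be imported in the same way:

# predict Sepal.Length (numeric) from the remaining iris measurements
array.reg <- exprso(iris[1:80, 2:4], iris[1:80, 1])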
Thank you for your interest in exprso. Although we have made tremendous progress in formalizing this library in a reliable package framework, some of the tools included here may change. To the best of our knowledge, we have followed machine learning "best practices" when developing this software, but if you know better than us, please let us know! File any and all issues on GitHub. In addition, we always welcome suggestions for new tools that we could include in future releases. Happy learning!