In this tutorial, we present the exprso package for R, a library built to tackle a wide variety of supervised machine learning tasks, including the construction of ensemble classifiers. We designed exprso using a modular framework, whereby each function acts as a self-contained, yet interchangeable, part of the whole. With these modules, the investigator has access to multiple tools that they can combine in almost any sequence to build their own personalized machine learning pipelines on the fly. In this way, we balance the simplicity of automation with endless customization, all while maintaining software extensibility.
We can install the most recent version of exprso directly from CRAN.
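install.packages("exprso")
library(exprso)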
This package contains five object types that handle the machine learning procedures:

- ExprsArray objects store the data. The sub-classes ExprsBinary and ExprsMulti handle dichotomous and multi-class data, respectively.
- ExprsModel objects store the models. The sub-classes ExprsMachine and ExprsModule handle dichotomous and multi-class classifiers, respectively.
- ExprsPipeline objects store the results of high-throughput parameter searches.
- ExprsEnsemble objects store multiple models assembled as an ensemble classifier.
- ExprsPredict objects store the results of model deployment (i.e., prediction).
Functions included in this package rely on the objects listed above. Some of these functions return an updated version of the same object type provided, while others return a new object type. We have adopted a nomenclature to help organize the functions available in this package. In this scheme, most functions have a few letters at the beginning of their name which designate their use:

- mod functions modify ExprsArray objects. Returns an updated ExprsArray object.
- split functions split ExprsArray objects into training and test sets. Returns a list of two ExprsArray objects.
- fs functions select features from ExprsArray objects. Returns an updated ExprsArray object.
- build functions build models from ExprsArray objects. Returns an ExprsModel object.
- pl functions execute high-throughput pipelines. Returns an ExprsPipeline object.
- pipe functions process ExprsPipeline objects. Returns an updated ExprsPipeline object.
We recommend importing data using the
exprso function. This function has two arguments. The first expects the data with samples as rows and features as columns. The second expects the annotations with samples as rows where the first column contains the outcome to predict.
data(iris)
array <- exprso(iris[1:80, 1:4], iris[1:80, 5])
To subset an
ExprsArray object, we provide methods for the
[ and $ operators that access the @annot annotations slot directly. Alternatively, one could use the modSubset or subset functions. Note that the "defineCase" column always contains the outcome to predict. For binary classification, this is always coded as "Case" and "Control".
sub <- array[array$defineCase == "Case", ]
sub <- modSubset(array, colBy = "defineCase", include = "Case")
sub <- subset(array, subset = array$defineCase == "Case")
When performing classification, an investigator will typically withhold some percentage of the data to use later when assessing classifier performance, effectively splitting the data into two. The first dataset, called the training set, gets used to build the model, while the other, called the external validation or test set, gets used to evaluate the model. This package offers two convenience functions for splitting the data,
splitSample and splitStratify. The former builds the training set based on simple random sampling (with or without replacement), assigning the remaining subjects to the test set. The latter builds the training set using stratified random sampling. These functions both return a list of two
ExprsArray objects corresponding to the training set and test set respectively. Below, we use the
splitStratify function to build the training and test sets through a stratified random sample across the dichotomous (binary) classification annotation.
arrays <- splitStratify(array, percent.include = 67, colBy = NULL)
array.train <- arrays[[1]]
All subjects not included in the training set (based on the
percent.include argument) will automatically get assigned to the test set. Sometimes, when using
splitStratify on a dataset with an unequal number of annotated subjects, the resultant test set may contain relative class frequencies that differ from the training set. If needed, we can fix this so-called "imbalance" at the cost of reducing sample size by performing
splitStratify a second time. Now, we will use the test set as the input and let
percent.include = 100 (keeping the other parameters the same). This will split the test set such that the new "training set" (i.e., slot 1) now contains the balanced test set and the new "test set" (i.e., slot 2) now contains the "spillover".
balance <- splitStratify(arrays[[2]], percent.include = 100, colBy = NULL)
array.test <- balance[[1]]
Considering the high-dimensionality of many datasets, it is prudent and often necessary to prioritize which features to include during classifier construction. This package provides functions for some of the most frequently used feature selection methods. Each function works as a self-contained wrapper that (1) pre-processes the
ExprsArray input, (2) performs the feature selection, and (3) returns an
ExprsArray output with an updated feature selection history. These histories get passed along at every step of the way until they eventually get used to pre-process an unlabeled dataset during classifier deployment (i.e., prediction).
One feature selection function is
fsStats. This performs basic feature selection based on simple statistical tests. Specifically, this function will rank features using either the Student's $t$-test or the Kolmogorov-Smirnov test. Below, the argument
top = 0 tells the program to rank all features.
array.train <- fsStats(array.train, top = 0, how = "t.test")
Note that the top argument specifies either the names or the number of features to supply to the feature selection method, not what the user intends to retrieve from the feature selection method. When calling the first feature selection method (or the first build method, if skipping feature selection), a numeric
top argument will select a "top ranked" feature set according to their default order in the
ExprsArray input. Then, because each feature selection method returns an
ExprsArray object with the features implicitly (re-)ranked, all subsequent numeric
top arguments will select a "top ranked" feature set according to the results of the previous feature selection method. For example, the third feature selection call draws the top features from the second feature ranking. The user may deploy, in tandem, any number of these functions in whatever order they choose.
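As a quick sketch of this chaining behavior, consider two back-to-back calls to fsStats; here we assume "ks.test" is the string that selects the Kolmogorov-Smirnov option named above (see ?fsStats):

tmp <- fsStats(array.train, top = 0, how = "t.test")  # rank all features
tmp <- fsStats(tmp, top = 20, how = "ks.test")        # re-rank only the top 20 features from the previous call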
Another feature selection function is
fsPrcomp. This performs dimension reduction by way of principal components analysis (PCA). Like the feature selection steps, all dimension reduction models get saved in the
ExprsArray history to deploy later on a test set. Below, we use the top 50 features (as selected by
fsStats) for PCA.
array.train <- fsPrcomp(array.train, top = 50)
The other feature selection methods included in this package all follow the same use pattern. Below, we plot the first three components of the training set in 3-dimensional space.
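The plotting call itself did not survive in this text; exprso provides a plot method for ExprsArray objects, so a minimal sketch (assuming the default method displays the leading dimensions) would be:

plot(array.train)  # 3-dimensional scatterplot of the first three components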
This package provides functions for several supervised machine learning methods, including support vector machines, artificial neural networks, random forests, and more. These functions require an
ExprsArray object as input and return an
ExprsModel object as output. This
ExprsModel object contains the feature selection history that led up to classifier construction as well as the classifier itself. Below, we build an artificial neural network with five intermediate nodes in the hidden layer using the top 10 components from the training set above.
mach <- buildANN(array.train, top = 10, size = 5)
We deploy an
ExprsModel object using
predict. This function returns an
ExprsPredict object containing the prediction results in three forms: a class prediction, a probability, and a decision boundary value. The probability and decision boundary relate to one another by a logistic transformation. The prediction (
@pred) slot converts these metrics into a single "all-or-nothing" class label assignment.
Another function, calcStats, allows us to compare the prediction results against the actual class labels. The
aucSkip argument specifies whether to calculate the area under the receiver operating characteristic (ROC) curve. Note, however, that performance metrics calculated using the ROC curve may differ from those calculated using a confusion matrix because the former may adjust the discrimination threshold to optimize sensitivity and specificity. The discrimination threshold is automatically chosen as the point along the ROC curve which minimizes the Euclidean distance from (0, 1). Below, we deploy a classifier on the test set, then use the result to calculate classifier performance.
pred <- predict(mach, array.test)
calcStats(pred)
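To inspect the three forms directly, one can access the object slots. Only @pred is documented above; @px and @dv are our assumed slot names for the probability and decision boundary values:

pred@pred  # "all-or-nothing" class label assignments
pred@px    # class probabilities (assumed slot name)
pred@dv    # decision boundary values (assumed slot name)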
This package includes several functions named with the prefix
pl. These "pipeline" functions exist to help with high-throughput learning. In other words, they wrap repetitive tasks into a single call. This includes extensive parameter searches as well as some elaborate cross-validation. Some of these
pl functions can even have other
pl functions embedded within them. For example, the function
plGrid contains the function
plCV for managing simple $v$-fold and leave-one-out cross-validation.
When constructing a classifier using a build method, we can only specify one set of parameters at a time. However, we often want to test models across a vast range of parameters. For this task, we provide the
plGrid function. This function builds and deploys a model for each combination of all provided arguments. For example, calling
plGrid with the arguments
how = "buildSVM",
top = c(3, 5, 10),
cost = 10^(-3:3), and
kernel = c("linear", "radial") will yield 42 classifiers.
We note here that this function only accepts one
how per run. To analyze the results of multiple
build parameter searches jointly, combine the results of multiple
plGrid function calls using
conjoin (see ?conjoin). We will also note that
plGrid does not execute any data splitting or feature selection, both of which the user may perform beforehand. However,
plGrid does allow the user to specify multiple classifier sizes by providing a numeric vector as the top argument. The
plGrid function can also calculate $v$-fold cross-validation accuracy at each step of the parameter search (toggled by supplying a non-
NULL argument to
fold). We emphasize, however, that the cross-validation method embedded within
plGrid (i.e., plCV) does not re-select features with each fold, which may lead to overly-optimistic measures of classifier performance in the setting of prior feature selection.
Below, we run through a few different support vector machine builds, calculating leave-one-out cross-validation accuracy (i.e., via
fold = 0) at each step.
gs <- plGrid(array.train = array.train,
             array.valid = array.test,
             top = c(2, 4),
             how = "buildSVM",
             fold = 0,
             kernel = "linear",
             cost = 10^(-3:3))
The returned object contains two slots,
@summary and @machs, which store the performance summary and the corresponding
ExprsModel objects, respectively. The performance summary contains columns detailing the parameters used to build each machine along with performance metrics for the training set (and test set, if provided). Columns named with "train" describe training set performances. Columns named with "valid" describe test set performances. The column,
"train.plCV", contains the cross-validation accuracy, if performed. The returned
ExprsPipeline object also contains an
ExprsModel object for each entry in the performance summary.
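For example, one can inspect both slots of the object returned above:

head(gs@summary)  # build parameters plus "train" and "valid" performance columns
gs@machs[[1]]     # the ExprsModel corresponding to the first summary row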
To subset an
ExprsPipeline object, we provide methods for the
[ and $ operators that access the @summary performance summary slot directly. Alternatively, one could use the pipeSubset or subset functions.
sub <- gs[gs$cost == 1, ]
sub <- pipeSubset(gs, colBy = "cost", include = 1)
sub <- subset(gs, subset = gs$cost == 1)
The exprso package also provides a means by which to perform elaborate cross-validation, including Monte Carlo style and 2-layer "nested" cross-validation. Analogous to how
plGrid manages multiple build and predict tasks, these pipelines (i.e.,
plMonteCarlo and plNested) effectively manage multiple
plGrid tasks. In order to organize the sheer number of arguments necessary to execute these functions, we have implemented argument handler functions (i.e.,
ctrlSplitSet, ctrlFeatureSelect, and ctrlGridSearch) that handle data splitting, feature selection, and grid searching, respectively.
In simplest terms,
plMonteCarlo and plNested use a single training set to calculate classifier performances on a withheld internal validation set. This internal validation set serves as a kind of proxy for a statistically independent test set. The main difference between plMonteCarlo and plNested stems from how the internal validation set gets constructed. On one hand, the
plMonteCarlo method uses the
ctrlSplitSet argument handler to split the training set into a training subset and an internal validation set with each bootstrap. On the other hand, the
plNested method splits the training set into $v$-folds, treating each fold as an internal validation set while treating those outside that fold as the training subset.
For clarity, we call any performance measured on an internal validation set the outer-loop cross-validation performance and any cross-validation accuracy measured using the training subset (i.e., via
plGrid) the inner-loop cross-validation performance. In the performance summaries of the
ExprsPipeline objects returned by
plMonteCarlo and plNested, columns named with "train" describe training subset performances while columns named with "valid" describe internal validation set performances. Although the inner-loop cross-validation performances (i.e., via plCV) can still over-estimate performance owing to prior feature selection, the outer-loop cross-validation performances derive from classifiers that have undergone feature selection anew with each bootstrap or fold. However, we emphasize here that performing feature selection on a training set prior to the use of plMonteCarlo or plNested can still result in overly optimistic outer-loop cross-validation performances.
In the example below, we perform five iterations of
plMonteCarlo using the original training set as it existed before it underwent any feature selection (i.e., the first slot of the object
arrays). With each iteration, we (1) sample the subjects randomly through bagging (i.e., random sampling with replacement), (2) perform feature selection using the Student's t-test, and then (3) execute a grid-search across multiple support vector machine parameters and classifier sizes. In this framework, the user could instead perform any number of feature selection tasks simply by supplying a list of multiple
ctrlFeatureSelect argument handlers to the
ctrlFS argument below.
ss <- ctrlSplitSet(func = "splitSample", percent.include = 67, replace = TRUE)
fs <- ctrlFeatureSelect(func = "fsStats", top = 0, how = "t.test")
gs <- ctrlGridSearch(func = "plGrid", how = "buildSVM", top = c(2, 4),
                     kernel = "linear", cost = 10^(-3:3), fold = 10)
boot <- plMonteCarlo(arrays[[1]], B = 5, ctrlSS = ss, ctrlFS = fs, ctrlGS = gs)
Next, we reduce the results of
plMonteCarlo to a single performance metric by feeding the returned
ExprsPipeline object through
calcMonteCarlo. Note that this helper function will fail unless
plGrid has called
plCV during the parameter grid-search.
calcMonteCarlo(boot, colBy = "valid.auc")
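The nested analogue follows the same pattern. Below is a sketch only: we assume plNested accepts the same argument handlers along with a fold argument, and that calcNested serves as its companion summary function (see ?plNested):

nest <- plNested(arrays[[1]], fold = 10, ctrlFS = fs, ctrlGS = gs)  # argument names assumed analogous to plMonteCarlo
calcNested(nest, colBy = "valid.auc")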
This package provides two ways to build ensemble classifiers. The first involves manually combining multiple
ExprsModel objects together through the conjoin function (see ?conjoin). The second involves an orchestrated manipulation of an ExprsPipeline object through the pipeFilter and buildEnsemble functions.
This latter approach filters an
ExprsPipeline object in (up to) three steps. First, a threshold filter gets imposed, whereby any model with a performance less than the threshold filter,
how, gets excluded. Second, a ceiling filter gets imposed, whereby any model with a performance greater than the ceiling filter,
gate, gets excluded. Third, an arbitrary subset occurs, whereby the top N models in the
ExprsPipeline object get selected based on the argument
top. In the case that the
@summary slot contains the column "boot" (e.g., in the results of
plMonteCarlo), pipeFilter selects the top N models for each unique bootstrap. The user may skip any one of these three filter steps by setting the respective argument to 0.
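As a sketch of these three steps (the cutoff values here are arbitrary):

filtered <- pipeFilter(boot, colBy = "valid.auc",
                       how = 0.5,   # threshold filter: exclude models performing below 0.5
                       gate = 0.9,  # ceiling filter: exclude models performing above 0.9
                       top = 1)     # keep the top model (per bootstrap, given the "boot" column)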
When calling the
buildEnsemble method for an
ExprsPipeline object, any classifiers remaining after the
pipeFilter filter will get assembled into a single ensemble classifier. Ensemble classifiers get stored as an
ExprsEnsemble object, which is simply a container for a list of multiple ExprsModel objects.
In the example below, we will build an ensemble using the single best classifier from each
plMonteCarlo bootstrap, and then deploy that ensemble on the withheld test set from above.
ens <- buildEnsemble(boot, top = 1, colBy = "valid.auc")
pred <- predict(ens, array.test, how = "majority")
Owing to how the
pipeFilter function handles
ExprsPipeline objects that contain a "boot" column in the performance summary (i.e.,
@summary), we include the
pipeUnboot function to rename this "boot" column to "unboot". To learn more about how
ExprsEnsemble predicts class labels, we refer the user to the documentation,
?'exprso-predict'. In addition, we encourage the user to explore the rest of the package documentation.
We conclude this vignette by alerting the user that the exprso package also includes a framework for performing multi-class classification in an automated manner. These methods use the "1-vs-all" approach to multi-class classification, whereby each individual class label has a turn getting treated as the positive class label in a dichotomous (binary) scheme. Then, the results of each iteration get integrated into a single construct. To learn more about multi-class classification, we refer the user to the documentation for
?doMulti and the companion vignette, "Advanced Topics for the exprso package".
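For example, importing all three iris species yields a multi-class object; here we assume the exprso function dispatches on the number of outcome levels:

array3 <- exprso(iris[, 1:4], iris[, 5])  # three class labels, handled as ExprsMulti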
The exprso package framework also extends to building and deploying regression models to predict continuous outcomes. All pipeline and ensemble methods discussed here also apply to regression, although some
build methods work only for classification.
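As a minimal sketch, assuming the exprso importer accepts a numeric outcome to trigger the regression framework:

reg <- exprso(iris[, 2:4], iris[, 1])  # a continuous outcome (Sepal.Length) signals regression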
Thank you for your interest in exprso. Although we have made tremendous progress in formalizing this library in a reliable package framework, some of the tools included here may change. To the best of our knowledge, we have followed the machine learning "best practices" when developing this software, but if you know better than us, please let us know! File any and all issues at GitHub. In addition, we always welcome suggestions for new tools that we could include in future releases. Happy learning!