Foreword

Although this vignette contains a lot of exciting stuff, to get the most out of it, we recommend first reading the companion vignette, "An Introduction to the exprso Package", which introduces many of the core features applied here.

Tidy learning

All exprso modules support piping with the magrittr package. Piping is handled by two key functions, %>% and %T>%, each of which passes a value along to the first argument of the next function call. They differ in what they pass on: %>% forwards the result of the function it calls, whereas %T>% "branches out", calling the next function only for its side effect and passing along the original value instead. This makes %T>% useful for "side-chain" tasks like plotting (a toy illustration follows the data-loading step below). First, let us load some data.

set.seed(1)
library(exprso)
library(magrittr)
data(iris)
# build an ExprsArray object from the first two iris species:
# columns 1-4 are the feature variables, column 5 is the class label
array <- exprso(iris[1:100, 1:4], iris[1:100, 5])
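
As a quick aside, here is a toy illustration of the difference between the two pipes, using only base R (nothing exprso-specific):

# %T>% calls plot() for its side effect only; the original vector,
# not plot()'s return value, continues down the chain
c(1, 5, 10) %T>%
  plot %>%   # draws an index plot of the vector
  mean       # returns the mean of the original vector, 5.33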

Below, we use the %>% function to pre-process the data and split it into a training and test set. Since the data object forks at the level of the split method (yielding two ExprsArray objects from one), it makes sense to break the pipe cascade there.

splitSets <- array %>%
  modTransform %>% modNormalize %>%
  splitSample(percent.include = 67)
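
If you want to confirm how the subjects were divided, a quick check of the annotation slots works. This sketch assumes the usual ExprsArray layout, with one row per subject in the @annot slot.

# Sanity check on the split (assumes @annot stores one row per subject)
nrow(trainingSet(splitSets)@annot) # roughly 67% of the subjects
nrow(testSet(splitSets)@annot)     # the remainder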

Next, we use the %>% function to pull the training set from the split method result (via the trainingSet function) and pipe it through a chain of feature selection and classifier construction methods. Similar to trainingSet, the testSet function (or, equivalently, the validationSet function) will extract the test set from the split method result.

pred <- trainingSet(splitSets) %>%
  fsStats(how = "t.test") %>%
  fsPrcomp(top = 2) %T>%
  plot(c = 0) %>%
  buildSVM %>%
  predict(testSet(splitSets)) %T>%
  calcStats

Piping can expedite ensemble classifier construction as well. Here, we use the %>% function in conjunction with plMonteCarlo to (a) split the training set across 10 bootstraps, (b) perform a feature selection on each training subset, (c) construct an LDA classifier, and (d) deploy the classifier on an internal validation set. Then, we select the best three performing classifiers, regardless of the bootstrap origin, by passing the results through pipeUnboot and pipeFilter (see ?pipeUnboot and ?pipeFilter to learn more about how the "boot" column changes pipeFilter behavior). Last, we build a classifier ensemble and deploy it on the test set. For code clarity, we define the argument handler functions ctrlSplitSet, ctrlFeatureSelect, and ctrlGridSearch outside of the pipe cascade.

ss <- ctrlSplitSet(func = "splitSample", percent.include = 67, replace = TRUE)
fs <- ctrlFeatureSelect(func = "fsStats", top = 0)
gs <- ctrlGridSearch(func = "plGrid", top = 0, how = "buildLDA")

pred <- trainingSet(splitSets) %>%
  plMonteCarlo(B = 10, ctrlSS = ss, ctrlFS = fs, ctrlGS = gs) %>%
  pipeUnboot %>%
  pipeFilter(colBy = "valid.auc", top = 3) %>%
  buildEnsemble %>%
  predict(testSet(splitSets)) %T>%
  calcStats

Clustering before classifying

The exprso package also includes an experimental function, modCluster, that clusters subjects prior to building models. This function uses the how argument to select one of seven clustering algorithms, and returns an ExprsArray object whose @annot slot is updated with the clustering results.

cluster1 <- trainingSet(splitSets) %>%
  modCluster(top = 0, how = "hclust", k = 4) %>%
  modSubset(colBy = "cluster", include = 1) # keep only subjects assigned to cluster 1
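
To see how the clusters line up with the class labels before subsetting, a quick cross-tabulation helps. The sketch below assumes that modCluster writes a "cluster" column into @annot alongside the existing "defineCase" column.

# Hedged sketch: re-run the clustering step without the subset, then
# cross-tabulate cluster membership against the class labels
# (assumes modCluster adds a "cluster" column to the @annot slot)
clustered <- trainingSet(splitSets) %>%
  modCluster(top = 0, how = "hclust", k = 4)
table(clustered@annot$cluster, clustered@annot$defineCase)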

Next, we show how to make a custom training set based on a cluster of cases and all controls. For this, we use modCluster in conjunction with the conjoin function. Note that using conjoin after feature selection will throw an error. Although we cluster cases here, this technique would work for any data annotation.

clusteredCases <- trainingSet(splitSets) %>%
  modSubset(colBy = "defineCase", include = "Case") %>%
  modCluster %>%
  modSubset(colBy = "cluster", include = 1) %>%
  conjoin(trainingSet(splitSets) %>%
            modSubset(colBy = "defineCase", include = "Control"))
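
The conjoined object behaves like any other training set, so it can flow through the same feature selection and classification pipeline shown earlier. For example:

# The custom training set plugs into the usual pipeline
pred <- clusteredCases %>%
  fsStats(how = "t.test") %>%
  buildSVM %>%
  predict(testSet(splitSets)) %T>%
  calcStats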

Deep learning

Deep learning in exprso does not differ much from the other approaches to classification. However, supplying arguments can get cumbersome.

pred <- trainingSet(splitSets) %>%
  buildDNN(top = 0,
           activation = "TanhWithDropout", # or 'Tanh'
           input_dropout_ratio = 0.2, # % of inputs dropout
           hidden_dropout_ratios = c(0.5,0.5,0.5), # % for nodes dropout
           balance_classes = TRUE,
           hidden = c(50,50,50), # three layers of 50 nodes
           epochs = 100) %>%
  predict(testSet(splitSets)) %T>%
  calcStats

One important difference with buildDNN is that you must manually clear old classification models from RAM. Unlike other models, the ExprsModel object does not actually store the deep neural net; rather, it holds a "link" to the actual classifier, which h2o stores outside of R.

h2o::h2o.shutdown() # frees up RAM for more learning

When embedding buildDNN within a grid-search, we run into the difficulty that most buildDNN arguments require a numeric vector as input. These vector inputs typically correspond to a unique value for each layer of the deep neural net. We can provide plGrid a vector argument by wrapping it in a list. Note that wrapping argument vectors in a list also works for other arguments (e.g., character vectors passed to top).

pl <- trainingSet(splitSets) %>%
  plGrid(array.valid = testSet(splitSets), top = 0,
         how = "buildDNN", fold = NULL,
         activation = "TanhWithDropout", # or 'Tanh'
         input_dropout_ratio = 0.2, # % of inputs dropout
         hidden_dropout_ratios = list(c(0.5,0.5,0.5)), # % for nodes dropout
         balance_classes = TRUE,
         hidden = list(c(50,50,50)), # three layers of 50 nodes
         epochs = 100)

Below, we show a more elaborate deep learning grid-search; plGrid will train one model for each combination of the supplied argument values. For details on these arguments, see ?h2o::h2o.deeplearning.

pl <- trainingSet(splitSets) %>%
  plGrid(array.valid = testSet(splitSets), top = 0,
         how = "buildDNN", fold = NULL,
         activation = c("Rectifier",
                        "TanhWithDropout"), # or 'Tanh'
         input_dropout_ratio = c(0.2,
                                 0.5,
                                 0.8), # % of inputs dropout
         hidden_dropout_ratios = list(c(0.5,0.5,0.5),
                                      c(0.2,0.2,0.2)), # % for nodes dropout
         balance_classes = TRUE,
         hidden = list(c(50,50,50),
                       c(100,100,100),
                       c(200,200,200)), # candidate architectures, three layers each
         epochs = c(100))

Keep in mind that deep learning is a very RAM-hungry task. If you're not careful, you will run out of RAM and trigger an error. Remember to call h2o::h2o.shutdown() whenever you finish!

Perfect cross-validation

The "perfect" cross-validation pipeline would have two layers of cross-validation such that the outer-layer divides the data without feature selection while the inner-layer divides the data with feature selection. We can achieve this in exprso by embedding a plNested pipeline within another plNested pipeline.

If you use this approach, you should not need a test set as long as you calculate classification accuracy in a way that respects the independence of each fold. In other words, if you opt out of a test set, you must never let a validation set accuracy guide the selection of which training sets to use when calculating the final classification accuracy.

For illustrative purposes, we perform "perfect" cross-validation using support vector machines built across a single set of parameters. Extending this pipeline to a larger parameter grid-search will require a cautious analysis of the results.

fs.inner <- ctrlFeatureSelect(func = "fsStats", top = 0, how = "t.test")
gs.inner <- ctrlGridSearch(func = "plGrid", top = 3,
                           how = "buildSVM", fold = NULL)

fs.outer <- ctrlFeatureSelect(func = "fsNULL", top = 0)
gs.outer <- ctrlGridSearch(func = "plNested", fold = 2,
                           ctrlFS = fs.inner, ctrlGS = gs.inner)

pl <- array %>%
  modTransform %>% modNormalize %>%
  plNested(fold = 2, ctrlFS = fs.outer, ctrlGS = gs.outer)
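
To see what the nested pipelines collected, one can inspect the returned pipeline object. The sketch below is an assumption about the object's layout: it presumes an @summary slot holding one row of performance metrics per model, the same columns that pipeFilter ranks by in the next section.

# Hedged sketch: peek at the per-model performance table
# (assumes the pipeline result exposes it via an @summary slot)
head(pl@summary)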

Cross-validation variants

Typically, we summarize rounds of cross-validation using accuracy. However, we could conceive of situations where we might want to emphasize sensitivity over specificity (or vice versa). Below, we show how we can use plNested in lieu of plCV to select plMonteCarlo bootstraps based on sensitivity.

In this example, each iteration of plMonteCarlo will split the dataset, then call plNested on the training subset. Next, plNested will manage $v$-fold cross-validation, splitting the data into $v$ equal folds. Finally, each fold will undergo a grid-search according to plGrid. Since we have chosen to use plNested in lieu of plCV, we disable plCV by setting the plGrid argument fold = NULL. Note that, as above, we only perform feature selection within the inner-layer. This ensures that the outer-layer serves as a truly independent validation set.

fs.inner <- ctrlFeatureSelect(func = "fsStats", top = 0, how = "t.test")
gs.inner <- ctrlGridSearch(func = "plGrid", top = c(2, 3, 4),
                           how = "buildSVM", fold = NULL)

ss.outer <- ctrlSplitSet(func = "splitStratify", percent.include = 67)
fs.outer <- ctrlFeatureSelect(func = "fsNULL", top = 0)
gs.outer <- ctrlGridSearch(func = "plNested", fold = 10,
                           ctrlFS = fs.inner, ctrlGS = gs.inner)

pl <- array %>%
  modTransform %>% modNormalize %>%
  plMonteCarlo(B = 5, ctrlSS = ss.outer, ctrlFS = fs.outer, ctrlGS = gs.outer)

The resultant object now contains the necessary information to rank plMonteCarlo bootstraps based on $v$-fold cross-validation sensitivity or specificity. Alternatively, we could aggregate the results by selecting the best fold from each bootstrap using pipeFilter, emphasizing sensitivity over specificity with colBy.

top <- pipeFilter(pl, colBy = c("valid.sens", "valid.sens", "valid.spec"),
                  top = 1) # listing "valid.sens" twice emphasizes sensitivity over specificity
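
From here, the filtered models can feed buildEnsemble, exactly as in the earlier plMonteCarlo example, before deployment on a genuinely independent dataset.

# Combine the top-ranked models into a single ensemble classifier
ens <- buildEnsemble(top)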

