knitr::opts_chunk$set(
  collapse = TRUE, message=FALSE, warning=FALSE,
  comment = "#>"
)
knitr::opts_chunk$set(fig.width=12, fig.height=8)
knitr::opts_chunk$set(tidy.opts=list(width.cutoff=150),tidy=TRUE)
options(rgl.useNULL = TRUE)
options(warn=-1)
suppressMessages(library(dplyr))
set.seed(1)
options(knitr.table.format = "html")
library(OmicSelector)

Outline

One of the most important features of OmicSelector is the ability to develop deep learning models. As OmicSelector focuses on biomarker feature selection and model development, it is one of the few solutions for developing deep feedforward neural networks (up to 3 hidden layers) with or without a (sparse) autoencoder. OmicSelector provides both the framework and a graphical interface to develop the best artificial neural network for molecular, laboratory, and clinical data. Please note, however, that OmicSelector was not designed to handle images or DICOM files. Multiple alternatives exist for imaging data.

Our solution's primary purpose is to develop the best classification tool when only a limited number of samples is available (e.g., expression data, where cost limits the sample size). In this setting, the researcher needs a classifier that is resilient to overfitting.

This extension provides a unified pipeline that uses TensorFlow through Keras to create feedforward neural networks. Using OmicSelector, however, doesn't require any knowledge of those technologies.

The extension needs to be loaded using:

OmicSelector::OmicSelector_load_extension("deeplearning")

This function loads the latest version of the extension from GitHub: https://github.com/kstawiski/OmicSelector/blob/master/extensions/deeplearning.R

OmicSelector_deep_learning function

This extension requires three datasets: training, testing, and validation datasets, as in Benchmarking (in standard OmicSelector pipeline). Those can be prepared using OmicSelector_prepare_split() or designed manually. The primary function for the training of neural networks is called OmicSelector_deep_learning(). This function aims to train several neural networks, test their performance, save models, and provide a general overview of the whole modeling.

OmicSelector_deep_learning() is defined with the following parameters:

OmicSelector_deep_learning = function(selected_miRNAs = ".",
                                      wd = getwd(),
                                      SMOTE = F,
                                      keras_batch_size = 64,
                                      clean_temp_files = T,
                                      save_threshold_trainacc = 0.85,
                                      save_threshold_testacc = 0.8,
                                      keras_epochae = 5000,
                                      keras_epoch = 2000,
                                      keras_patience = 50,
                                      hyperparameters = expand.grid(
                                        layer1 = seq(3, 11, by = 2),
                                        layer2 = c(0, seq(3, 11, by = 2)),
                                        layer3 = c(0, seq(3, 11, by = 2)),
                                        activation_function_layer1 = c("relu", "sigmoid"),
                                        activation_function_layer2 = c("relu", "sigmoid"),
                                        activation_function_layer3 = c("relu", "sigmoid"),
                                        dropout_layer1 = c(0, 0.1),
                                        dropout_layer2 = c(0, 0.1),
                                        dropout_layer3 = c(0),
                                        layer1_regularizer = c(T, F),
                                        layer2_regularizer = c(T, F),
                                        layer3_regularizer = c(T, F),
                                        optimizer = c("adam", "rmsprop", "sgd"),
                                        autoencoder = c(0, 7, -7),
                                        balanced = SMOTE,
                                        formula = as.character(OmicSelector_create_formula(selected_miRNAs))[3],
                                        scaled = c(T, F),
                                        stringsAsFactors = F
                                      ),
                                      add_features_to_predictions = F,
                                      keras_threads = ceiling(parallel::detectCores() /
                                                                2),
                                      start = 1,
                                      end = nrow(hyperparameters),
                                      output_file = "deeplearning_results.csv",
                                      save_all_vars = F)
OmicSelector_load_datamix()

General setup parameters:

Saving options:

A model is saved only if both its training accuracy is greater than save_threshold_trainacc and its testing accuracy is greater than save_threshold_testacc. This saves storage space by not keeping models that do not work. Please note, however, that the metrics of the models will still be saved in output_file for further analysis.
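The saving rule above can be sketched in a few lines of base R. This is only an illustration of the described condition: should_save_model() is a hypothetical helper and not a function exported by OmicSelector, and the default thresholds are simply copied from the signature shown earlier.

```r
# Hypothetical helper mirroring the saving rule described above;
# it is NOT part of OmicSelector.
should_save_model <- function(train_acc, test_acc,
                              save_threshold_trainacc = 0.85,
                              save_threshold_testacc = 0.8) {
  # Both accuracies must strictly exceed their thresholds.
  train_acc > save_threshold_trainacc & test_acc > save_threshold_testacc
}

should_save_model(0.90, 0.82)  # TRUE: both thresholds exceeded
should_save_model(0.90, 0.80)  # FALSE: testing accuracy is not above 0.8
```

Because the comparison is strict, a model that exactly matches a threshold would not be saved under this reading of the rule.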

Training control parameters:

OmicSelector trains a defined number of epochs (assuming that early stopping criteria are not met), but the final network is the one with the lowest validation loss.
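To make the relation between early stopping and the final network more concrete, here is a minimal base R simulation. It assumes that the patience value counts epochs without improvement in validation loss, which is the usual meaning in Keras but is not spelled out above. pick_best_epoch() is a hypothetical helper, not an OmicSelector or Keras function, and the loss values are made up.

```r
# Illustrative simulation of early stopping on per-epoch validation losses.
# `pick_best_epoch` is a hypothetical helper, not part of OmicSelector or Keras.
pick_best_epoch <- function(val_loss, patience = 50) {
  best_loss <- Inf
  best_epoch <- NA_integer_
  stopped_at <- length(val_loss)
  for (epoch in seq_along(val_loss)) {
    if (val_loss[epoch] < best_loss) {
      # New lowest validation loss: remember this epoch.
      best_loss <- val_loss[epoch]
      best_epoch <- epoch
    } else if (epoch - best_epoch >= patience) {
      # No improvement for `patience` epochs: stop training early.
      stopped_at <- epoch
      break
    }
  }
  list(best_epoch = best_epoch, stopped_at = stopped_at, best_loss = best_loss)
}

val_loss <- c(0.70, 0.55, 0.48, 0.50, 0.49, 0.51, 0.52)  # made-up values
res <- pick_best_epoch(val_loss, patience = 3)
res$best_epoch  # 3: the epoch whose weights would be kept
res$stopped_at  # 6: training stops after 3 epochs without improvement
```

In this toy run the training stops at epoch 6, but the network that would be kept is the one from epoch 3, where the validation loss was lowest.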

Hyperparameters (grid search):

The hyperparameters data frame contains the hyperparameter sets we want to check in the grid search for the best model. You can play with the following hyperparameters:

If you do not want to check all hyperparameter sets (all rows in hyperparameters dataset):

You can use default values if you don't know what to do. OmicSelector's GUI uses default values set by us.
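If only part of the grid is to be checked in one run, the start and end arguments from the signature above can be used. The sketch below shows one possible way to split a grid into consecutive batches of rows. It assumes that start and end are inclusive row indices of the hyperparameters data frame; make_batches() and toy_grid are hypothetical names introduced only for this illustration.

```r
# Toy grid standing in for the `hyperparameters` data frame.
toy_grid <- expand.grid(layer1 = c(3, 5, 7),
                        optimizer = c("adam", "sgd"),
                        stringsAsFactors = FALSE)

# Hypothetical helper: split row indices into consecutive (start, end) batches.
make_batches <- function(n_rows, batch_size) {
  starts <- seq(1, n_rows, by = batch_size)
  data.frame(start = starts,
             end = pmin(starts + batch_size - 1, n_rows))
}

batches <- make_batches(nrow(toy_grid), batch_size = 4)
batches
# Each row could then be passed as `start` and `end`,
# e.g. in separate R sessions or on separate machines.
```

With 6 rows and a batch size of 4, this yields two batches: rows 1 to 4 and rows 5 to 6.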

In OmicSelector's GUI we use 3 presets of hyperparameters:

balanced = F # not balanced
selected_miRNAs = c("hsa.a","hsa.b","hsa.c") # selected features
hyperparameters = expand.grid(
  layer1 = seq(2, 10, by = 1),
  layer2 = c(0),
  layer3 = c(0),
  activation_function_layer1 = c("relu", "sigmoid", "selu"),
  activation_function_layer2 = c("relu"),
  activation_function_layer3 = c("relu"),
  dropout_layer1 = c(0, 0.1),
  dropout_layer2 = c(0),
  dropout_layer3 = c(0),
  layer1_regularizer = c(T, F),
  layer2_regularizer = c(F),
  layer3_regularizer = c(F),
  optimizer = c("adam", "rmsprop", "sgd"),
  autoencoder = c(0, -7, 7),
  balanced = balanced,
  formula = as.character(OmicSelector_create_formula(selected_miRNAs))[3],
  scaled = c(T, F),
  stringsAsFactors = F
)
DT::datatable(hyperparameters,
         extensions = c('FixedColumns',"FixedHeader"),
          options = list(scrollX = TRUE,
                         paging=TRUE))
hyperparameters_part1 = expand.grid(
  layer1 = seq(2, 10, by = 1),
  layer2 = c(0),
  layer3 = c(0),
  activation_function_layer1 = c("relu", "sigmoid", "selu"),
  activation_function_layer2 = c("relu"),
  activation_function_layer3 = c("relu"),
  dropout_layer1 = c(0, 0.1),
  dropout_layer2 = c(0),
  dropout_layer3 = c(0),
  layer1_regularizer = c(T, F),
  layer2_regularizer = c(F),
  layer3_regularizer = c(F),
  optimizer = c("adam", "rmsprop", "sgd"),
  autoencoder = c(0),
  balanced = balanced,
  formula = as.character(OmicSelector_create_formula(selected_miRNAs))[3],
  scaled = c(T, F),
  stringsAsFactors = F
)
hyperparameters_part2 = expand.grid(
  layer1 = seq(3, 11, by = 2),
  layer2 = c(seq(3, 11, by = 2)),
  layer3 = c(seq(0, 11, by = 2)),
  activation_function_layer1 = c("relu", "sigmoid", "selu"),
  activation_function_layer2 = c("relu", "sigmoid", "selu"),
  activation_function_layer3 = c("relu", "sigmoid", "selu"),
  dropout_layer1 = c(0, 0.1),
  dropout_layer2 = c(0),
  dropout_layer3 = c(0),
  layer1_regularizer = c(T, F),
  layer2_regularizer = c(F),
  layer3_regularizer = c(F),
  optimizer = c("adam", "rmsprop", "sgd"),
  autoencoder = c(0),
  balanced = balanced,
  formula = as.character(OmicSelector_create_formula(selected_miRNAs))[3],
  scaled = c(T, F),
  stringsAsFactors = F
)
hyperparameters = rbind(hyperparameters_part1, hyperparameters_part2)
hyperparameters_part1 = expand.grid(
  layer1 = seq(2, 10, by = 1),
  layer2 = c(0),
  layer3 = c(0),
  activation_function_layer1 = c("relu", "sigmoid", "selu"),
  activation_function_layer2 = c("relu"),
  activation_function_layer3 = c("relu"),
  dropout_layer1 = c(0, 0.1),
  dropout_layer2 = c(0),
  dropout_layer3 = c(0),
  layer1_regularizer = c(T, F),
  layer2_regularizer = c(F),
  layer3_regularizer = c(F),
  optimizer = c("adam", "rmsprop", "sgd"),
  autoencoder = c(0, -7, 7),
  balanced = balanced,
  formula = as.character(OmicSelector_create_formula(selected_miRNAs))[3],
  scaled = c(T, F),
  stringsAsFactors = F
)
hyperparameters_part2 = expand.grid(
  layer1 = seq(3, 11, by = 2),
  layer2 = c(seq(3, 11, by = 2)),
  layer3 = c(seq(0, 11, by = 2)),
  activation_function_layer1 = c("relu", "sigmoid", "selu"),
  activation_function_layer2 = c("relu", "sigmoid", "selu"),
  activation_function_layer3 = c("relu", "sigmoid", "selu"),
  dropout_layer1 = c(0, 0.1),
  dropout_layer2 = c(0),
  dropout_layer3 = c(0),
  layer1_regularizer = c(T, F),
  layer2_regularizer = c(F),
  layer3_regularizer = c(F),
  optimizer = c("adam", "rmsprop", "sgd"),
  autoencoder = c(0, -7, 7),
  balanced = balanced,
  formula = as.character(OmicSelector_create_formula(selected_miRNAs))[3],
  scaled = c(T, F),
  stringsAsFactors = F
)
hyperparameters = rbind(hyperparameters_part1, hyperparameters_part2) 

Training with grid search

The structure and training of each network are defined by the current set of hyperparameters. The hyperparameters data frame defines which hyperparameter sets are to be checked in the training process. This is a grid search, so a network is fit for every set (row) of hyperparameters.
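Because every row is fit, the size of the grid grows multiplicatively with the number of values tried for each hyperparameter. The toy example below (with made-up values, much smaller than the defaults) shows how expand.grid() builds such a grid and how to check its size before starting the training.

```r
# A deliberately small grid to show how the number of networks grows.
toy_grid <- expand.grid(
  layer1 = c(3, 5),
  activation_function_layer1 = c("relu", "sigmoid"),
  optimizer = c("adam", "rmsprop", "sgd"),
  stringsAsFactors = FALSE
)

nrow(toy_grid)     # 2 * 2 * 3 = 12 networks would be trained
head(toy_grid, 3)  # the first column varies fastest
```

Checking nrow() first is a cheap way to estimate how long a full grid search may take.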

The function starts by looking for data in the working directory (defined as wd). It expects the files to be named mixed_train.csv (training set), mixed_test.csv (testing set), and mixed_valid.csv (validation set). Each file should contain a binary Class variable (with values Case and Control) and the features of interest (with names starting with the prefix hsa). Neural networks are trained with early stopping. Once a network is trained, the classification cutoff is chosen based on ROC analysis: we use the cutoff with the maximum value of the Youden index (Youden's J statistic = sensitivity + specificity - 1). If the predicted probability is greater than or equal to the cutoff, the sample is predicted as Case. Otherwise, it is considered to be a Control.
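The cutoff selection can be illustrated with a short base R sketch. It is not the code used inside OmicSelector: youden_cutoff() is a hypothetical helper, the probabilities and labels are made up, and every observed probability is simply tried as a candidate cutoff.

```r
# Illustrative sketch in base R; `youden_cutoff` is a hypothetical helper.
youden_cutoff <- function(prob, truth, positive = "Case") {
  cutoffs <- sort(unique(prob))  # candidate cutoffs: all observed probabilities
  j <- vapply(cutoffs, function(cut) {
    pred_pos <- prob >= cut      # "greater than or equal to" rule from the text
    sens <- sum(pred_pos & truth == positive) / sum(truth == positive)
    spec <- sum(!pred_pos & truth != positive) / sum(truth != positive)
    sens + spec - 1              # Youden's J statistic
  }, numeric(1))
  list(cutoff = cutoffs[which.max(j)], youden = max(j))
}

# Made-up predicted probabilities and true classes.
prob  <- c(0.10, 0.20, 0.35, 0.40, 0.60, 0.80, 0.90)
truth <- c("Control", "Control", "Control", "Case", "Control", "Case", "Case")

best <- youden_cutoff(prob, truth)
best$cutoff  # 0.4
best$youden  # 0.75 (sensitivity 1, specificity 0.75)

predicted <- ifelse(prob >= best$cutoff, "Case", "Control")
table(predicted, truth)
```

In this toy data the cutoff of 0.4 keeps all three cases above the threshold while misclassifying only one control.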

Let's create those files from TCGA data.

data("orginal_TCGA_data")
suppressWarnings(suppressMessages(library(dplyr)))
cancer_cases = filter(orginal_TCGA_data, primary_site == "Pancreas" & sample_type == "PrimaryTumor")
control_cases = filter(orginal_TCGA_data, sample_type == "SolidTissueNormal")
cancer_cases$Class = "Case"
control_cases$Class = "Control"
dataset = rbind(cancer_cases, control_cases)
ttpm = OmicSelector_counts_to_log10tpm(danex = dplyr::select(dataset, starts_with("hsa")),
                                metadane = dplyr::select(dataset, -starts_with("hsa")),
                                ids = dataset$sample, filtr = F,
                                filtr_minimalcounts = 1,
                                filtr_howmany = 0.01)
ttpm = as.data.frame(ttpm)
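zero_var = which(apply(ttpm, 2, var) == 0) # which features have no variance
if (length(zero_var) > 0) ttpm = ttpm[,-zero_var] # guard: negative indexing with an empty vector would drop all columns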
# Not run:
# DE = OmicSelector_differential_expression_ttest(ttpm_features = dplyr::select(dataset, starts_with("hsa")), classes = dataset$Class, mode = "logtpm")
# significant = DE$miR[DE$`p-value Bonferroni`<0.05]
selected_miRNAs = make.names(c('hsa-miR-192-5p', 'hsa-let-7g-5p', 'hsa-let-7a-5p', 'hsa-miR-194-5p', 'hsa-miR-122-5p', 'hsa-miR-340-5p', 'hsa-miR-26b-5p')) # some selected miRNAs
match(selected_miRNAs, colnames(ttpm))
#dataset = dataset[sample(1:nrow(dataset),200),] # sample 200 random cases to make it quicker
OmicSelector_table(table(dataset$Class), col.names = c("Class", "Number of cases"))
# For full analysis:
# merged = OmicSelector_prepare_split(metadane = dplyr::select(dataset, -starts_with("hsa")), ttpm = ttpm)

merged = OmicSelector_prepare_split(metadane = dplyr::select(dataset, -starts_with("hsa")), ttpm = dplyr::select(ttpm, dplyr::all_of(selected_miRNAs)))
knitr::kable(table(merged$Class, merged$mix))

Let's train just a few neural networks, using hyperparameter sets 5 to 10 of the default grid (for the sake of this quick tutorial):

OmicSelector_load_datamix()
SMOTE = F # use unbalanced set
deep_learning_results = OmicSelector_deep_learning(selected_miRNAs = selected_miRNAs, start = 5, end = 10) # use default set of hyperparameters and options
DT::datatable(deep_learning_results, 
          extensions = c('FixedColumns',"FixedHeader"),
           options = list(scrollX = TRUE, 
                          paging=FALSE,
                          fixedHeader=TRUE))
deep_learning_results = data.table::fread("deeplearning_results.csv")
DT::datatable(deep_learning_results, 
          extensions = c('FixedColumns',"FixedHeader"),
           options = list(scrollX = TRUE, 
                          paging=TRUE
                          ))

Deep learning results were also saved as deeplearning_results.csv. This file contains the following variables:

We prefer to choose the best model based on the highest metaindex value. Metaindex is the average of all accuracy metrics (on training, testing, and validation sets), but the final decision is arbitrary.
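Ranking models by such an averaged index could look like the sketch below. The column names and values are hypothetical and do not reproduce the actual columns of deeplearning_results.csv, and only one accuracy metric per set is averaged here, which may be fewer metrics than the real metaindex uses.

```r
# Made-up results table; column names are hypothetical,
# not the real columns of deeplearning_results.csv.
results <- data.frame(
  model_id  = c("m1", "m2", "m3"),
  train_acc = c(0.95, 0.88, 0.91),
  test_acc  = c(0.80, 0.86, 0.84),
  valid_acc = c(0.78, 0.85, 0.86),
  stringsAsFactors = FALSE
)

# Average the chosen metrics row by row and pick the highest value.
metric_cols <- c("train_acc", "test_acc", "valid_acc")
results$metaindex_sketch <- rowMeans(results[, metric_cols])
best <- results[which.max(results$metaindex_sketch), ]
best$model_id  # "m3"
```

Note that m1 has the highest training accuracy but the lowest average, which is the kind of overfitted model an averaged index is meant to penalize.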

Utilizing the networks for predictions

Deep neural networks created using OmicSelector can be used for prediction on new datasets using:

OmicSelector_deep_learning_predict = function(model_path = "our_models/model5.zip",
                                              new_dataset = data.table::fread("Data/ks_data.csv"),
                                              new_scaling = F,
                                              old_train_csv_to_restore_scaling = NULL,
                                              override_cutoff = NULL,
                                              blinded = F)

Input parameters:

Output (list):

You can see the function working in the scoring tool inside OmicSelector's GUI. If you are interested in code, look here.

Session

sessionInfo()
packageDescription("OmicSelector")

To render this tutorial we used:

rmarkdown::render("DeepLearningTutorial.Rmd", output_file = "DeepLearningTutorial.html", output_dir = "../inst/doc/")

Packages installed in our Docker environment:

OmicSelector_table(as.data.frame(installed.packages()))

Clean the temporary and model files (as the tutorial results are simplified, and we do not need them).

unlink("temp", recursive=TRUE)
unlink("models", recursive=TRUE)
unlink("task.log")
unlink("mixed*.csv")


kstawiski/OmicSelector documentation built on April 10, 2024, 11:11 p.m.