fit_nnc: Train a fully-connected multi-class neural network

View source: R/fit_predict_nnc.R

fit_nnc    R Documentation

Train a fully-connected multi-class neural network

Description

This function first splits the data into training and validation sets, tunes hyperparameters on that split using Bayesian optimization (similar to the approach used in Jiao et al. 2020), and then uses the best hyperparameters found to train a final model on the entire dataset.
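The hold-out step described above can be illustrated with a minimal base R sketch. This is not the package's internal code: the toy sample size n, the use of round(), and the name ind_train are assumptions made for illustration, and only val_split and ind_val mirror names that appear in this documentation.

```r
# Illustrative sketch only; fit_nnc() may draw its validation set differently.
set.seed(1)
n <- 12            # toy number of observations (assumption)
val_split <- 1/3   # same default as in the Usage section

# Draw the validation indices; all remaining rows form the training set
ind_val <- sort(sample(seq_len(n), size = round(val_split * n)))
ind_train <- setdiff(seq_len(n), ind_val)

# Hyperparameters would be compared by fitting on ind_train and scoring
# on ind_val; the final model is then refit on all n observations.
```

With these toy values, four observations are held out for validation and eight remain for training.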

Usage

fit_nnc(
  X,
  Y,
  val_split = 1/3,
  trials = 200,
  epochs = 50,
  batch_size = 128,
  verbose_mbo = T,
  seed = 1
)

fit_nn(
  X,
  Y,
  val_split = 1/3,
  trials = 200,
  epochs = 50,
  batch_size = 128,
  verbose_mbo = T,
  seed = 1
)

Arguments

X

Design matrix with observations in rows and predictors in columns. For a typical hidden genome classifier, each row represents a tumor and the columns hold binary (1/0) presence/absence indicators of raw variants, counts of mutations in specific genes, counts of mutations attributed to specific mutation signatures, and so on, possibly normalized by some function of the total mutation burden of the tumor.

Y

character vector or factor denoting the cancer type of tumors whose mutation profiles are listed across the rows of X.

val_split

Fraction of the data held out as the validation set for hyperparameter tuning

trials

Number of trials for hyperparameter tuning

epochs

Number of training epochs

verbose_mbo

Bayesian optimization verbosity mode (logical)

seed

Random seed

...

Unused

Value

Object of class "nn", a named list of length 7 with the following components of the neural network training process:

X

Input matrix

Y

Response vector

map_df

Data frame with columns "original" and "numeric". The "original" column contains the original class names in Y, and the "numeric" column contains the numeric representation of those classes used during training
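As a rough illustration of what such a mapping looks like, the base R sketch below builds a two-column table from a toy response vector and uses it to convert labels in both directions. The toy labels, the alphabetical ordering of classes, and the zero-based coding are assumptions for illustration; the actual codes stored by fit_nnc may differ.

```r
# Illustrative sketch only; not the package's internal code.
Y <- c("lung", "breast", "lung", "colon")   # toy cancer labels (assumption)

classes <- sort(unique(Y))
map_df <- data.frame(
  original = classes,
  numeric = seq_along(classes) - 1,   # assumed zero-based class codes
  stringsAsFactors = FALSE
)

# Encode labels as numbers for training
y_num <- map_df$numeric[match(Y, map_df$original)]

# Decode numeric classes back to the original labels
labels_back <- map_df$original[match(y_num, map_df$numeric)]
```

The same lookup, applied in the decoding direction, is how numeric predictions can be translated back into class names.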

model

Final Keras model trained on X and Y (see https://keras.rstudio.com/articles/about_keras_models.html for more details)

ind_val

Vector of indices of X corresponding to validation set used to tune hyperparameters

tuning_results

Named list with the results from the hyperparameter search (output of mbo() from mlrMBO). The list elements include "x", a named list with the best hyperparameters found, and "y", the validation accuracy corresponding to the best hyperparameters. See description of MBOSingleObjResult from mlrMBO for more details.

preproc

Named list with the parameters of the min-max pre-processing transformation applied to X prior to training (output of preProcess() from caret)
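For readers unfamiliar with min-max scaling, the base R sketch below shows the transformation itself: each column is shifted and scaled to the interval [0, 1] using the training minima and maxima, and the same stored parameters are reused for new data. It is written without caret so that it runs on its own; the toy matrix, the helper name minmax, and the guard for constant columns are assumptions for illustration, not the package's code.

```r
# Illustrative sketch only; fit_nnc() stores a caret preProcess object instead.
X <- matrix(c(0, 2, 4, 1, 1, 3), nrow = 3,
            dimnames = list(NULL, c("a", "b")))   # toy training matrix

# Parameters learned from the training data
col_min <- apply(X, 2, min)
col_max <- apply(X, 2, max)
rng <- col_max - col_min
rng[rng == 0] <- 1   # guard against division by zero for constant columns

# Hypothetical helper applying the stored parameters to any matrix
minmax <- function(M) sweep(sweep(M, 2, col_min, "-"), 2, rng, "/")

X_scaled <- minmax(X)                                # training data in [0, 1]
X_new_scaled <- minmax(matrix(c(1, 2), nrow = 1))    # new row, same parameters
```

Reusing the training parameters for new observations keeps training and prediction inputs on the same scale, which is why the transformation is returned with the fit.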

Note

  1. The function uses the packages keras and tensorflow for fitting neural networks, which require a Python environment in the backend. See the installation notes for the keras R package for more details.

  2. In addition to keras and tensorflow the function makes use of several functions from packages caret, mlrMBO, lhs, ParamHelpers, smoof, and mlr under the hood. These packages must be installed separately before using fit_nnc.
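A quick way to confirm that the R packages named in these notes are installed is sketched below. It only checks for the presence of each package in the library paths; it does not verify that a working Python backend for keras and tensorflow is configured. The variable names are illustrative.

```r
# Packages listed in the notes above
pkgs <- c("keras", "tensorflow", "caret", "mlrMBO",
          "lhs", "ParamHelpers", "smoof", "mlr")

# system.file() returns "" for packages that are not installed
installed <- vapply(
  pkgs,
  function(p) nzchar(system.file(package = p)),
  logical(1)
)

missing_pkgs <- pkgs[!installed]
if (length(missing_pkgs) > 0) {
  message("Missing packages: ", paste(missing_pkgs, collapse = ", "))
}
```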

Author(s)

Zoe Guan. Email: guanZ@mskcc.org

References

Jiao W, Atwal G, Polak P, Karlic R, Cuppen E, Danyi A, De Ridder J, van Herpen C, Lolkema MP, Steeghs N, Getz G. A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns. Nature Communications. 2020 Feb 5;11(1):1-2.

Examples

data("impact")
top_v <- variant_screen_mi(
  maf = impact,
  variant_col = "Variant",
  cancer_col = "CANCER_SITE",
  sample_id_col = "patient_id",
  mi_rank_thresh = 50,
  return_prob_mi = FALSE
)
var_design <- extract_design(
  maf = impact,
  variant_col = "Variant",
  sample_id_col = "patient_id",
  variant_subset = top_v
)

canc_resp <- extract_cancer_response(
  maf = impact,
  cancer_col = "CANCER_SITE",
  sample_id_col = "patient_id"
)
pid <- names(canc_resp)
# create five stratified random folds
# based on the response cancer categories
set.seed(42)
folds <- data.table::data.table(
  resp = canc_resp
)[,
  foldid := sample(rep(1:5, length.out = .N)),
  by = resp
]$foldid

# 80%-20% stratified separation of training and
# test set tumors
idx_train <- pid[folds != 5]
idx_test <- pid[folds == 5]

## Not run: 
# train a classifier on the training set
# using only variants (will have low accuracy
# -- no meta-feature information used)
fit0 <- fit_nnc(
  X = var_design[idx_train, ],
  Y = canc_resp[idx_train],
  trials = 10,
  epochs = 5
)

pred0 <- predict_nnc(
  fit = fit0,
  Xnew = var_design[idx_test, ]
)


## End(Not run)


c7rishi/hidgenclassifier documentation built on June 14, 2024, 11:10 a.m.