fit_nnc: Train a fully-connected multi-class neural network
In c7rishi/hidgenclassifier: Functions for Bayesian hierarchical hidden genome classifier

fit_nnc

R Documentation

Description

This function first splits the data into training and validation sets and tunes hyperparameters on the validation set using Bayesian optimization (similar to the approach used in Jiao et al. 2020). It then uses the best hyperparameters to train a final model on the entire dataset.
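The tune-then-refit pattern described above can be sketched in base R. This is only an illustration under stated assumptions: a k-nearest-neighbour classifier stands in for the neural network, a small grid search stands in for Bayesian optimization, and the helper `knn_predict` and the `iris` data are not part of the package.

```r
# Stand-in classifier (hypothetical helper): k-nearest neighbours, majority vote
knn_predict <- function(X_train, y_train, X_new, k) {
  apply(X_new, 1, function(x) {
    d <- colSums((t(X_train) - x)^2)      # squared distance to each training row
    nn <- order(d)[seq_len(k)]            # indices of the k closest training rows
    names(which.max(table(y_train[nn])))  # most frequent class among them
  })
}

set.seed(1)
X <- as.matrix(iris[, 1:4])
Y <- as.character(iris$Species)

# 1. Hold out a fraction of the rows for validation (cf. val_split = 1/3)
n <- nrow(X)
ind_val <- sample(n, size = floor(n / 3))

# 2. Score each candidate hyperparameter value on the validation rows
#    (grid search here; fit_nnc uses Bayesian optimization instead)
ks <- c(1, 3, 5, 7)
val_acc <- vapply(ks, function(k) {
  pred <- knn_predict(X[-ind_val, ], Y[-ind_val], X[ind_val, ], k)
  mean(pred == Y[ind_val])
}, numeric(1))
best_k <- ks[which.max(val_acc)]

# 3. Use the best setting together with all of the data
final_pred <- knn_predict(X, Y, X, best_k)
```

The validation rows are used only to choose the hyperparameter; the final step returns them to the training data, which mirrors the last step in the description.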

Usage

fit_nnc(
  X,
  Y,
  val_split = 1/3,
  trials = 200,
  epochs = 50,
  batch_size = 128,
  verbose_mbo = T,
  seed = 1
)

fit_nn(
  X,
  Y,
  val_split = 1/3,
  trials = 200,
  epochs = 50,
  batch_size = 128,
  verbose_mbo = T,
  seed = 1
)

Arguments

`X`	Design matrix with observations across rows and predictors across columns. For a typical hidden genome classifier, each row represents a tumor and the columns represent (possibly normalized by some function of the total mutation burden in tumors) binary 1-0 presence/absence indicators of raw variants, counts of mutations at specific genes, counts of mutations corresponding to specific mutation signatures, etc.
`Y`	character vector or factor denoting the cancer type of tumors whose mutation profiles are listed across the rows of `X`.
`val_split`	Fraction of the data to be used as the validation set for hyperparameter tuning
`trials`	Number of trials for hyperparameter tuning
`epochs`	Number of training epochs
`batch_size`	Batch size (number of samples per gradient update) used during training
`verbose_mbo`	Bayesian optimization verbosity mode (logical)
`seed`	Random seed
`...`	Unused
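As a rough illustration of the expected input shapes, the sketch below builds a toy `X` and `Y` by hand. The tumor IDs, feature names, and cancer types are hypothetical; in practice these objects would usually come from the package's design-extraction functions.

```r
# Hypothetical tumor IDs, used as row names of X and as names of Y
tumor_ids <- c("tumor_1", "tumor_2", "tumor_3", "tumor_4")

# matrix() fills column by column, so each line below is one predictor
X <- matrix(
  c(1, 0, 0, 1,   # variant_a: presence/absence indicator
    0, 1, 1, 0,   # variant_b: presence/absence indicator
    2, 0, 1, 3),  # gene_g_count: mutation count at a gene
  nrow = 4,
  dimnames = list(tumor_ids, c("variant_a", "variant_b", "gene_g_count"))
)

# Cancer type of each tumor, in the same order as the rows of X
Y <- c(tumor_1 = "lung", tumor_2 = "breast", tumor_3 = "breast", tumor_4 = "lung")

# Sanity checks worth running before fitting: one label per row, same order
stopifnot(nrow(X) == length(Y))
stopifnot(identical(rownames(X), names(Y)))
```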

Value

Object of class "nn": a named list of length 7 containing the components of the neural network training process:

`X`	Input matrix
`Y`	Response vector
`map_df`	Data frame with columns "original" and "numeric". The "original" column contains the original class names in Y, and the "numeric" column contains the numeric representation of the classes used during training.
`model`	Final Keras model trained on X and Y (see https://keras.rstudio.com/articles/about_keras_models.html for more details)
`ind_val`	Vector of indices of X corresponding to validation set used to tune hyperparameters
`tuning_results`	Named list with the results from the hyperparameter search (output of mbo() from mlrMBO). The list elements include "x", a named list with the best hyperparameters found, and "y", the validation accuracy corresponding to the best hyperparameters. See description of MBOSingleObjResult from mlrMBO for more details.
`preproc`	Named list with the parameters of the min-max pre-processing transformation applied to X prior to training (output of preProcess() from caret)
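To make the `map_df` and `preproc` components more concrete, here is a minimal base R sketch of a class-label map and a min-max transformation. The toy data, the zero-based numeric codes, and the helper `rescale` are assumptions made for illustration; the package itself obtains the scaling parameters from preProcess() in caret, and its exact coding of classes may differ.

```r
# Toy inputs; class names and values are made up for illustration
Y <- c("lung", "breast", "lung", "colon")
X <- matrix(c(0, 2, 4, 1,
              1, 1, 3, 5),
            ncol = 2,
            dimnames = list(NULL, c("gene_A", "gene_B")))

# Label map in the spirit of map_df: one row per class
# (zero-based codes are an assumption, not confirmed by the package docs)
classes <- sort(unique(Y))
map_df <- data.frame(
  original = classes,
  numeric = seq_along(classes) - 1L,
  stringsAsFactors = FALSE
)
Y_numeric <- map_df$numeric[match(Y, map_df$original)]

# Min-max scaling: store the column ranges once so that the same
# transformation can later be applied to new data
col_min <- apply(X, 2, min)
col_max <- apply(X, 2, max)
rescale <- function(M) {
  sweep(sweep(M, 2, col_min, "-"), 2, col_max - col_min, "/")
}
X_scaled <- rescale(X)
```

Storing the ranges from the training data, rather than recomputing them, is what allows new observations to be put on the same scale at prediction time.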

Note

The function uses the packages keras and tensorflow for fitting neural networks, which require a Python environment in the backend. See the installation notes for the keras R package for more details.
In addition to keras and tensorflow, the function makes use of several functions from the packages caret, mlrMBO, lhs, ParamHelpers, smoof, and mlr under the hood. These packages must be installed separately before using fit_nnc.
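One way to check for the packages listed above before calling fit_nnc is sketched below. It only tests whether each R package is installed; it does not verify that the Python backend needed by keras and tensorflow is configured.

```r
# Packages named in the note above
pkgs <- c("keras", "tensorflow", "caret", "mlrMBO",
          "lhs", "ParamHelpers", "smoof", "mlr")

# system.file() returns "" for a package that is not installed
installed <- vapply(
  pkgs,
  function(p) nzchar(system.file(package = p)),
  logical(1)
)
missing_pkgs <- pkgs[!installed]

if (length(missing_pkgs) > 0) {
  message("Install before using fit_nnc: ",
          paste(missing_pkgs, collapse = ", "))
}
```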

Author(s)

Zoe Guan. Email: guanZ@mskcc.org

References

Jiao W, Atwal G, Polak P, Karlic R, Cuppen E, Danyi A, De Ridder J, van Herpen C, Lolkema MP, Steeghs N, Getz G. A deep learning system accurately classifies primary and metastatic cancers using passenger mutation patterns. Nature Communications. 2020 Feb 5;11(1):1-2.

Examples

data("impact")
top_v <- variant_screen_mi(
  maf = impact,
  variant_col = "Variant",
  cancer_col = "CANCER_SITE",
  sample_id_col = "patient_id",
  mi_rank_thresh = 50,
  return_prob_mi = FALSE
)
var_design <- extract_design(
  maf = impact,
  variant_col = "Variant",
  sample_id_col = "patient_id",
  variant_subset = top_v
)

canc_resp <- extract_cancer_response(
  maf = impact,
  cancer_col = "CANCER_SITE",
  sample_id_col = "patient_id"
)
pid <- names(canc_resp)
# create five stratified random folds
# based on the response cancer categories
set.seed(42)
folds <- data.table::data.table(
  resp = canc_resp
)[,
  foldid := sample(rep(1:5, length.out = .N)),
  by = resp
]$foldid

# 80%-20% stratified separation of training and
# test set tumors
idx_train <- pid[folds != 5]
idx_test <- pid[folds == 5]

## Not run: 
# train a classifier on the training set
# using only variants (will have low accuracy
# -- no meta-feature information used)
fit0 <- fit_nnc(
  X = var_design[idx_train, ],
  Y = canc_resp[idx_train],
  trials = 10,
  epochs = 5
)

pred0 <- predict_nnc(
  fit = fit0,
  Xnew = var_design[idx_test, ]
)


## End(Not run)

c7rishi/hidgenclassifier documentation built on June 14, 2024, 11:10 a.m.