knitr::opts_chunk$set(echo = TRUE)
In this vignette, we illustrate how to train a hidden genome (multinomial logistic) classifier on MSK-IMPACT tumors from 10 cancer sites using the R package hidgenclassifier.
We start by setting a seed, loading the hidgenclassifier and magrittr packages (the latter being used for its pipe %>% operator), and importing the impact data from hidgenclassifier.
set.seed(42)
library(magrittr)
library(hidgenclassifier)
data("impact")
The impact mutation annotation dataset (stored as a data.table object) looks like the following:
impact
n_gene <- length(unique(impact$Hugo_Symbol))
n_pid <- length(unique(impact$patient_id))
The dataset consists of `r nrow(impact)` rows and `r ncol(impact)` columns and catalogs somatic mutations (column "Variant") detected at `r n_gene` targeted cancer genes across `r n_pid` tumors from 10 cancer sites. The cancer sites are listed in the column "CANCER_SITE":
unique(impact$CANCER_SITE)
The sample/patient ids of tumors are listed in the column "patient_id". Enter ?impact in the R console to see more details on the dataset. We want to train a hidden genome (multinomial logistic) classifier with the above 10 cancer sites as response classes, using the variants listed in column "Variant" as predictors while simultaneously utilizing two meta-features: gene (column "Hugo_Symbol") and the 96 Single Base Substitution (SBS-96) categories.
We first extract the response cancer classes, labeled by the patient (tumor) ids in the impact dataset, which we store in the variable pid.
canc_resp <- extract_cancer_response(
  maf = impact,
  cancer_col = "CANCER_SITE",
  sample_id_col = "patient_id"
)
pid <- names(canc_resp)
To train the model, we first split the dataset into 5 stratified random folds, based on the cancer categories. We then define our training set by combining four of the five folds, and use the remaining fifth fold as our test set.
set.seed(42)
folds <- data.table::data.table(
  resp = canc_resp
)[,
  foldid := sample(rep(1:5, length.out = .N)),
  by = resp
]$foldid

# 80%-20% stratified separation of training and
# test set tumors
pid_train <- pid[folds != 5]
pid_test <- pid[folds == 5]
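As a quick sanity check of the stratified splitting idea, the following base R sketch assigns folds within each class on toy labels and tabulates the result. The labels and the names toy_resp and toy_folds are made up for illustration; the vignette itself uses data.table for this step.

```r
# Toy class labels (hypothetical): 10 tumors of class A, 5 of class B
set.seed(1)
toy_resp <- c(rep("A", 10), rep("B", 5))

# Within each class, shuffle the fold labels 1..5 recycled to the class size
toy_folds <- ave(
  seq_along(toy_resp), toy_resp,
  FUN = function(idx) sample(rep(1:5, length.out = length(idx)))
)

# Cross-tabulate class by fold: each fold gets an equal share of each class
fold_table <- table(toy_resp, toy_folds)
fold_table
```

Because the shuffling happens separately within each class, holding out one fold leaves roughly 20% of every class in the test set, regardless of how unbalanced the classes are.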
To fit a hidden genome classifier, we need to (a) obtain the variant design matrix (X), (b) compute the meta-feature product design-meta-design matrices (XU), (c) column-bind these X and XU matrices, and (d) normalize each row of the resulting column-bound matrix by the square-root of the total mutation burden observed in that tumor (row). Note that because the XU matrix combines/condenses information from all variants (including less informative rare individual variants), given XU we can filter out the less informative/discriminative columns from X. That is, we can do a feature screening of the columns of X before using it as predictor in the hidden genome model, after we have computed XU.
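Steps (a) through (d) can be traced by hand on a tiny example. The sketch below uses made-up matrices (three tumors, four variants, two genes) and base R only; it illustrates the algebra described above and is not the package's implementation.

```r
# Toy binary variant design matrix X: 3 tumors x 4 variants (hypothetical)
X <- matrix(
  c(1, 0, 1, 0,
    0, 1, 1, 1,
    1, 0, 0, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(paste0("tumor", 1:3), paste0("v", 1:4))
)

# Meta-design matrix U: 4 variants x 2 genes; each variant maps to one gene
U <- matrix(
  c(1, 0,
    1, 0,
    0, 1,
    0, 1),
  nrow = 4, byrow = TRUE,
  dimnames = list(colnames(X), c("Gene_A", "Gene_B"))
)

# (b) XU: per-tumor mutation counts in each gene, using ALL variants
XU <- X %*% U

# Total mutation burden per tumor (row sums of the full X)
tmb <- rowSums(X)

# Feature screening: keep only a subset of X's columns, after XU is computed
X_screened <- X[, c("v1", "v3"), drop = FALSE]

# (c) column-bind, then (d) divide each row by sqrt(tmb) of that tumor
predictor <- cbind(X_screened, XU) / sqrt(tmb)
predictor
```

Note that the dropped variants v2 and v4 still contribute to the gene counts in XU, which is why screening X after computing XU loses little information.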
A mutual information (MI) based feature screening is implemented in the function variant_screen_mi, which we now use to get the most discriminative variants with MI rank $\leq$ 250 (stored in the variable top_v in the following). Note that the screening must be done on the training set only, which is ensured by subsetting the impact data to rows with patient_id %in% pid_train when passing the maf file impact into variant_screen_mi:
top_v <- variant_screen_mi(
  maf = impact[patient_id %in% pid_train],
  variant_col = "Variant",
  cancer_col = "CANCER_SITE",
  sample_id_col = "patient_id",
  mi_rank_thresh = 250,
  return_prob_mi = FALSE,
  do_freq_screen = FALSE
)
Note that by default do_freq_screen is set to FALSE; if do_freq_screen = TRUE, then an overall (relative) frequency-based screening is performed prior to the MI-based screening. This may reduce the computational load substantially for whole-genome datasets, where potentially tens of millions of variants, each with little individual discriminative information, are observed only once. The relative frequency threshold can be set by thresh_freq_screen (defaults to 1/n_sample, where n_sample is the pan-cancer total number of tumors).
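To make the idea of MI-based ranking concrete, here is a minimal base R sketch that computes the mutual information between each binary variant indicator and the cancer label on toy data, then keeps the variants within a rank threshold. The helper mutual_info and all data are hypothetical, and the exact estimator used inside the package may differ.

```r
# Mutual information (in nats) between two discrete vectors
mutual_info <- function(x, y) {
  p_xy <- table(x, y) / length(x)   # joint distribution
  p_x <- rowSums(p_xy)              # marginal of x
  p_y <- colSums(p_xy)              # marginal of y
  expected <- outer(p_x, p_y)       # product of marginals
  nz <- p_xy > 0                    # treat 0 * log(0) as 0
  sum(p_xy[nz] * log(p_xy[nz] / expected[nz]))
}

# Toy data (hypothetical): 8 tumors, 2 cancer classes, 3 variants
cancer <- rep(c("A", "B"), each = 4)
variants <- cbind(
  v_perfect = c(1, 1, 1, 1, 0, 0, 0, 0),  # present only in class A
  v_partial = c(1, 1, 0, 0, 1, 0, 0, 0),  # weakly associated with class A
  v_noise   = c(1, 0, 1, 0, 1, 0, 1, 0)   # independent of class
)

mi <- apply(variants, 2, mutual_info, y = cancer)

# Keep variants whose MI rank (1 = most informative) is within the threshold
mi_rank_thresh <- 2
top_v_toy <- names(mi)[rank(-mi, ties.method = "min") <= mi_rank_thresh]
top_v_toy
```

As in the vignette, such a ranking should be computed from training tumors only, so that the test set plays no role in choosing the predictors.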
With the most discriminative variants determined and stored in top_v, we now extract the variant design matrix X restricted to the variants in top_v and for all tumors. (The matrix will be row-subsetted to pid_train during training; the remaining rows will be used for prediction.)
X_variant <- extract_design(
  maf = impact,
  variant_col = "Variant",
  sample_id_col = "patient_id",
  variant_subset = top_v
)
dim(X_variant)
Next we compute the XU matrix (for all tumors) for the meta-feature gene. This can be obtained via the function extract_design_mdesign_mcat, by specifying the meta-feature column to be "Hugo_Symbol":
XU_gene <- extract_design_mdesign_mcat(
  maf = impact,
  variant_col = "Variant",
  mfeat_col = "Hugo_Symbol",
  sample_id_col = "patient_id",
  mfeat_subset = NULL
) %>%
  # paste0() avoids the space that paste() would insert after "Gene_"
  magrittr::set_colnames(
    paste0("Gene_", colnames(.))
  )
dim(XU_gene)
(The column names are prefixed with "Gene_" for easier identification of predictors in the fitted models.) Note that by supplying an appropriate (non-NULL) mfeat_subset, the computation in extract_design_mdesign_mcat can be restricted to a specific subset of genes. This is useful when analyzing whole-exome and whole-genome datasets.
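For a categorical meta-feature such as gene, each entry of XU is simply the number of distinct variants a tumor carries in that category. The following base R sketch reproduces that count on a made-up maf-like table (column names mirror impact, values are hypothetical); it is meant to show what the matrix contains, not how the package computes it.

```r
# Toy maf-like table (hypothetical values; column names mirror impact)
toy_maf <- data.frame(
  patient_id  = c("P1", "P1", "P1", "P2", "P2", "P3"),
  Hugo_Symbol = c("TP53", "TP53", "KRAS", "KRAS", "EGFR", "TP53"),
  Variant     = c("TP53_a", "TP53_b", "KRAS_a", "KRAS_a", "EGFR_a", "TP53_a")
)

# One row per (tumor, gene, variant), then count variants per tumor and gene
toy_unique <- unique(toy_maf)
XU_gene_toy <- unclass(table(toy_unique$patient_id, toy_unique$Hugo_Symbol))
XU_gene_toy

# Restricting to a gene subset, analogous in spirit to a non-NULL mfeat_subset
XU_gene_toy[, c("KRAS", "TP53"), drop = FALSE]
```

Restricting the columns up front matters mostly for whole-exome or whole-genome data, where the number of genes is far larger than in a targeted panel.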
Now we compute the XU matrix for the SBS-96 meta-feature. This is done by supplying various nucleotide change specific columns from the maf file (impact) to the function extract_design_mdesign_sbs96.
XU_sbs96 <- extract_design_mdesign_sbs96(
  maf = impact,
  chromosome_col = "Chromosome",
  start_position_col = "Start_Position",
  end_position_col = "End_Position",
  ref_col = "Reference_Allele",
  alt_col = "Tumor_Seq_Allele2",
  sample_id_col = "patient_id"
) %>%
  # paste0() avoids the space that paste() would insert after "SBS_"
  magrittr::set_colnames(
    paste0("SBS_", colnames(.))
  )
dim(XU_sbs96)
(The column names are prefixed with "SBS_" for easier identification of predictors in the fitted models.) Note that the function extract_design_mdesign_sbs96 calls various functions from SomaticSignatures and other Bioconductor packages under the hood, which use various genomic datasets and also overwrite a few default S3 methods. If overwriting of these methods (see above) is a concern, we recommend computing XU_sbs96 in a non-interactive R session and saving the result as an R data object (using, say, saveRDS), or refreshing the R session after computing the above matrix in an interactive R session.
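One way to follow this recommendation is a small compute-once-then-reuse pattern built on saveRDS and readRDS. The helper cache_rds below is hypothetical (it is not part of hidgenclassifier), and a toy matrix stands in for the expensive computation.

```r
# Hypothetical helper: compute an object once, then reuse the saved copy
cache_rds <- function(path, compute) {
  if (file.exists(path)) {
    return(readRDS(path))
  }
  result <- compute()
  saveRDS(result, path)
  result
}

# Toy stand-in for an expensive step such as computing XU_sbs96
cache_file <- file.path(tempdir(), "XU_sbs96_toy.rds")
unlink(cache_file)  # start from a clean state for this demonstration

first  <- cache_rds(cache_file, function() matrix(1:6, nrow = 2))
# The second call reads the file, so the compute function is never run
second <- cache_rds(cache_file, function() stop("not recomputed"))
identical(first, second)
```

In practice the compute function would wrap the call to extract_design_mdesign_sbs96 and the script would be run non-interactively (for example with Rscript), so that any overwritten methods stay confined to that session.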
Finally, we compute the total mutation burden per tumor, labeled by tumor (patient) ids, using the function extract_tmb:
tmb <- extract_tmb(
  maf = impact,
  variant_col = "Variant",
  sample_id_col = "patient_id"
)
We are now in a position to create the predictor matrix for the hidden genome model by column-binding all X and XU matrices and then normalizing the rows of the column-bound matrix by the square root of the total mutation burden in each tumor; the resulting entries correspond to various scalar projections as described in the manuscript. We use the convenience function divide_rows for this normalization inside a chained computation (using the magrittr pipe).
predictor_mat <- cbind(
  X_variant[pid, ],
  XU_gene[pid, ],
  XU_sbs96[pid, ],
  tmb = tmb[pid]
) %>%
  divide_rows(sqrt(tmb[pid]))
(Note that tmb corresponds to the column of XU associated with an "intercept" meta-feature of all 1's.) The resulting predictor_mat will be used as the predictor matrix in the hidden genome model.
We use the function fit_mlogit to fit a hidden genome multinomial logistic classifier with predictor_mat as the predictor matrix and canc_resp as the response cancer classes. The fitting is restricted to the training set tumors pid_train. The function takes a while to run (about 30 minutes on a Windows 10 computer with 16 GB RAM and 4 cores), so we recommend saving the result to a file once the function finishes. Enter ?fit_mlogit in the R console to see a detailed description of the arguments of fit_mlogit.
fit_impact <- fit_mlogit(
  X = predictor_mat[pid_train, ],
  Y = canc_resp[pid_train]
)
To predict cancer sites of the test set tumors based on the above fitted hidden genome model, we simply use the function predict_mlogit:
pred_impact <- predict_mlogit(
  fit = fit_impact,
  Xnew = predictor_mat[pid_test, ]
)
This creates a list with entries (a) probs_predicted: an n_test_tumor by n_cancer matrix of multinomial probabilities, giving the predicted probability of each test set tumor being classified into each cancer site, and (b) predicted: a character vector of hard class assignments based on the predicted multinomial probabilities (obtained by assigning each tumor to the class with the highest predicted probability).
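The relationship between the two entries, and a simple way to score the hard classes against known labels, can be sketched on toy numbers. Everything below is made up for illustration (the probabilities, the tumor ids, and the true labels); only base R is used.

```r
# Toy predicted probabilities (hypothetical): 4 test tumors x 3 cancer sites
probs <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.6, 0.3,
    0.3, 0.3, 0.4,
    0.5, 0.4, 0.1),
  nrow = 4, byrow = TRUE,
  dimnames = list(paste0("T", 1:4), c("Breast", "Colon", "Lung"))
)

# Hard class = column with the highest probability in each row
hard <- colnames(probs)[max.col(probs, ties.method = "first")]

# Hypothetical true labels for the same tumors, in the same order
truth <- c("Breast", "Colon", "Lung", "Colon")

# Overall accuracy and a confusion table of true vs predicted classes
accuracy <- mean(hard == truth)
confusion <- table(truth = truth, predicted = hard)
accuracy
confusion
```

With the objects from this vignette, the analogous comparison would be between pred_impact$predicted and canc_resp[pid_test], assuming both are in the same tumor order.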
Rigorous quantification of individual predictor effects can be obtained through odds ratios from a fitted multinomial logistic regression model. We consider the one-vs-rest odds ratio of a tumor being classified into a specific cancer category, relative to not being classified into that category, for a one standard deviation change in each predictor from its mean, while keeping all other predictors fixed at their respective means. This can be obtained using the function odds_ratio_mlogit:
or <- odds_ratio_mlogit(
  fit = fit_impact,
  type = "one-vs-rest",
  log = TRUE
)
Note that odds ratios are computed on a log scale by default.
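The definition above can be worked through numerically for a small softmax model. The sketch below uses hypothetical intercepts, coefficients, predictor means and standard deviations; it follows the verbal definition (shift one predictor by one standard deviation from its mean, hold the others at their means) and is not a description of how odds_ratio_mlogit is implemented.

```r
# Softmax class probabilities for one predictor vector x
softmax_probs <- function(x, intercept, beta) {
  eta <- intercept + as.vector(x %*% beta)  # one linear predictor per class
  exp(eta - max(eta)) / sum(exp(eta - max(eta)))
}

# Hypothetical fitted quantities: 2 predictors, 3 classes
intercept <- c(A = 0.2, B = 0.0, C = -0.1)
beta <- matrix(
  c(1.0, -0.5, 0.0,
    0.0,  0.3, 0.8),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("x1", "x2"), names(intercept))
)
x_mean <- c(x1 = 0.5, x2 = 1.0)
x_sd   <- c(x1 = 0.2, x2 = 0.4)

# One-vs-rest log odds ratio for a one-sd increase in predictor j
log_or_one_vs_rest <- function(j) {
  x_shift <- x_mean
  x_shift[j] <- x_shift[j] + x_sd[j]
  p0 <- softmax_probs(x_mean, intercept, beta)
  p1 <- softmax_probs(x_shift, intercept, beta)
  qlogis(p1) - qlogis(p0)  # difference of log odds, class vs rest
}

# Rows are classes, columns are predictors
log_or <- sapply(names(x_mean), log_or_one_vs_rest)
log_or
```

A positive entry means that raising the predictor makes the class more likely relative to all other classes combined; exponentiating an entry gives the odds ratio on the original scale.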