calc_one_v_rest_auc: Calculating area under Precision-Recall curve (PRC) and...


calc_one_v_rest_auc    R Documentation

Calculating area under Precision-Recall curve (PRC) and Receiver-Operator characteristic curve (ROC) for all one-vs-rest comparisons in the fitted model

Description

Calculates the area under the Precision-Recall curve (PRC) and the Receiver Operating Characteristic (ROC) curve for every one-vs-rest comparison in the fitted model.

Usage

calc_one_v_rest_auc(
  fit = NULL,
  Xnew = NULL,
  Ynew = NULL,
  normalize_rows = NULL,
  measure = c("PRC", "ROC"),
  fitted_prob = NULL,
  include_baseline = TRUE,
  ...
)

Arguments

fit

fitted hidden genome classifier object. Experimental: can be NULL, in which case fitted_prob and Ynew must be provided.

Xnew, Ynew

New predictor design matrix and the corresponding cancer site labels. If provided, the trained hidden genome model (supplied through fit) is used to obtain predicted probabilities from Xnew; these probabilities are then used as fitted_prob, together with Ynew, to calculate the AUCs. If Xnew is supplied, then Ynew must also be supplied. If fitted_prob is supplied, then Xnew is ignored.

normalize_rows

vector of the same length as nrow(Xnew) to be used to normalize the rows of Xnew. If NULL (default), no normalization is performed.

measure

Type of curve to use. Options are "PRC" (Precision-Recall Curve) and "ROC" (Receiver Operating Characteristic curve). Can be a vector, in which case each listed measure is computed.

fitted_prob

an n_tumor x n_cancer matrix of predicted classification probabilities to use for calculating ROC/PRC AUCs, where n_tumor denotes the number of tumor/sample units and n_cancer is the number of cancer sites in the fitted hidden genome model (supplied through fit). The probabilities are compared against the "true" class labels provided in Ynew, if supplied, or otherwise against the original training Y labels stored in the trained model. Row names and column names must be identical to the tumor/sample names and cancer labels in Ynew (if supplied) or as used in the fitted model. If NULL (default), the fitted probabilities are obtained from the model itself, either by extracting pre-validated predictive probabilities (only available for mlogit models) or by simply using the fitted model to make predictions on the training set.

include_baseline

logical. Should the null baseline value(s) be returned along with the computed observed value(s) of the measure(s)? Here, the null baseline is the expected value of the corresponding measure for a "baseline" classifier that assigns class labels to the sample units uniformly at random.
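The idea of a null baseline can be illustrated with a small base-R simulation. This is only a sketch under stated assumptions: the helper roc_auc(), the labels y, and the noise scores are hypothetical and are not part of the package, and the package's own baseline computation is not reproduced here. It relies on two standard facts: a classifier whose scores are unrelated to the labels has an expected ROC AUC of 0.5, and its expected precision at any recall level equals the prevalence of the positive class.

```r
# Rank-based (Mann-Whitney) estimate of the ROC AUC for one binary split.
# Hypothetical helper, not a function from hidgenclassifier or precrec.
roc_auc <- function(score, is_pos) {
  r <- rank(score)  # average ranks, so ties are handled
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

set.seed(1)
# Hypothetical labels: three classes with unequal prevalence
y <- sample(c("A", "B", "C"), size = 3000, replace = TRUE,
            prob = c(0.6, 0.3, 0.1))
classes <- sort(unique(y))

# A "classifier" that ignores the data: its scores are pure noise
noise <- matrix(runif(length(y) * length(classes)),
                ncol = length(classes),
                dimnames = list(NULL, classes))

# One-vs-rest ROC AUC of the noise scores: close to 0.5 for every class
roc_null <- sapply(classes, function(k) roc_auc(noise[, k], y == k))

# Conventional PRC baseline: the prevalence of each class
prc_null <- sapply(classes, function(k) mean(y == k))

round(roc_null, 2)
round(prc_null, 2)
```

Note that the ROC baseline is the same for every class, while the PRC baseline differs by class, which is why rare cancer sites can show a low PRC AUC and still be well above their baseline.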

Details

Under the hood, the function uses several functions from the R package precrec to compute the performance metrics. The argument fitted_prob, when supplied, should ideally contain predictive probabilities for training set tumors evaluated under a cross-validation framework. If it is not supplied, the function uses pre-validated prediction probabilities extracted from the fit for mlogit models; for other models it uses overoptimistic prediction probabilities obtained by simply applying the fitted model to the training data.

Value

Returns a data.table with n_class + 1 rows, where n_class denotes the number of cancer types present in the fitted model; the final row provides the Macro (average) metrics. The table has length(measure) + 1 columns ("Class" and one column per measure), or 2 * length(measure) + 1 columns if include_baseline = TRUE.
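The shape of the returned table can be mimicked in base R for a single measure. The sketch below is not the package's implementation: roc_auc() is a hypothetical rank-based helper standing in for precrec, the inputs Ynew and fitted_prob are made-up toy values, a plain data.frame is used in place of a data.table, and the "Macro" row is assumed to be the unweighted mean of the per-class values.

```r
# Rank-based (Mann-Whitney) ROC AUC for one binary split.
# Hypothetical helper, used here instead of precrec.
roc_auc <- function(score, is_pos) {
  r <- rank(score)  # average ranks, so ties are handled
  n_pos <- sum(is_pos)
  n_neg <- sum(!is_pos)
  (sum(r[is_pos]) - n_pos * (n_pos + 1) / 2) / (n_pos * n_neg)
}

# Toy inputs: 6 tumors, 3 cancer sites (all values are made up)
Ynew <- c(t1 = "A", t2 = "A", t3 = "B", t4 = "B", t5 = "C", t6 = "C")
fitted_prob <- matrix(
  c(0.7, 0.2, 0.1,
    0.5, 0.3, 0.2,
    0.2, 0.6, 0.2,
    0.4, 0.4, 0.2,
    0.1, 0.2, 0.7,
    0.3, 0.4, 0.3),
  ncol = 3, byrow = TRUE,
  # row names match the tumor names in Ynew, column names the class labels
  dimnames = list(names(Ynew), c("A", "B", "C"))
)

# One-vs-rest: class k is "positive", every other class is "negative"
classes <- colnames(fitted_prob)
roc <- vapply(classes,
              function(k) roc_auc(fitted_prob[, k], Ynew == k),
              numeric(1))

# n_class rows plus a final macro-average row (assumed unweighted mean)
out <- data.frame(Class = c(classes, "Macro"),
                  ROC = c(roc, mean(roc)),
                  row.names = NULL)
out
```

With measure = c("PRC", "ROC") the real output would carry one such column per measure, plus the matching baseline columns when include_baseline = TRUE.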

Note

The function uses package precrec under the hood to compute the AUCs. Please install precrec before using calc_one_v_rest_auc.

Examples

data("impact")
top_v <- variant_screen_mi(
  maf = impact,
  variant_col = "Variant",
  cancer_col = "CANCER_SITE",
  sample_id_col = "patient_id",
  mi_rank_thresh = 50,
  return_prob_mi = FALSE
)
var_design <- extract_design(
  maf = impact,
  variant_col = "Variant",
  sample_id_col = "patient_id",
  variant_subset = top_v
)

canc_resp <- extract_cancer_response(
  maf = impact,
  cancer_col = "CANCER_SITE",
  sample_id_col = "patient_id"
)
pid <- names(canc_resp)
# create five stratified random folds
# based on the response cancer categories
set.seed(42)
folds <- data.table::data.table(
  resp = canc_resp
)[,
  foldid := sample(rep(1:5, length.out = .N)),
  by = resp
]$foldid

# 80%-20% stratified separation of training and
# test set tumors
idx_train <- pid[folds != 5]
idx_test <- pid[folds == 5]

# train a classifier on the training set
# using only variants (will have low accuracy
# because no meta-feature information is used)
fit0 <- fit_mlogit(
  X = var_design[idx_train, ],
  Y = canc_resp[idx_train]
)

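# one-vs-rest AUCs for the training set tumors
# (both measures, then each measure separately)
calc_one_v_rest_auc(fit0)
calc_one_v_rest_auc(fit0, measure = "PRC")
calc_one_v_rest_auc(fit0, measure = "ROC")

# one-vs-rest AUCs for the held-out test set tumors
calc_one_v_rest_auc(
  fit0,
  Xnew = var_design[idx_test, ],
  Ynew = canc_resp[idx_test]
)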


c7rishi/hidgenclassifier documentation built on June 14, 2024, 11:10 a.m.