screen_variant_mi: Mutual Information based feature screening of variants from a...

View source: R/screen_variant_mi.R

screen_variant_mi R Documentation

Mutual Information based feature screening of variants from a mutation annotation file

Description

Mutual Information based feature screening of variants from a mutation annotation file

Usage

screen_variant_mi(
  maf,
  variant_col = "variant",
  cancer_col = "cancer",
  sample_id_col = "sample",
  equal_cancer_prob_mi = TRUE,
  return_prob_mi = TRUE,
  mi_rank_thresh = 250,
  normalize_mi = FALSE,
  do_freq_screen = FALSE,
  thresh_freq_screen = 1/length(unique(maf[[sample_id_col]])),
  ...
)

variant_screen_mi(
  maf,
  variant_col = "variant",
  cancer_col = "cancer",
  sample_id_col = "sample",
  equal_cancer_prob_mi = TRUE,
  return_prob_mi = TRUE,
  mi_rank_thresh = 250,
  normalize_mi = FALSE,
  do_freq_screen = FALSE,
  thresh_freq_screen = 1/length(unique(maf[[sample_id_col]])),
  ...
)

Arguments

maf

mutation annotation file – a data frame-like object with at least three columns containing variant labels, sample IDs, and cancer sites associated with the sample IDs. NOTE: uniqueness of rows of maf is assumed.

variant_col

name of the column in maf containing variant labels.

cancer_col

name of the column in maf that corresponds to cancer sites for the tumor samples.

sample_id_col

name of the column in maf containing tumor sample IDs.

equal_cancer_prob_mi

logical. Should the marginal probabilities of cancer sites be assumed equal (i.e., uniform) while computing mutual information? If FALSE, the relative frequencies of cancer sites in maf are used. CAUTION: the (sample) relative frequencies of cancer sites in maf may not necessarily be good approximations of the truth.

return_prob_mi

logical. Should the computed mutual information and the cancer site specific probabilities for these screened variants be returned? Defaults to TRUE.

mi_rank_thresh

rank threshold for screening variants. The top variants with rank(MI_values) <= mi_rank_thresh are returned. Defaults to 250.

normalize_mi

logical. Should mutual information be normalized by the product of the square roots of the marginal Shannon entropies? Defaults to FALSE.

do_freq_screen

logical. Should an overall (relative) frequency-based screening be performed prior to MI-based screening? This may reduce the computational load substantially for whole genome data, where potentially tens of millions of variants are observed only once. Defaults to FALSE.

thresh_freq_screen

Threshold for the overall pan-cancer relative frequency of a variant, used if a frequency-based screening is performed before MI-based screening. Defaults to 1/n_sample, where n_sample is the pan-cancer total number of tumors. Ignored if do_freq_screen = FALSE.

...

Unused.
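
The frequency-based pre-screen controlled by do_freq_screen and thresh_freq_screen can be pictured with a small base R sketch. This is not the package's implementation: the toy maf data, the object names (n_carriers, rel_freq, keep, maf_screened), and in particular the strict inequality used for the cutoff are assumptions made for illustration.

```r
# Toy MAF with one row per (sample, variant) pair; all values are made up.
maf <- data.frame(
  sample  = c("s1", "s1", "s2", "s3", "s4", "s4", "s5", "s6", "s6"),
  cancer  = c("A",  "A",  "A",  "A",  "B",  "B",  "B",  "B",  "B"),
  variant = c("v1", "v2", "v1", "v2", "v2", "v3", "v3", "v2", "v4"),
  stringsAsFactors = FALSE
)

# Pan-cancer number of tumors and the documented default threshold.
n_sample <- length(unique(maf$sample))
thresh_freq_screen <- 1 / n_sample

# Overall relative frequency of each variant: tumors carrying it / all tumors
# (relies on rows of maf being unique).
n_carriers <- lengths(split(maf$sample, maf$variant))
rel_freq <- n_carriers / n_sample

# ASSUMPTION: a variant is kept when its relative frequency is strictly
# greater than the threshold, so variants seen in a single tumor are dropped
# under the default. The package may use a different comparison.
keep <- names(rel_freq)[rel_freq > thresh_freq_screen]
maf_screened <- maf[maf$variant %in% keep, ]
```

Under this assumption the singleton variant v4 is removed before any mutual information is computed, which is where the saving comes from when most variants occur only once.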

Details

The function first estimates, via relative frequencies, the cancer site specific probabilities of encountering EACH variant in maf. Then, using these estimated probabilities and the marginal probabilities of cancer sites, the (possibly normalized) mutual information between (a) the occurrence of variant j in a randomly chosen tumor and (b) the cancer site of that tumor is computed for each variant j in maf. These MIs are then ranked, and the variant labels with MI rank <= mi_rank_thresh are returned.
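
The steps above can be sketched in base R on a toy data set. This is an illustrative reconstruction under stated assumptions, not the package's code: the toy maf, the helper xlogx, and all object names are hypothetical; the cancer site marginals are taken to be uniform (as with equal_cancer_prob_mi = TRUE); natural logarithms are used; and rank 1 is assumed to correspond to the highest MI.

```r
# Toy MAF with one row per (sample, variant) pair; all values are made up.
maf <- data.frame(
  sample  = c("s1", "s1", "s2", "s3", "s4", "s4", "s5", "s6"),
  cancer  = c("A",  "A",  "A",  "A",  "B",  "B",  "B",  "B"),
  variant = c("v1", "v2", "v1", "v2", "v2", "v3", "v3", "v2"),
  stringsAsFactors = FALSE
)

# Number of tumors per cancer site (each sample counted once).
samples <- unique(maf[, c("sample", "cancer")])
n_c <- lengths(split(samples$sample, samples$cancer))

# counts[j, c]: tumors of site c carrying variant j (rows of maf are unique).
counts <- unclass(table(maf$variant, maf$cancer))

# p[j, c]: relative-frequency estimate of P(variant j present | site c).
p <- sweep(counts, 2, n_c[colnames(counts)], "/")

# ASSUMPTION: uniform marginal probabilities over cancer sites.
pi_c <- rep(1 / ncol(p), ncol(p))

# Marginal probability that variant j is present in a random tumor.
p_j <- drop(p %*% pi_c)

# x * log(x / y), with the convention 0 * log(0) = 0.
xlogx <- function(x, y) ifelse(x > 0, x * log(x / y), 0)

# MI between the presence indicator of variant j and the cancer site.
mi <- rowSums(sweep(xlogx(p, p_j) + xlogx(1 - p, 1 - p_j), 2, pi_c, "*"))

# Optional normalization by sqrt(H(X_j)) * sqrt(H(C)).
h_x <- -(xlogx(p_j, 1) + xlogx(1 - p_j, 1))
h_c <- -sum(xlogx(pi_c, 1))
nmi <- mi / sqrt(h_x * h_c)

# ASSUMPTION: rank 1 is the largest MI; keep ranks <= mi_rank_thresh.
mi_rank_thresh <- 2
mi_rank <- rank(-mi, ties.method = "min")
top_variants <- names(sort(mi_rank[mi_rank <= mi_rank_thresh]))
```

In this toy data v2 is equally frequent in both sites, so its MI is zero, while v1 and v3 each occur in only one site and share the top rank.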

Value

a character vector listing the screened variant labels with ranks <= mi_rank_thresh, sorted so that the first one has the highest MI. Optionally, if return_prob_mi = TRUE, a data table named prob_mi listing the cancer site specific probabilities of ALL variants and the associated MIs is also returned.

Examples

data("impact")
top_v <- screen_variant_mi(
  maf = impact,
  variant_col = "Variant",
  cancer_col = "CANCER_SITE",
  sample_id_col = "patient_id",
  mi_rank_thresh = 200,
  return_prob_mi = FALSE
)
top_v



c7rishi/hidgenclassifier documentation built on June 14, 2024, 11:10 a.m.