View source: R/screen_variant_mi.R
screen_variant_mi | R Documentation |
Mutual Information based feature screening of variants from a mutation annotation file
screen_variant_mi(
maf,
variant_col = "variant",
cancer_col = "cancer",
sample_id_col = "sample",
equal_cancer_prob_mi = TRUE,
return_prob_mi = TRUE,
mi_rank_thresh = 250,
normalize_mi = FALSE,
do_freq_screen = FALSE,
thresh_freq_screen = 1/length(unique(maf[[sample_id_col]])),
...
)
variant_screen_mi(
maf,
variant_col = "variant",
cancer_col = "cancer",
sample_id_col = "sample",
equal_cancer_prob_mi = TRUE,
return_prob_mi = TRUE,
mi_rank_thresh = 250,
normalize_mi = FALSE,
do_freq_screen = FALSE,
thresh_freq_screen = 1/length(unique(maf[[sample_id_col]])),
...
)
maf |
mutation annotation file – a data frame-like object with at least three columns containing variant labels, sample IDs, and cancer sites associated with the sample IDs. NOTE: uniqueness of rows of maf is assumed. |
variant_col |
name of the column in maf containing the variant labels. Defaults to "variant". |
cancer_col |
name of the column in maf containing the cancer sites associated with the sample IDs. Defaults to "cancer". |
sample_id_col |
name of the column in maf containing the sample IDs. Defaults to "sample". |
equal_cancer_prob_mi |
logical. Should the marginal probabilities of cancer sites be assumed equal (i.e., uniform) while computing mutual information? Defaults to TRUE. |
return_prob_mi |
logical. Should the computed mutual information and the cancer site specific probabilities for these screened variants be returned? Defaults to TRUE. |
mi_rank_thresh |
rank threshold for screening variants. The top variants with rank(MI_values) <= mi_rank_thresh are returned. Defaults to 250. |
normalize_mi |
logical. Should the mutual information be normalized by the product of the square roots of the marginal Shannon entropies? Defaults to FALSE. |
do_freq_screen |
logical. Should an overall (relative) frequency-based screening be performed prior to MI based screening? This may reduce the computation load substantially for whole genome data where potentially tens of millions of variants are observed only once. Defaults to FALSE. |
thresh_freq_screen |
Threshold for the overall pan-cancer relative frequency to use if a frequency-based screening is performed before the MI-based screening. Defaults to 1/n_sample, where n_sample is the pan-cancer total number of tumors. Ignored if do_freq_screen = FALSE. |
... |
Unused. |
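The optional frequency pre-screen controlled by do_freq_screen and thresh_freq_screen can be pictured with a small base R sketch. This is not the package's implementation: toy_maf and its values are made up for illustration, and the strict ">" comparison is an assumption (the documentation does not say whether variants exactly at the threshold are kept).

```r
# Toy stand-in for `maf` (hypothetical values); rows are unique variant-sample pairs.
toy_maf <- data.frame(
  variant = c("v1", "v1", "v1", "v2", "v3", "v3"),
  sample  = c("s1", "s2", "s3", "s1", "s2", "s4"),
  cancer  = c("A",  "A",  "B",  "A",  "A",  "B"),
  stringsAsFactors = FALSE
)

# Pan-cancer number of tumors and the documented default threshold 1/n_sample.
n_sample <- length(unique(toy_maf[["sample"]]))
thresh_freq_screen <- 1 / n_sample

# Overall relative frequency of each variant:
# number of distinct tumors carrying it, divided by n_sample.
n_carriers <- tapply(
  toy_maf[["sample"]], toy_maf[["variant"]],
  function(s) length(unique(s))
)
rel_freq <- n_carriers / n_sample

# ASSUMPTION: keep variants whose relative frequency is strictly above the
# threshold, so that variants seen in a single tumor are dropped by default.
kept_variants <- names(rel_freq)[rel_freq > thresh_freq_screen]
toy_maf_screened <- toy_maf[toy_maf[["variant"]] %in% kept_variants, ]

kept_variants
```

Under this assumption, the default threshold removes exactly the variants observed once, which is the case the do_freq_screen description singles out as the main computational burden.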
The function first estimates, via relative frequencies, the cancer-site-specific probabilities of encountering each variant in maf. Then, using these estimated probabilities and the marginal probabilities of the cancer sites, the (possibly normalized) mutual information between (a) the occurrence of variant j in a randomly chosen tumor and (b) the cancer site of that tumor is computed for each variant j in maf.
These MIs are then ranked, and the variant labels with MI rank <= mi_rank_thresh are returned.
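The computation described above can be sketched in base R as follows. This is a minimal illustration under stated assumptions, not the package's code: toy_maf, entropy() and mi_one() are hypothetical names, natural logarithms are assumed, cancer sites are given uniform marginal probabilities (as with equal_cancer_prob_mi = TRUE), and rows are assumed unique as required of maf.

```r
# Toy stand-in for `maf` (hypothetical values).
toy_maf <- data.frame(
  variant = c("v1", "v1", "v2", "v2", "v3"),
  sample  = c("s1", "s2", "s1", "s3", "s4"),
  cancer  = c("A",  "A",  "A",  "B",  "B"),
  stringsAsFactors = FALSE
)

# Number of tumors per cancer site (each sample counted once).
samples <- unique(toy_maf[, c("sample", "cancer")])
n_by_cancer <- vapply(split(samples$sample, samples$cancer), length, integer(1))

# Step 1: relative-frequency estimate of P(variant j present | cancer site c).
counts <- unclass(table(toy_maf$variant, toy_maf$cancer))
prob <- sweep(counts, 2, n_by_cancer[colnames(counts)], "/")

# Step 2: marginal cancer-site probabilities, assumed uniform here.
p_c <- rep(1 / ncol(prob), ncol(prob))

# Shannon entropy (natural log), with 0 * log(0) treated as 0.
entropy <- function(p) {
  p <- p[p > 0]
  -sum(p * log(p))
}

# Step 3: MI between the presence indicator of one variant and the cancer site.
mi_one <- function(p1_given_c, p_c, normalize = FALSE) {
  # Joint distribution: row 1 = variant present, row 2 = variant absent.
  joint <- rbind(p1_given_c * p_c, (1 - p1_given_c) * p_c)
  p_x <- rowSums(joint)
  indep <- outer(p_x, p_c)          # product of the two marginals
  pos <- joint > 0                  # skip zero cells (0 * log(0) = 0)
  mi <- sum(joint[pos] * log(joint[pos] / indep[pos]))
  # Optional normalization by the product of square roots of the marginal
  # entropies; this is NaN when a variant is present in all or in no tumors.
  if (normalize) mi <- mi / (sqrt(entropy(p_x)) * sqrt(entropy(p_c)))
  mi
}

mi <- apply(prob, 1, mi_one, p_c = p_c)
mi_norm <- apply(prob, 1, mi_one, p_c = p_c, normalize = TRUE)

mi
```

In this toy data, v1 occurs in every tumor of site A and in none of site B, so its MI equals the entropy of the uniform site distribution, log(2); v2 occurs equally often in both sites, so its MI is zero.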
A character vector listing the screened variant labels with ranks <= mi_rank_thresh, sorted in decreasing order of MI (the first one has the highest MI).
Optionally, if return_prob_mi = TRUE, a data table named prob_mi listing the cancer-site-specific probabilities of ALL variants and the associated MIs is also returned.
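The ranking and thresholding that produce the returned character vector can be illustrated with a short sketch. The MI values below are invented, and the use of rank() on negated values with its default tie handling is an assumption; the function's actual treatment of ties is not documented here.

```r
# Hypothetical MI values for four variants (not computed from real data).
mi <- c(v1 = 0.69, v2 = 0.00, v3 = 0.22, v4 = 0.05)
mi_rank_thresh <- 2

# Rank 1 corresponds to the largest MI.
# ASSUMPTION: default tie handling of rank() (average ranks).
mi_rank <- rank(-mi)

# Keep variants within the rank threshold, sorted from highest to lowest MI.
keep <- mi_rank <= mi_rank_thresh
screened <- names(sort(mi[keep], decreasing = TRUE))

screened
```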
data("impact")
top_v <- screen_variant_mi(
maf = impact,
variant_col = "Variant",
cancer_col = "CANCER_SITE",
sample_id_col = "patient_id",
mi_rank_thresh = 200,
return_prob_mi = FALSE
)
top_v