create_features_df: Create Data Frame of Features for Driver Gene Prioritization
In driveR: Prioritizing Cancer Driver Genes Using Genomics Data

create_features_df

R Documentation

Description

Creates a data frame of features for prioritizing cancer driver genes.

Usage

create_features_df(
  annovar_csv_path,
  scna_df,
  phenolyzer_annotated_gene_list_path,
  batch_analysis = FALSE,
  prep_phenolyzer_input = FALSE,
  build = "GRCh37",
  log2_ratio_threshold = 0.25,
  gene_overlap_threshold = 25,
  MCR_overlap_threshold = 25,
  hotspot_threshold = 5L,
  log2_hom_loss_threshold = -1,
  verbose = TRUE,
  na.string = "."
)

Arguments

`annovar_csv_path`	path to 'ANNOVAR' csv output file
`scna_df`	the SCNA segments data frame. Must contain the following columns: `chr` (chromosome the segment is located in), `start` (start position of the segment), `end` (end position of the segment), `log2ratio` (log₂ ratio of the segment)
`phenolyzer_annotated_gene_list_path`	path to 'phenolyzer' "annotated_gene_list" file
`batch_analysis`	boolean to indicate whether to perform batch analysis (`TRUE`) or personalized analysis (`FALSE`, default). If `TRUE`, a column named 'tumor_id' should be present in both the ANNOVAR csv and the SCNA table.
`prep_phenolyzer_input`	boolean to indicate whether or not to create a vector of genes for use as the input of 'phenolyzer' (default = `FALSE`). If `TRUE`, the features data frame is not created and instead the vector of gene symbols (union of all genes for which scores are available) is returned.
`build`	genome build for the SCNA segments data frame (default = "GRCh37")
`log2_ratio_threshold`	the log₂ ratio threshold for keeping high-confidence SCNA events (default = 0.25)
`gene_overlap_threshold`	the percentage threshold for the overlap between a segment and a transcript (default = 25). A transcript is assigned a segment's SCNA event only if the segment overlaps the transcript by more than this threshold.
`MCR_overlap_threshold`	the percentage threshold for the overlap between a gene and a minimal common region (MCR) (default = 25). A gene is assigned the SCNA density of an MCR only if the gene overlaps the MCR by more than this threshold.
`hotspot_threshold`	to determine hotspot genes, the (integer) threshold for the minimum number of cases with a certain mutation in COSMIC (default = 5)
`log2_hom_loss_threshold`	to determine double-hit events, the log₂ threshold for identifying homozygous loss events (default = -1).
`verbose`	boolean controlling verbosity (default = `TRUE`)
`na.string`	string that was used to indicate when a score is not available during annotation with ANNOVAR (default = ".")
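
As an illustration of the `scna_df` requirements and the thresholds described above, the following is a minimal base-R sketch, not the package's implementation. The segment values are invented, `pct_overlap()` is a hypothetical helper, and two details are assumptions not stated in this page: that `log2_ratio_threshold` is compared against the absolute log₂ ratio, and that the overlap percentage is measured relative to the transcript length.

```r
# Hypothetical SCNA segments; all values are made up for illustration
scna_df <- data.frame(
  chr = c("1", "1", "17"),
  start = c(1000000, 5000000, 7500000),
  end = c(2000000, 6000000, 7700000),
  log2ratio = c(0.10, 0.80, -1.20),
  stringsAsFactors = FALSE
)

# Check that the four required columns are present
required_cols <- c("chr", "start", "end", "log2ratio")
stopifnot(all(required_cols %in% colnames(scna_df)))

# Assumption: the threshold is applied to the absolute log2 ratio,
# so both gains and losses can pass
log2_ratio_threshold <- 0.25
high_conf <- scna_df[abs(scna_df$log2ratio) > log2_ratio_threshold, ]

# Hypothetical helper (not part of driveR): percentage of a transcript
# covered by a segment, using 1-based inclusive coordinates.
# Assumption: the denominator is the transcript length.
pct_overlap <- function(seg_start, seg_end, tx_start, tx_end) {
  overlap_len <- max(0, min(seg_end, tx_end) - max(seg_start, tx_start) + 1)
  100 * overlap_len / (tx_end - tx_start + 1)
}

# A made-up transcript half covered by the second segment
gene_overlap_threshold <- 25
passes <- pct_overlap(5000000, 6000000, 5900000, 6100000) > gene_overlap_threshold
```

For batch analysis (`batch_analysis = TRUE`), the same data frame would additionally need a `tumor_id` column, as noted in the argument description.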

Value

If prep_phenolyzer_input=FALSE (default), a data frame of features for prioritizing cancer driver genes (gene_symbol as the first column and 26 other columns containing features). If prep_phenolyzer_input=TRUE, the function returns a vector of gene symbols (the union of all gene symbols for which scores are available) to be used as the input for 'phenolyzer' analysis.

The features data frame contains the following columns:

gene_symbol: HGNC gene symbol
metaprediction_score: the maximum metapredictor (coding) impact score for the gene
noncoding_score: the maximum non-coding PHRED-scaled CADD score for the gene
scna_score: SCNA proxy score. SCNA density (SCNA/Mb) of the minimal common region (MCR) in which the gene is located
hotspot_double_hit: boolean indicating whether the gene is a hotspot gene (indication of oncogenes) or subject to double-hit (indication of tumor-suppressor genes)
phenolyzer_score: 'phenolyzer' score for the gene
hsa03320: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04010: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04020: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04024: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04060: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04066: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04110: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04115: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04150: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04151: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04210: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04310: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04330: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04340: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04350: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04370: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04510: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04512: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04520: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04630: boolean indicating whether or not the gene takes part in this KEGG pathway
hsa04915: boolean indicating whether or not the gene takes part in this KEGG pathway
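
The column list above can be turned into a quick sanity check on a returned features data frame. This is a sketch only: `has_expected_columns()` is a hypothetical helper, not a driveR function, and the one-row `toy_df` holds invented values in place of real output.

```r
# KEGG pathway columns, copied from the list above
kegg_cols <- c(
  "hsa03320", "hsa04010", "hsa04020", "hsa04024", "hsa04060", "hsa04066",
  "hsa04110", "hsa04115", "hsa04150", "hsa04151", "hsa04210", "hsa04310",
  "hsa04330", "hsa04340", "hsa04350", "hsa04370", "hsa04510", "hsa04512",
  "hsa04520", "hsa04630", "hsa04915"
)

# gene_symbol plus 26 feature columns
expected_cols <- c(
  "gene_symbol", "metaprediction_score", "noncoding_score", "scna_score",
  "hotspot_double_hit", "phenolyzer_score", kegg_cols
)

# Hypothetical helper: TRUE if every documented column is present
has_expected_columns <- function(df) {
  all(expected_cols %in% colnames(df))
}

# Toy one-row data frame with made-up values, standing in for real output
toy_df <- as.data.frame(
  c(
    list(
      gene_symbol = "TP53",
      metaprediction_score = 0.9,
      noncoding_score = 12.5,
      scna_score = 0.3,
      hotspot_double_hit = TRUE,
      phenolyzer_score = 0.8
    ),
    setNames(as.list(rep(FALSE, length(kegg_cols))), kegg_cols)
  ),
  stringsAsFactors = FALSE
)

has_expected_columns(toy_df)
```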

Examples


path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",
                                package = "driveR")
path2phenolyzer_out <- system.file("extdata/example.annotated_gene_list",
                                   package = "driveR")
features_df <- create_features_df(annovar_csv_path = path2annovar_csv,
                                  scna_df = example_scna_table,
                                  phenolyzer_annotated_gene_list_path = path2phenolyzer_out)
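
The `prep_phenolyzer_input = TRUE` mode described under Arguments can be sketched in the same way. This assumes that `phenolyzer_annotated_gene_list_path` may be omitted in this mode, since no 'phenolyzer' output exists yet at that stage; this page does not state that explicitly. The call is guarded so the snippet also runs where driveR is not installed.

```r
if (requireNamespace("driveR", quietly = TRUE)) {
  path2annovar_csv <- system.file("extdata/example.hg19_multianno.csv",
                                  package = "driveR")
  # Returns a vector of gene symbols instead of the features data frame.
  # Assumption: the phenolyzer path argument is not needed in this mode.
  phenolyzer_genes <- driveR::create_features_df(
    annovar_csv_path = path2annovar_csv,
    scna_df = driveR::example_scna_table,
    prep_phenolyzer_input = TRUE
  )
  head(phenolyzer_genes)
} else {
  # driveR is not available; nothing to compute
  phenolyzer_genes <- NULL
}
```

The resulting gene symbols would then be supplied to 'phenolyzer', and its "annotated_gene_list" output passed back via `phenolyzer_annotated_gene_list_path` in a second call with `prep_phenolyzer_input = FALSE`.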

driveR documentation built on Aug. 19, 2023, 5:12 p.m.