knitr::opts_chunk$set( collapse = TRUE, comment = "#>" ) library(cancereffectsizeR)
cancereffectsizeR
is an R
package that may be used to calculate the effect size of single nucleotide variants (SNV) in cancer exome data[^1]. It was designed for use with datasets in MAF format and works well with data from a large number of tumors, each with a large number of detected variants. A previous version of cancereffectsizeR
depended on output from MutSigCV
(which runs in MATLAB) and can be found here.
This vignette is broken into several sections detailing installation, usage, and a suite of R
functions provided with this package that are useful for analyzing cancer exome data. This user guide covers the version of cancereffectsizeR
used in our JNCI manuscript https://doi.org/10.1093/jnci/djy168 (v0.1.0).
[^1]: Cannataro, V. L., Gaffney, S. G., Townsend, J. P., “Effect sizes of somatic mutations in cancer" JNCI, (2018): https://doi.org/10.1093/jnci/djy168
cancereffectsizeR
utilizes a few bioinformatic R
packages, and thus requires the installation of some dependencies. If you already use R
to bioinformatics, you likely already have these packages installed and you can directly install cancereffectsizeR
in two lines.
install.packages("devtools",repos = "https://cloud.r-project.org") devtools::install_github("Townsend-Lab-Yale/cancereffectsizeR@0.1.0")
However, if you just downloaded R
today, you will need to install the dependencies.
source("https://bioconductor.org/biocLite.R") biocLite("rtracklayer") biocLite("GenomicRanges") biocLite("BSgenome") biocLite("BSgenome.Hsapiens.UCSC.hg19") install.packages("deconstructSigs",repos = "https://cloud.r-project.org") install.packages("devtools",repos = "https://cloud.r-project.org") devtools::install_github("im3sanger/dndscv@0.1.0") devtools::install_github("Townsend-Lab-Yale/cancereffectsizeR@0.1.0")
In this example, we will calculate the effect size of SNV within LGG (low grade glioma) using the TCGA LGG dataset provided at the National Cancer Institute Genomic Data Common. Specifically using the MAF generated with mutect variant calling (NCI UUID 2c0dab06-7af9-4380-bc83-13148298da19). Download and read the MAF into memory.
# MAF files are tab delim and contain 5 rows of header to skip LGG_MAF <- read.delim( file = "../vignettes/TCGA.LGG.mutect.2c0dab06-7af9-4380-bc83-13148298da19.DR-7.0.somatic.maf", header = T,skip = 5,stringsAsFactors = F)
The latest TCGA data release has genomic variants in hg38 coordinates, so we need to convert these to hg19 so that the data is compatable with other packages utilized within cancereffectsizeR
. hg_converter
uses the hg38ToHg19.over.chain file and the rtracklayer::liftOver
function to perform the conversion. Note that this function can convert between other builds with other *.over.chain files.
library(cancereffectsizeR) # load in the package # provide the path to the over.chain file. # I downloaded the file from # <http://hgdownload.cse.ucsc.edu/gbdb/hg38/liftOver/hg38ToHg19.over.chain.gz> LGG_MAF <- hg_converter(chain = "~/Downloads/hg38ToHg19.over.chain", maf_to_convert = LGG_MAF)
The TCGA does a great job documenting their data in a consistent fashion, so this step is optional if just using TCGA data. However, other sources may be less reliable, so these functions are provided in an attempt to get consistent tumor names and tumor allele nucleotides.
LGG_MAF <- unique_tumor_addition_function(MAF.file = LGG_MAF)
LGG_MAF <- tumor_allele_adder(MAF = LGG_MAF)
We only want true single-nucleotide variant events in our data so we can get the best estimate of SNV mutation rate. Variant calling algorithms may mislabel di-nucleotide events as "SNV" and thus we need to remove these data before calculating mutation rate and effect size.
LGG_MAF <- DNP_TNP_remover(MAF = LGG_MAF)
The effect_size_SNV()
function contains the entire pipeline necessary to calculate effect sizes, assuming your data is correctly preprocessed (in hg19 coordinates, no DNP, etc.). The function utilizes deconstructSigs
[^2] and dndscv
[^3], among other freely available R
packages, to first determine the intrinsic mutation rate at all sites in each gene analyzed, and then the effect size given the detected prevalence of the mutation among sequenced tumors.
[^2]: Rosenthal, R., McGranahan, N., Herrero, J., Taylor, B. S., & Swanton, C. (2016). deconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biology, 17(1), 31. https://doi.org/10.1186/s13059-016-0893-4
[^3]: Martincorena, I., Raine, K. M., Gerstung, M., Dawson, K. J., Haase, K., Van Loo, P., … Campbell, P. J. (2017). Universal Patterns of Selection in Cancer and Somatic Tissues. Cell. https://doi.org/10.1016/j.cell.2017.09.042
LGG_selection_output <- effect_size_SNV(MAF_file = LGG_MAF, covariate_file = "lgg_pca", genes_for_effect_size_analysis = c("IDH1","EGFR","TP53","BRAF"))
Output is grouped into four main sections.
LGG_selection_output$selection_output
LGG_selection_output$mutation_rates
LGG_selection_output$trinuc_data
LGG_selection_output$dndscvout
The main output regarding effect size calculations is found within LGG_selection_output$selection_output$all_mutations
. This dataframe contains useful information for each unique molecular variant in the analysis, included selection intensity, mutation rate, frequency of occurrence within the dataset (along with proportion of tumors with the specific variant (Prop_tumors_with_specific_mut
) and proportion of tumors with a mutation in this gene (Prop_of_tumors_with_this_gene_mutated
).
# selection intensity data print(head( LGG_selection_output$selection_output$all_mutations[ order( LGG_selection_output$selection_output$all_mutations$selection_intensity,decreasing = T),], 6), digits = 2)
A more detailed output is found within LGG_selection_output$selection_output$complete_mutation_data
, which breaks down each molecular variants into the individual tumors it was found within.
print(head(LGG_selection_output$selection_output$complete_mutation_data),digits = 2)
$selection_output
also contains information about the last gene analyzed, such as the nucleotide mutation rates at every position...
LGG_selection_output$selection_output$nucleotide_mutation_rates[,1:5]
... and the amino acid mutations at every position.
LGG_selection_output$selection_output$amino_acid_mutation_rates[,1:5]
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.