knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(cancereffectsizeR)

Introduction

cancereffectsizeR is an R package that may be used to calculate the effect size of single nucleotide variants (SNV) in cancer exome data[^1]. It was designed for use with datasets in MAF format and works well with data from a large number of tumors, each with a large number of detected variants. A previous version of cancereffectsizeR depended on output from MutSigCV (which runs in MATLAB) and can be found here.

This vignette is broken into several sections detailing installation, usage, and a suite of R functions provided with this package that are useful for analyzing cancer exome data. This user guide covers the version of cancereffectsizeR used in our JNCI manuscript https://doi.org/10.1093/jnci/djy168 (v0.1.0).

[^1]: Cannataro, V. L., Gaffney, S. G., Townsend, J. P., “Effect sizes of somatic mutations in cancer" JNCI, (2018): https://doi.org/10.1093/jnci/djy168

Installing

cancereffectsizeR utilizes a few bioinformatic R packages, and thus requires the installation of some dependencies. If you already use R to bioinformatics, you likely already have these packages installed and you can directly install cancereffectsizeR in two lines.

install.packages("devtools",repos = "https://cloud.r-project.org")
devtools::install_github("Townsend-Lab-Yale/cancereffectsizeR@0.1.0")

However, if you just downloaded R today, you will need to install the dependencies.

source("https://bioconductor.org/biocLite.R")
biocLite("rtracklayer")
biocLite("GenomicRanges")
biocLite("BSgenome")
biocLite("BSgenome.Hsapiens.UCSC.hg19")
install.packages("deconstructSigs",repos = "https://cloud.r-project.org")
install.packages("devtools",repos = "https://cloud.r-project.org")
devtools::install_github("im3sanger/dndscv@0.1.0")
devtools::install_github("Townsend-Lab-Yale/cancereffectsizeR@0.1.0")

Calculating effect size

In this example, we will calculate the effect size of SNV within LGG (low grade glioma) using the TCGA LGG dataset provided at the National Cancer Institute Genomic Data Common. Specifically using the MAF generated with mutect variant calling (NCI UUID 2c0dab06-7af9-4380-bc83-13148298da19). Download and read the MAF into memory.

# MAF files are tab delim and contain 5 rows of header to skip
LGG_MAF <- read.delim(
  file = "../vignettes/TCGA.LGG.mutect.2c0dab06-7af9-4380-bc83-13148298da19.DR-7.0.somatic.maf",
  header = T,skip = 5,stringsAsFactors = F)

Preprocessing

Converting hg38 to hg19

The latest TCGA data release has genomic variants in hg38 coordinates, so we need to convert these to hg19 so that the data is compatable with other packages utilized within cancereffectsizeR. hg_converter uses the hg38ToHg19.over.chain file and the rtracklayer::liftOver function to perform the conversion. Note that this function can convert between other builds with other *.over.chain files.

library(cancereffectsizeR) # load in the package 

# provide the path to the over.chain file. 
# I downloaded the file from 
# <http://hgdownload.cse.ucsc.edu/gbdb/hg38/liftOver/hg38ToHg19.over.chain.gz>
LGG_MAF <- hg_converter(chain = "~/Downloads/hg38ToHg19.over.chain",
                        maf_to_convert = LGG_MAF)

Adding columns with tumor sample identifier data and tumor allele data.

The TCGA does a great job documenting their data in a consistent fashion, so this step is optional if just using TCGA data. However, other sources may be less reliable, so these functions are provided in an attempt to get consistent tumor names and tumor allele nucleotides.

LGG_MAF <- unique_tumor_addition_function(MAF.file = LGG_MAF)
LGG_MAF <- tumor_allele_adder(MAF = LGG_MAF)

Removing DNV and TNV

We only want true single-nucleotide variant events in our data so we can get the best estimate of SNV mutation rate. Variant calling algorithms may mislabel di-nucleotide events as "SNV" and thus we need to remove these data before calculating mutation rate and effect size.

LGG_MAF <- DNP_TNP_remover(MAF = LGG_MAF)

Calculating cancer effect size

The effect_size_SNV() function contains the entire pipeline necessary to calculate effect sizes, assuming your data is correctly preprocessed (in hg19 coordinates, no DNP, etc.). The function utilizes deconstructSigs[^2] and dndscv[^3], among other freely available R packages, to first determine the intrinsic mutation rate at all sites in each gene analyzed, and then the effect size given the detected prevalence of the mutation among sequenced tumors.

[^2]: Rosenthal, R., McGranahan, N., Herrero, J., Taylor, B. S., & Swanton, C. (2016). deconstructSigs: delineating mutational processes in single tumors distinguishes DNA repair deficiencies and patterns of carcinoma evolution. Genome Biology, 17(1), 31. https://doi.org/10.1186/s13059-016-0893-4

[^3]: Martincorena, I., Raine, K. M., Gerstung, M., Dawson, K. J., Haase, K., Van Loo, P., … Campbell, P. J. (2017). Universal Patterns of Selection in Cancer and Somatic Tissues. Cell. https://doi.org/10.1016/j.cell.2017.09.042

LGG_selection_output <- effect_size_SNV(MAF_file = LGG_MAF,
                                        covariate_file = "lgg_pca",
                                        genes_for_effect_size_analysis =
                                          c("IDH1","EGFR","TP53","BRAF"))

Interpreting results

Output is grouped into four main sections.

Selection intensity summary

The main output regarding effect size calculations is found within LGG_selection_output$selection_output$all_mutations. This dataframe contains useful information for each unique molecular variant in the analysis, included selection intensity, mutation rate, frequency of occurrence within the dataset (along with proportion of tumors with the specific variant (Prop_tumors_with_specific_mut) and proportion of tumors with a mutation in this gene (Prop_of_tumors_with_this_gene_mutated).

# selection intensity data

print(head(
  LGG_selection_output$selection_output$all_mutations[
    order(
      LGG_selection_output$selection_output$all_mutations$selection_intensity,decreasing = T),],
  6),
  digits = 2)

A more detailed output is found within LGG_selection_output$selection_output$complete_mutation_data, which breaks down each molecular variants into the individual tumors it was found within.

print(head(LGG_selection_output$selection_output$complete_mutation_data),digits = 2)

$selection_output also contains information about the last gene analyzed, such as the nucleotide mutation rates at every position...

LGG_selection_output$selection_output$nucleotide_mutation_rates[,1:5]

... and the amino acid mutations at every position.

LGG_selection_output$selection_output$amino_acid_mutation_rates[,1:5]


Townsend-Lab-Yale/cancereffectsizeR documentation built on April 28, 2024, 6:14 p.m.