In J0bbie/R2CPCT: Misc. functions to perform import and analysis of WGS and RNA-Seq samples

knitr::opts_chunk$set(echo = TRUE)

Generation of the discovery-set of genes.

To associate mutations and other somatic aberrations to genes, we employed the comprehensive gene annotations from GENCODE (release 38; GRCh37) which we manually curated by removing genes classified as pseudogenes (except polymorphic pseudogenes), nonsense-mediated decay and/or to-be-experimentally-confirmed. In addition, clone-based genes (starting with AC[0-9], AP00 and RP11- within their gene symbols) were also removed and only retained genes with verified or manually-annotated loci (level 1 and 2).

We next determined the superset of non-overlapping exonic elements per gene based on the per-gene-associated transcripts, except those classified as pseudogenes (except polymorphic pseudogenes), nonsense-mediated decay, to-be-experimentally-confirmed, retained intron and non-step decay transcripts; i.e. if two transcripts (isoforms) are known for a single gene and they share 4 exons and each have one distinct non-overlapping unique exon, they will have in total 6 non-overlapping exons. Based on this non-overlapping exons-per-gene list, we only retained genes with at least one exon (excluding genes which did not pass the transcript-criteria or do not (yet) have any known exon). Gene aliases were converted to common gene symbols.

In total, this retained r dplyr::n_distinct(R2CPCT::GENCODE.v38$ENSEMBL) genes containing distinct ENSEMBL gene-identifiers, with the vast majority being protein-coding (n = r sort(table(R2CPCT::GENCODE.v38$gene_type), T)[[1]]) and lncRNA (n = r sort(table(R2CPCT::GENCODE.v38$gene_type), T)[[2]]) genes.

Aggregation of a compendium of known cancer-associated genes.

A custom list of driver genes was generated by combining multiple data-sources containing genes which were previously annotated as being driver- or otherwise cancer-associated (e.g. tumor suppressor genes and (proto-)oncogenes).

We combined and curated the following data-sources: COSMIC - Cancer Gene Census (v92), Priestley et al. (2020; mutations and copy-number alterations), Dietlein et al. (2020), Martincorena et al. (2017), IntOGen Compendium Cancer Genes (02-2020) and Bailey et al. (2018). In addition, we supplemented this list with the following genes: SOX11, ZEB1, ZEB2, TWIST1, TWIST2, SNAI1, SNAI2 and IKZF2.

Gene aliases were converted to common gene symbols and the corresponding ENSEMBL identifier and genomic location was retrieved through merging with GENCODE annotation (release 38), incomplete cases were further annotated via biomaRt (vr packageVersion('biomaRt')) using the GRCh37 cache.

This combined list (n = r dplyr::n_distinct(R2CPCT::driverList$SYMBOL)) was used to assess whether a genes were previously associated with cancer, any novel genes which were detected as significantly perturbed or cancer-associated (see M&M) were manually curated to establish any previous associations to disease by investigating the relevant literature.

Annotation and import of the mutational consequences of somatic mutations.

Somatic mutations (SNVs, InDels and MNVs), as determine by the Hartwig Medical Foundation pipeline (HMF-Pipeline5), were further annotated by VEP (v104) using GENCODE annotations (release 38) and two additional plugins; CADD (v1.6) and FATHMM_MKL (Nov. 2020). Required data-files for CADD and FATHMM_MKL were downloaded (GRCh37) and processed as per the author instructions. In addition, the gnoMAD exome and genome databases (v2.1.1) were used to annotate the allele frequencies for any of the present gnoMAD populations and the ClinVar database (27-09-2021) was used to annotate known associations. For each genomic mutation, VEP was set to only pick and annotate the most severe mutational consequence of all overlapping genes (--pick), i.e. only one consequence in a single gene even if multiple genes overlap (with less severe consequence or otherwise lower in the ordered set of picking criteria).

Somatic mutations were imported by the VariantAnnotation package (vr packageVersion('VariantAnnotation')) and subsequently filtered on the following thresholds, retaining only events which passed all default (Strelka) filters (PASS-only), were not present in >5 samples of the HMF panel-of-normals (PON) and which were below a max. allele frequency (for any population) of 0.001 in the gnoMAD exome and 0.005 in the gnoMAD genome databases.

Import and handling of the structural variants.

Structural variants (as detected and processed by PURPLE/GRIDSS) were imported using the StructuralVariantAnnotation package (vr packageVersion('StructuralVariantAnnotation')). Both partnered and un-partnered (i.e. single break-ends) structural variants were imported if they passed on default filtering QC (PASS-only). Structural variants were classified into the following categories:

TRA (translocation): chromosomes of the partnered break-ends did not match (i.e. chr1 and chr2).
INV (inversion): Strands of the partnered break-ends are equal (head-to-head (h2h) or tail-to-tail (t2t)).
INS (insertion): An inserted nucleotide sequence larger than 70% of the total width of the structural variant was detected.
DEL (deletion): The start of the upper break-end is located upstream of the lower break-end.
SINGLE (single): If the event only has one detected break-end and thus no partner.
DUP (tandem duplication): If not in any of the other categories.

When calculating the number of structural variants per sample, only unique events were counted (i.e. partnered break-ends are counted once and single events are counted once).

Assessment of homologous-repair deficiency.

We employed CHORD (vr packageVersion('CHORD')) to assess whether an sample showed evidence of genomic scarring and other features related to homologous-repair deficiency (HRD). Per sample, we inputted the processed SNV, InDels and SV into CHORD using the BSgenome.Hsapiens.UCSC.hg19 (vr packageVersion('BSgenome.Hsapiens.UCSC.hg19')) as reference genome. We maintained the default HRD-threshold of ≥0.5 and default min. filtering criteria to determine HRD-status.

Aggregation of sample-specific gene-level mutational consequences.

Per sample and for each gene (GENCODE release 38), we aggregated all protein-coding and splicing mutational consequences as assessed by VEP (i.e. missense, frame-shift, splice acceptor variant etc.) except synonymous mutations. If a gene harbored multiple mutations, this was classified as 'Multiple Mutations'. The predicted variant impact (unknown, modifier, low, medium or high), dbSNP (v153) and COSMIC (v90) identifiers and tumor-corrected alelle frequency (TAF) was also stored within this aggregation.

Detection of sample-specific driver events and putative gene-fusions.

We performed LINX (v1.11) using ENSEMBL annotations (v101; similar to GENCODE v38) using default settings to group together structural variants (as detected by HMF-Pipeline5 / GRIDSS) to determine major events (LOH, deletion, disruption and amplification events) overlapping known driver-genes (as defined by HMF[ref]. Furthermore, putative fusion-genes were detected by LINX as derived from the structural variants.

In addition, the driver catalog as detected by PURPLE (on an a priori panel of genes) was appended to the mutational overview.

Cohort-wide detection of genes enriched with somatic mutations (dN/dS).

We applied the dN/dS method using the dndscv package (vr packageVersion('dndscv')) with a custom database based on ENSEMBL coding-sequences (v101; similar to GENCODE v38). We inputted all (retained) SNVs, InDels and MNVs using default settings.

Genes with either the qglobal_cv, qallsubs_cv, qtrunc_cv, qmis_cv, qall_loc or qmis_loc ≤ 0.05 were classified as being significant for positive or negative enrichment.

Assessment of sample-specific gene-level copy-number alterations.

Per sample, we overlapped the copy-number segments (as detected by PURPLE) with the non-overlapping exonic elements of each gene (min. 10bp overlap) and used the purity-corrected absolute copy-number estimates to determine the mean purity-corrected absolute copy-number of each gene, averaging on a per-base resolution. We also noted the mean purity-corrected absolute copy-number of each individual exon per gene to determine whether only a portion (i.e. 3 out of 30 exons) were perturbed.

We used the following thresholds to the denote copy-number alteration states of genes and their individual exons (where CN is the mean absolute copy-number estimate of the element and p is the genome-wide ploidy of the respective sample):

CN ≤ max(0.75 | (p / 3)): Deep Deletion
(p / 3) < CN ≤ (p / 1.5): Deletion
(p / 1.5) < CN < (p x 1.5): Neutral
(p x 1.5) ≤ CN < (p x 3): Amplification
CN ≥ min(15 | p x 3): Deep Amplification

In the case of haploid chromosomes in male samples, p was set to p = p - 1.

Detection of recurrent copy number aberrations (SCNAs)

We performed GISTIC2 analysis (v2.0.23) on the PURPLE-derived copy-number segments using tumor purity-corrected absolute copy-numbers as input (log2-transformed - 1, i.e. diploid is set to zero); haploid chromosomes in male samples were corrected by adding a pseudo-count (of 1) prior to log2-transformation. Segments with <-10 (log2-transformed) were set to -10.

GISTIC2 (v2.0.23) was performed using the following settings with default GRCh37 annotations (hg19.mat):

-genegistic 1 -gcm extreme -maxseg 4000 -broad 1 -brlen 0.98 -conf 0.95 -rx 0 -cap 3 -saveseg 0 -armpeel 1 -smallmem 0 -res 0.01 -ta 0.3 -td 0.3 -savedata 0 -savegene 0 -qvt 0.1

GISTIC2 output was imported and re-annotated using GENCODE annotations (v38; min. 10bp overlap) thereby using the wide-peak limits of the recurrent copy-number peaks (q ≤ 0.1) to classify the region containing the likely target(s) of the copy-number event.

Genes were annotated to GISTIC2 peaks (q ≤ 0.1) based on the following strategy;

All overlapping genes (min. 10bp) were assigned to the each GISTIC2 peak.
If multiple genes overlap a GISTIC2 peak, genes present within the combined driver list will be used to annotate that peak. E.g. if a GISTIC2 peak overlapped both MYC a and near-adjacent non-driver gene, only MYC would be chosen as possible target.
If no overlapping genes could be found, GISTIC2 peaks were annotated with the nearest GENCODE (v38) protein-coding gene.

The peak amplitude thresholds were used to represent the presence (or absence) of the observed GISTIC2 peak within each respective sample; Low amplitude (t > -0.3), Med. amplitude (-0.3 > t > -1.3) and High amplitude (t < -1.3).

Mutational signatures analysis.

Mutational signatures analysis was performed using the MutationalPatterns package (r packageVersion('MutationalPatterns')). Sample-specific signature refitting was done by finding the optimal contribution of the COSMIC signatures (v3.1; excluding sequencing artifacts) for SNV, DBS and InDels. Signature refitting was also subjected to bootstrapping (1000 iterations) and sample-wise refitted signatures with non-robust bootstrapping results (on average ≥1SD over all bootstrapping iterations) were classified as 'non-robust' within the respective sample.

We next also determined de novo mutational signatures using the Bayesion NMF method as implemented by the MutationalPatterns package.

J0bbie/R2CPCT documentation built on Feb. 24, 2022, 8:15 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

J0bbie/R2CPCT
Misc. functions to perform import and analysis of WGS and RNA-Seq samples

In J0bbie/R2CPCT: Misc. functions to perform import and analysis of WGS and RNA-Seq samples

Generation of the discovery-set of genes.

Aggregation of a compendium of known cancer-associated genes.

Annotation and import of the mutational consequences of somatic mutations.

Import and handling of the structural variants.

Assessment of homologous-repair deficiency.

Aggregation of sample-specific gene-level mutational consequences.

Detection of sample-specific driver events and putative gene-fusions.

Cohort-wide detection of genes enriched with somatic mutations (dN/dS).

Assessment of sample-specific gene-level copy-number alterations.

Detection of recurrent copy number aberrations (SCNAs)

Mutational signatures analysis.

R Package Documentation

Browse R Packages

We want your feedback!

J0bbie/R2CPCT Misc. functions to perform import and analysis of WGS and RNA-Seq samples

In J0bbie/R2CPCT: Misc. functions to perform import and analysis of WGS and RNA-Seq samples

Generation of the discovery-set of genes.

Aggregation of a compendium of known cancer-associated genes.

Annotation and import of the mutational consequences of somatic mutations.

Import and handling of the structural variants.

Assessment of homologous-repair deficiency.

Aggregation of sample-specific gene-level mutational consequences.

Detection of sample-specific driver events and putative gene-fusions.

Cohort-wide detection of genes enriched with somatic mutations (dN/dS).

Assessment of sample-specific gene-level copy-number alterations.

Detection of recurrent copy number aberrations (SCNAs)

Mutational signatures analysis.

R Package Documentation

Browse R Packages

We want your feedback!

J0bbie/R2CPCT
Misc. functions to perform import and analysis of WGS and RNA-Seq samples