context_potential_damage_analysis: Potential damage analysis for the supplied mutational...
In MutationalPatterns: Comprehensive genome-wide analysis of mutational processes

Description Usage Arguments Details Value Examples

View source: R/context_potential_damage_analysis.R

The ratio of possible 'stop gain', 'mismatches', 'synonymous mutations' and 'splice site mutations' is counted per mutational context. This is done for the supplied ENTREZ gene ids. This way it can be determined how damaging a mutational context could be. N gives the total number of possible mutations per context.

context_potential_damage_analysis(
  contexts,
  txdb,
  ref_genome,
  gene_ids,
  verbose = FALSE
)

`contexts`	Vector of mutational contexts to use for the analysis.
`txdb`	Transcription annotation database
`ref_genome`	BSGenome reference genome object
`gene_ids`	Entrez gene ids
`verbose`	Boolean. Determines whether progress is printed. (Default: FALSE)

The function works by first selecting the longest transcript per gene. The coding sequence (cds) of this transcript is then assembled. Next, the function loops over the reference contexts. For each context (and it's reverse complement), all possible mutation locations are determined. Splice site mutations are removed at this stage. It's also determined whether these locations are the first, second or third base of the cds codon (mut loc). Each unique combination of codon and mut loc is then counted. For each combination the reference amino acid and the possible alternative amino acids are determined. By comparing the reference and alternative amino acids, the number of 'stop_gains', 'mismatches' and 'synonymous mutations' is determined. This is then normalized per mutation context. For example, mutations with the ACA context could be located in the third position of a codon like TAC. This might happen 200 times in the supplied genes. This TAC codon could then be mutated in either a TAA, TAG or a TAT. The first two of these options would induce a stop codon, while the third one would be synonymous. By summing up all codons the number of stop_gains', 'mismatches' and 'synonymous mutations' is determined per mutation context.

For mismatches the blosum62 score is also calculated. This is a score based on the BLOSUM62 matrix, that describes how similar two amino acids are. This score is normalized over the total amount of possible mismatches. A lower score means that the amino acids in the mismatches are more dissimilar. More dissimilar amino acids are more likely to have a detrimental effect.

To identify splice sites, sequences around the splice locations are used instead of the cds. The 2 bases 5' and 2 bases 3' of a splice site are considered to be splice site mutation locations.

A tibble with the ratio of 'stop gain', 'mismatch', 'synonymous' and 'splice site' mutations per mutation context.

## See the 'mut_matrix()' example for how we obtained the
## mutation matrix information:
mut_mat <- readRDS(system.file("states/mut_mat_data.rds",
  package = "MutationalPatterns"
))

contexts <- rownames(mut_mat)

## Load the corresponding reference genome.
ref_genome <- "BSgenome.Hsapiens.UCSC.hg19"
library(ref_genome, character.only = TRUE)

## Load the transcription annotation database
## You can obtain the database from the UCSC hg19 dataset using
## Bioconductor:
# BiocManager::install("TxDb.Hsapiens.UCSC.hg19.knownGene")
library("TxDb.Hsapiens.UCSC.hg19.knownGene")
txdb <- TxDb.Hsapiens.UCSC.hg19.knownGene

## Here we will use the Entrez Gene IDs from several cancer
## genes. In practice you might want to use larger gene lists,
## but here we only use a few to keep the run-time low.
## In this example we are using:
## TP53, KRAS, NRAS, BRAF, BRCA2, CDKN2A, ARID1A, PTEN and TERT
gene_ids <- c(7157, 3845, 4893, 673, 675, 1029, 8289, 5728, 7015)

## Run the function
context_potential_damage_analysis(contexts, txdb, ref_genome, gene_ids)

## The function can provide updates about its progress.
## This can be usefull when it's running slowly,
## which can happen when you are using many gene_ids.
context_potential_damage_analysis(contexts, txdb, ref_genome, gene_ids, verbose = TRUE)