```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
This package implements an entirely novel approach for the detection and extraction of mutational processes based on Yun Feng's work (convSig project, Oxford University PhD).
Cancer patients have distinct mutational patterns in their genomes, so detecting, extracting, and characterising such patterns can provide a useful source of information for personalised cancer treatment. The methods implemented in this package make use of convolutional filtering techniques found in computer vision. Our model is centred around the concept of each mutational process being represented as a filter, which slides over the entire genome of interest and probabilistically induces mutations in genetic sequences that it has a strong affinity with. Our methods extract and characterise these processes using two variations of expectation maximisation (i.e. EM).
The EM methods (ReLU and exponential transformations) both require mutational genome background counts and the activity rate of mutational processes. In order to extract those assets from your mutation input file, we need to apply a couple of functions. Please note that this section relates to the ICGC pipeline (i.e. International Cancer Genome Consortium), whose input must be multi-sampled and must conform to the following format (use loadICGCexample() for a more complete example):
```r
library(convSig)
library(data.table)
library(kableExtra)
library(dplyr)

setwd("..")
loadICGCexample()
res <- as.data.table(example_mutation_dataset)
res <- res[, c('icgc_sample_id', 'chromosome', 'chromosome_start',
               'chromosome_end', 'mutated_from_allele', 'mutated_to_allele',
               'assembly_version', 'sequencing_strategy')]
res <- res[c(1:5), ]
res %>% kable("html") %>% kable_styling(font_size = 9)
```
To find other such mutation files, please visit the online ICGC data portal (dcc.icgc.org), navigate to DCC DATA RELEASE, select a project, and download any file which contains "simple_somatic_mutation" in its name.
The icgc2mut function call performs initial formatting of the mutation data for the ICGC pipeline. More precisely, it skips all of the X and Y chromosomes, gathers all of the mutations that pertain to experiments run with the user-specified assembly version and sequencing strategy, and applies a base-to-number conversion to help filter out invalid types of mutations such as insertions and deletions.
```r
datapath <- "simple_somatic_mutation.open.COCA-CN.tsv"
mut_file <- icgc2mut(datapath, assembly = "GRCh37", Seq = "WGS")
```
The ICGC mutation file can either be the path to the mutation file on your workstation or data previously loaded into your R global environment (as a matrix or data.frame). The assembly argument specifies the reference genome version that corresponds to the relevant experiment, and Seq specifies the sequencing strategy used to obtain it.
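For instance (a minimal sketch reusing the assembly and sequencing strategy from above), the example dataset loaded earlier can be passed in directly instead of a file path:

```r
# Passing already-loaded data instead of a file path
mut_file <- icgc2mut(example_mutation_dataset,
                     assembly = "GRCh37", Seq = "WGS")
```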
The icgc_curate() function call completes the formatting of the mutation data.
```r
cur_mut_file <- icgc_curate(mut_file, remove.nonSNP = TRUE)
```
The input is first sorted in ascending order by chromosome number and then by the mutation chromosome start position. As an optional step (using remove.nonSNP = TRUE), it removes all non-single nucleotide polymorphisms. Finally, all duplicate mutations present in the data are removed.
The mut_count() function constructs from the formatted mutation data two essential tables, the mutation rate and genome background count tables, which are required for inference via expectation maximisation.
```r
assembly <- "Homo_sapiens.GRCh37.dna.primary_assembly.fa"
EMu_prepped <- mut_count(assembly, cur_mut_file, five = TRUE)
```
Please note that this function call requires you to supply it with the correct corresponding reference genome. The function also takes a parameter (five) that indicates whether you wish to construct the result tables for a 3-base long or a 5-base long mutation signature window. By default, the call runs for a 3-base long window.
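For example, to build the tables for the default 3-base long window you would simply omit five (a sketch assuming the default value of five selects the 3-base window):

```r
# 3-base long window (the default); EMu_prepped_3 is just an illustrative name
EMu_prepped_3 <- mut_count(assembly, cur_mut_file)
```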
The mutational genome background counts and the activity rate of mutational processes can also be obtained from a multi-sampled VCF file. The vcf2mut function call performs complete formatting on a user-specified VCF input file. It then constructs from the formatted mutation data the same two essential tables, the mutation rate and genome background count tables.
```r
assembly <- "Homo_sapiens.GRCh37.dna.primary_assembly.fa"
mut_file <- "multisampled_mutations.vcf"
EMu_prepped <- vcf2mut(mut_file, geno = "GT", assembly, five = TRUE)
```
geno is a string indicating the designation of the ID field within the FORMAT column which specifies the 'Genotype' value. This allows the function to count the number of mutations per chromosome location and per sample.
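As an illustration (a hypothetical VCF excerpt, not shipped with the package), the FORMAT column might read GT:DP, in which case geno = "GT" designates the genotype field of each sample column:

```r
# Hypothetical multi-sampled VCF excerpt (comments only, for illustration):
# #CHROM  POS     ID  REF  ALT  QUAL  FILTER  INFO  FORMAT  SAMPLE1  SAMPLE2
# 1       874816  .   C    T    .     PASS    .     GT:DP   0/1:35   0/0:28
# With geno = "GT", the genotype values (e.g. 0/1) are read from the field
# labelled "GT" in FORMAT for each sample, which is what allows per-sample
# mutation counting.
```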
The mutation rate table is a matrix which records the frequency of each mutation type for each sample present in the supplied mutation input data. Each of its rows therefore represents a sample and each of its columns represents a distinct mutation type/possibility. The number of possible mutation types is calculated in accordance with the length of the mutation signature window (i.e. 3 or 5 bases only for our methods). The formula used is $4^{numbase - 1} \times 2 \times 3$ (where numbase is equal to the length of the window). We first compute $4^{numbase - 1} \times 2$ because each flanking base in the window (i.e. each non-middle base) has exactly 4 base possibilities, while the middle base only has 2 possibilities. The latter is due to the reverse-complementary nature of the genome; this avoids double counting the mutations that have actually occurred. We then multiply the intermediate result by 3 because there are three possibilities for the middle base substitution.
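As a quick numeric check of this formula (n_types is just an illustrative helper, not a function exported by the package):

```r
# Number of possible mutation types for a signature window of length numbase
n_types <- function(numbase) 4^(numbase - 1) * 2 * 3

n_types(3)  # 96 mutation types for a 3-base window
n_types(5)  # 1536 mutation types for a 5-base window
```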
The genome background count table is a vector with length equal to the total possible number of mutation types as calculated above. It records the frequency we would expect for each mutation type under the assumption that all such mutations occur with equal probability.
The ReLU and the Exponential transformations both enable the learning of key parameters via a novel expectation maximisation (EM) technique subject to a negative log-likelihood loss function. The parameters whose estimation is sought are a Feature matrix (containing the weights of the mutational filters), a P matrix (containing the mutational rate for each mutational process in each sample), and a mat matrix (containing the probabilities for all mutation types; take T mutating to C as an instance of a mutation type).
The Feature matrix has $(numbase \times 4) - 2$ rows. This number covers all possible bases at each base index in a window fragment: the 4 possible bases that pertain to a single flanking base index are represented as 4 contiguous rows in the matrix, while the middle base contributes only 2 rows (again because of reverse complementarity), hence the subtraction of 2. The Feature matrix also has the same number of columns as the number of mutational processes indicated by the user.
The P matrix has as many rows as the number of samples present in your input data. It also contains the same number of columns as the number of mutational processes specified by the user.
The mat result is a 3-dimensional array with dimensions (2, 3, number of mutational processes). The first dimension equals 2 because there are only two distinct original bases possible pre-mutation; that is again because of the reverse-complementary nature of the genome. The second dimension equals 3 because there are only three possible mutated bases for each of the 2 original bases.
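Putting these shapes together, a minimal sketch of the dimensions you could expect for a 5-base window and K = 5 mutational processes (matching the calls below):

```r
# Expected result dimensions for a 5-base window (numbase = 5) and K = 5 processes
numbase <- 5
K <- 5

(numbase * 4) - 2   # 18: rows of the Feature matrix (it has K = 5 columns)
c(2, 3, K)          # dimensions of the mat array
# The P matrix has one row per sample and K = 5 columns.
```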
The function calls below require the data output by the previous function call (i.e. mut_count()), a parameter specifying whether the transformation should consider a 3-base long or a 5-base long window, and a parameter indicating the number of mutational processes to extract from the pre-processed input data.
```r
relu_res <- relu_transform(EMu_prepped, five = TRUE, K = 5)
exp_res <- exp_transform(EMu_prepped, five = TRUE, K = 5)
```