mutSigExtractor is an R package for extracting mutation contexts from vcf files. Extraction can be performed for the following mutation types:
Signatures can also be extracted from these contexts for SNVs, indels, DBSs (COSMIC/PCAWG), as well as for SVs (Nik-Zainal et al. 2016).
mutSigExtractor requires some bioconductor packages to first be installed.
## Bioconductor packages required by mutSigExtractor install.packages('BiocManager') BiocManager::install('BSgenome') ## Install genome parser BiocManager::install('BSgenome.Hsapiens.UCSC.hg19') ## Install the default genome BiocManager::install('GenomeInfoDb') ## Install mutSigExtractor directly from github using devtools install.packages("devtools") devtools::install_github('https://github.com/UMCUGenetics/mutSigExtractor/')
Other genomes (e.g. for hg38: BSgenome.Hsapiens.UCSC.hg38
) can also be used. Please see the below
tutorial for details.
The COLO829v003T.purple.somatic.vcf.gz
and COLO829v003T.purple.sv.vcf.gz
files in doc/vcf/
will be used to demonstrate how to use mutSigExtractor.
pkg_dir <- '/Users/lnguyen/hpc/cuppen/projects/P0013_WGS_patterns_Diagn/CHORD/processed/scripts_main/mutSigExtractor/' devtools::load_all(pkg_dir) vcf_snv <- paste0(pkg_dir, '/doc/vcf/COLO829v003T.purple.somatic.vcf.gz') vcf_sv <- paste0(pkg_dir,'/doc/vcf/COLO829v003T.purple.sv.vcf.gz')
setwd('/path/to/mutSigExtractor') vcf_snv <- 'doc/vcf/COLO829v003T.purple.somatic.vcf.gz' vcf_sv <- 'doc/vcf/COLO829v003T.purple.sv.vcf.gz'
The main functions for extracting signatures are:
extractSigsSnv() extractSigsDbs() extractSigsIndel() extractSigsSv()
Note that SNVs, DBSs and indels are often reported in the same vcf file. Therefore,
extractSigsSnv()
,extractSigsDbs()
, and extractSigsIndel()
will automatically select the
relevant mutation types.
Here, extraction of SNV contexts and signatures on one sample using extractSigsSnv()
will be
demonstrated. The same concepts shown here can be applied to the other mutation type functions.
With the below code we can extract the 96 trinucleotide contexts. This returns a single column matrix of mutation counts per context.
Why is the output not simply a vector? This is so that the output can be written to a txt file,
which is useful when processing a large number of samples on an HPC. When processing multiple
samples locally, one can simply use cbind()
combine the context counts from all samples into one
matrix.
Note that, it is recommended that the vcf.filter
argument is set to 'PASS' (or '.' for certain vcf
files) to remove low quality variants.
contexts_snv <- extractSigsSnv(vcf.file=vcf_snv, vcf.filter='PASS', output='contexts') head(contexts_snv)
To extract signatures, we can then fit these contexts to e.g. the COSMIC or PCAWG signature profiles.
These are included in the package as SBS_SIGNATURE_PROFILES_V2
and SBS_SIGNATURE_PROFILES_V3
respectively. For this example we will use the PCAWG profiles (SBS_SIGNATURE_PROFILES_V3
).
SBS_SIGNATURE_PROFILES_V3[1:5,1:5]
Signature fitting is done using fitToSignatures()
This function uses the non-negative linear least
squares algorithm based on lsqnonneq()
from the pracma
package. mut.context.counts
can be a numeric
vector, matrix, or dataframe. If a matrix/dataframe, rows represent samples and columns represent
contexts.
sigs_snv <- fitToSignatures( mut.context.counts=contexts_snv[,1], signature.profiles=SBS_SIGNATURE_PROFILES_V3 ) head(sigs_snv)
Alternatively, signatures can be extracted directly from the vcf by specifying output='signatures'
in extractSigsSnv()
sig_snv_2 <- extractSigsSnv( vcf.file=vcf_snv, vcf.filter='PASS', output='signatures', signature.profiles=SBS_SIGNATURE_PROFILES_V3 ) head(sig_snv_2)
The context extractions for mutation types other than SNVs are shown below.
contexts_dbs <- extractSigsDbs(vcf.file=vcf_snv, vcf.filter='PASS', output='contexts') head(contexts_dbs)
For indels, extractSigsIndel()
defaults to method='CHORD'
which is used by Classifier of
HOmologous Recombination Deficiency (CHORD). When method='CHORD'
, contexts are always extracted
(i.e. no output
argument).
contexts_indel <- extractSigsIndel(vcf.file=vcf_snv, vcf.filter='PASS') head(contexts_indel)
However, the PCAWG indel contexts can also be extracted by setting method='PCAWG'
. Here, output
can be 'signatures'
to directly extract the PCAWG indel signatures.
contexts_indel <- extractSigsIndel(vcf.file=vcf_snv, vcf.filter='PASS', method='PCAWG', output='contexts') head(contexts_indel)
SV vcf files generally do not adhere to one standard. extractSigsSv()
currently supports SV vcf
parsing of GRIDSS (conforms to vcf spec 4.2) and Manta vcfs. You can specify the vcf type
with sv.caller='gridss'
(default) or sv.caller='manta'
.
contexts_sv <- extractSigsSv(vcf.file=vcf_sv, vcf.filter='PASS', output='contexts', sv.caller='gridss') head(contexts_sv)
In case you want to use SV vcfs from other callers, you can provide a dataframe as input to
extractSigsSv()
containing SV type and length info. See below for more info.
Alternatively, dataframes can be used as input, which is handy if you want to parse the vcfs yourself, or have inputs in tabular format.
For SNVs and indels, a dataframe with the columns: chrom, pos, ref, alt.
## CHROM POS REF ALT ## 1 1 16145827 A C ## 2 1 16492085 G C ## 3 1 17890303 C G ## 4 1 18877885 G A ## 5 1 18919776 T C
For SVs, a dataframe with the columns: sv_type, sv_len. For sv_type, values must be DEL, DUP, INV, TRA (deletions, duplications, inversions, translocations). For translocations, sv_len information is discarded. The column names themselves do not matter, as long as the columns are in the aforementioned order.
## sv_type sv_len ## 1 TRA NA ## 2 DEL 1696 ## 3 DEL 22644 ## 4 DUP 1703 ## 5 DEL 1789 ## 6 DEL 49256
These dataframe can be provided to the extractSigs* functions using the argument df
. For example:
extractSigsSv(df=sv_dataframe, vcf.filter='PASS', output='contexts', sv.caller='gridss')
Below is a summary of the signature profiles pre-loaded within mutSigExtractor
SBS_SIGNATURE_PROFILES_V2
: Original 30 SNV signaturesSBS_SIGNATURE_PROFILES_V3
: PCAWG SNV signaturesINDEL_SIGNATURE_PROFILES
: PCAWG indel signaturesDBS_SIGNATURE_PROFILES
: PCAWG DBS signaturesSV_SIGNATURE_PROFILES
: SV signatures from Nik-Zainal et al. 2016
with clustered SV information removedA different reference genome than the default (BSgenome.Hsapiens.UCSC.hg19) can be used. Genomes
should be BSgenomes. The variable name (i.e. no quotes) of the BSgenome object is specified to
ref.genome
.
## Make sure to install and load the desired ref genome first install.packages('BiocManager') BiocManager::install('BSgenome.Hsapiens.UCSC.hg38') ## Non-default genomes need to be explicitly loaded. The default (BSgenome.Hsapiens.UCSC.hg19) ## is automatically loaded. library(BSgenome.Hsapiens.UCSC.hg38) ## Specify the name of the BSgenome object to the ref.genome argument extractSigsSnv( vcf.file='/path/to/vcf/with/snvs_and_indels/', vcf.filter='PASS', output='contexts', ref.genome=BSgenome.Hsapiens.UCSC.hg38 )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.