---
title: "Introduction to Excerno"
author: "Audrey Mitchell, Marco Ruiz, Soua Yang"
date: "2022-08-17"
output:
  rmarkdown::html_vignette:
    keep_md: true
vignette: >
  %\VignetteIndexEntry{excerno-intro}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---
# library(excerno)
Formalin-Fixation Paraffin-Embedding (FFPE) is a preservation technique for cancer tissue samples that introduces novel, artifactual mutations. Leveraging the known mutational signature of FFPE and the mutational signatures from the Catalogue of Somatic Mutations in Cancer (COSMIC) library, we set out to classify and filter FFPE artifacts. Our method uses non-negative matrix factorization (via the MutationalPatterns R package) and Bayes' formula to calculate the probability that each mutation in a sample was caused by FFPE. Our methods are implemented in this package, `excerno`.
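As a hedged illustration of how Bayes' formula can be applied in this setting (the notation is introduced here for illustration and is not taken from the package): let $m$ be one of the 96 mutation types, let $P(m \mid s)$ be the probability of $m$ under signature $s$, and assume the prior for each signature is its estimated relative contribution $c_s$ to the sample. The probability that a mutation of type $m$ came from signature $s$ is then:

```latex
P(s \mid m) = \frac{c_s \, P(m \mid s)}{\sum_{k} c_k \, P(m \mid k)}
```

Setting $s$ to the FFPE signature gives the probability that the mutation is an FFPE artifact.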
`excerno` provides functions that help classify single nucleotide variants by their likely signature of origin.
`excerno_vcf` takes VCF files and classifies each variant as "PASS" or "FFPE". It also reports, for each variant, the probability that it was generated by each mutational signature.
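The labelling step can be pictured with a small base R sketch. The probabilities below are made up, and the rule used here (flag a variant as "FFPE" when the FFPE signature is its most probable source) is an assumption for illustration; the exact rule applied by `excerno_vcf` is not stated in this vignette. The helper `label_variants` is hypothetical.

```r
# Hypothetical per-variant probabilities: rows are variants, columns are signatures.
probs <- matrix(
  c(0.90, 0.10,
    0.25, 0.75,
    0.55, 0.45),
  ncol = 2, byrow = TRUE,
  dimnames = list(c("var1", "var2", "var3"), c("SBS4", "FFPE"))
)

# Assumed rule: a variant is labelled "FFPE" when the artifact signature
# has the highest probability for that variant, and "PASS" otherwise.
label_variants <- function(probs, artifact = "FFPE") {
  most_likely <- colnames(probs)[max.col(probs, ties.method = "first")]
  ifelse(most_likely == artifact, "FFPE", "PASS")
}

labels <- label_variants(probs)
```

A stricter filter could instead compare the FFPE probability against a chosen cutoff.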
`excerno_vcf` takes VCF files as its main input. Here is an example of loading the VCF files included with the package.
vcf.files <- list.files(system.file("extdata", package = "excerno"), pattern = "SIMULATED_SAMPLE_SBS4_\\d\\.vcf", full.names = TRUE)
`excerno_vcf` offers two methods for calculating the contribution of signatures in a sample: "nmf" or "linear". Each method has different input requirements.
If using the "nmf" method to calculate the contribution of the signatures present in a sample, the number of signatures is required.
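To show why NMF needs the number of signatures up front, here is a minimal base R sketch of NMF with multiplicative updates (Lee and Seung). It is not the MutationalPatterns implementation; the function `nmf_sketch` and the toy matrix are hypothetical. The `rank` argument fixes how many signature columns are extracted.

```r
# Approximate a non-negative matrix V (mutation types x samples) by W %*% H,
# where W holds `rank` signatures and H holds their contributions per sample.
nmf_sketch <- function(V, rank, iterations = 500, eps = 1e-9, seed = 1) {
  set.seed(seed)
  W <- matrix(runif(nrow(V) * rank), nrow = nrow(V))
  H <- matrix(runif(rank * ncol(V)), nrow = rank)
  for (i in seq_len(iterations)) {
    # Multiplicative updates keep every entry non-negative.
    H <- H * (t(W) %*% V) / (t(W) %*% W %*% H + eps)
    W <- W * (V %*% t(H)) / (W %*% H %*% t(H) + eps)
  }
  # Rescale so each signature sums to 1; the product W %*% H is unchanged.
  scale <- colSums(W)
  list(signatures = sweep(W, 2, scale, "/"),
       contribution = sweep(H, 1, scale, "*"))
}

# Toy input: 4 mutation types x 3 samples (real input has 96 rows).
V <- matrix(c(40, 10, 5, 45,
              20, 30, 25, 25,
              5, 45, 40, 10), nrow = 4)
fit <- nmf_sketch(V, rank = 2)
reconstruction <- fit$signatures %*% fit$contribution
```

Because the factorization is learned from the data alone, the extracted signatures still have to be matched to known ones afterwards.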
If using the "linear" method to calculate the contribution of the signatures present in a sample, the target signatures must be provided as a matrix with the mutation types as row names and the signature names as column names. Here is an example of creating a signature matrix for `target.sigs`.
target.sigs <- matrix(nrow = 96, ncol = 2)
target.sigs[,1] <- cosmic.sig4
target.sigs[,2] <- ffpe.sig
rownames(target.sigs) <- get_mutation_types()
colnames(target.sigs) <- c("SBS4", "FFPE")
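With the target signatures fixed, estimating contributions becomes a constrained least-squares fit. The sketch below uses base R's `optim` with non-negativity bounds on toy three-row signatures. It is an assumption about what a "linear" fit involves, not the package's own code, and `fit_contribution` is a hypothetical helper.

```r
# Hypothetical known signatures over three mutation types (columns sum to 1).
target <- matrix(c(0.7, 0.2, 0.1,
                   0.1, 0.1, 0.8),
                 nrow = 3, dimnames = list(NULL, c("SBS4", "FFPE")))

# Observed counts constructed from 60 SBS4 and 40 FFPE mutations.
counts <- as.vector(target %*% c(60, 40))

# Find non-negative contributions x that minimise sum((counts - target %*% x)^2).
fit_contribution <- function(counts, target) {
  loss <- function(x) sum((counts - target %*% x)^2)
  grad <- function(x) as.vector(-2 * t(target) %*% (counts - target %*% x))
  fit <- optim(rep(1, ncol(target)), loss, grad,
               method = "L-BFGS-B", lower = 0)
  setNames(fit$par, colnames(target))
}

contribution <- fit_contribution(counts, target)
```

On this noise-free toy input the fit should recover contributions close to 60 and 40.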
library(excerno)
# Load in signatures
cosmic.sigs <- get_known_signatures()
cosmic.sig4 <- as.matrix(cosmic.sigs[,4])
ffpe.sig <- get_ffpe_signature()
# Load in vcf files
vcf.files <- list.files(system.file("extdata", package = "excerno"), pattern = "SIMULATED_SAMPLE_SBS4_\\d\\.vcf", full.names = TRUE)
vcf.file <- "SIMULATED_SAMPLE_SBS4_1_classified.vcf"
method <- "nmf"
num.signatures <- 2
target.sigs <- matrix(nrow = 96, ncol = 2)
target.sigs[,1] <- cosmic.sig4
target.sigs[,2] <- ffpe.sig
rownames(target.sigs) <- get_mutation_types()
colnames(target.sigs) <- c("SBS4", "FFPE")
excerno_vcf(vcf.files, "linear", target.sigs = target.sigs)
We tested the performance of our method by simulating mutations that match particular signature distributions. We then ran a modified version of our classifier function that lets us compare the true source of each simulated mutation with the source predicted by our method. Functions for simulating mutations and classifying simulated mutations are included in the package.
Load the mutational signatures from COSMIC version 3 using MutationalPatterns.
library(MutationalPatterns)
cosmic.sigs <- get_known_signatures()
Extract COSMIC Signature 4 from the matrix of all COSMIC signatures. Create a vector of the 96 single base substitution mutation types using `get_mutation_types` and assign it to the row names of the Signature 4 matrix for compatibility with MutationalPatterns plotting functions. Use `plot_96_profile` from MutationalPatterns to visualize the distribution of Signature 4.
cosmic.sig4 <- as.matrix(cosmic.sigs[,4])
mutations <- get_mutation_types()
rownames(cosmic.sig4) <- mutations
plot_96_profile(cosmic.sig4)
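For readers unfamiliar with the 96 mutation types, the labels can be built in base R as follows. This sketch assumes the labels follow the "A[C>T]A" form used elsewhere in this vignette; it does not guarantee the same ordering as `get_mutation_types()`.

```r
# Six substitution classes (pyrimidine reference base) with every combination
# of 5' and 3' neighbouring base: 6 * 4 * 4 = 96 mutation types.
bases <- c("A", "C", "G", "T")
substitutions <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")

# expand.grid varies its first argument fastest, so the 3' base cycles first.
grid <- expand.grid(right = bases, left = bases,
                    substitution = substitutions,
                    stringsAsFactors = FALSE)

mutation.types <- paste0(grid$left, "[", grid$substitution, "]", grid$right)
```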
Load the FFPE Signature using `get_ffpe_signature`.
ffpe.sig <- get_ffpe_signature()
Create vectors of 100 mutations matching the distributions of COSMIC Signature 4 and the FFPE Signature using `create_signature_sample_vector`.
sample.sig4 <- create_signature_sample_vector(cosmic.sig4, 100)
sample.ffpe <- create_signature_sample_vector(ffpe.sig, 100)
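Conceptually, simulating mutations from a signature amounts to drawing mutation types with probabilities given by the signature. The base R sketch below illustrates that idea on a made-up three-type signature; it is an assumption about the general approach, not the internals of `create_signature_sample_vector`.

```r
set.seed(1)

# Hypothetical signature over three mutation types (probabilities sum to 1).
types <- c("A[C>A]A", "A[C>G]A", "A[C>T]A")
signature <- c(0.2, 0.1, 0.7)

# Draw 100 mutations; each type is chosen with its signature probability.
simulated <- sample(types, size = 100, replace = TRUE, prob = signature)

# Observed proportions should approximate the signature as the sample grows.
observed <- table(factor(simulated, levels = types)) / length(simulated)
```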
Use `signature_cosine_similarity` to calculate the cosine similarity between the simulated vectors above and the original signatures.
signature_cosine_similarity(sample.sig4, cosmic.sig4)
signature_cosine_similarity(sample.ffpe, ffpe.sig)
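Cosine similarity itself is easy to compute by hand. The sketch below tabulates a small, made-up sample into counts per mutation type and compares it with a hypothetical signature; `cosine_similarity` is an illustrative helper and not a function from the package.

```r
# Cosine of the angle between two numeric vectors: 1 means identical direction.
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

# Hypothetical signature and sample over three mutation types.
types <- c("A[C>A]A", "A[C>G]A", "A[C>T]A")
signature <- c(0.2, 0.1, 0.7)
sample.vector <- c("A[C>T]A", "A[C>T]A", "A[C>A]A", "A[C>T]A", "A[C>G]A")

# Count each mutation type, keeping types that never occur as zero.
counts <- as.vector(table(factor(sample.vector, levels = types)))

similarity <- cosine_similarity(counts, signature)
```

Because cosine similarity ignores vector length, raw counts can be compared directly with a probability vector.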
Combine the Signature 4 and FFPE sample vectors and run the Bayesian classifier on the combined sample using `classify_simulated_samples`.
# Turn into a list for input into classify_simulated_samples()
samples <- list(sample.sig4, sample.ffpe)
signatures <- list(cosmic.sig4, ffpe.sig)
classification.df <- classify_simulated_samples(samples, signatures)
classification.df
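Once the true and predicted sources sit side by side, performance can be summarised with a confusion table and an accuracy value. The data frame below is made up, and the column name `predicted` is a hypothetical stand-in: check the column names of your own `classification.df` before adapting this.

```r
# Hypothetical results: true source and predicted source per mutation.
results <- data.frame(
  mutations = c("A[C>T]A", "A[C>A]A", "G[C>T]G", "T[C>A]C"),
  truth     = c("FFPE", "SBS4", "FFPE", "SBS4"),
  predicted = c("FFPE", "SBS4", "SBS4", "SBS4"),
  stringsAsFactors = FALSE
)

# Rows are the true source, columns the predicted source.
confusion <- table(truth = results$truth, predicted = results$predicted)

# Fraction of mutations whose predicted source matches the truth.
accuracy <- mean(results$truth == results$predicted)
```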
The function `create_gr_from_sample` converts a classification data frame into a GRanges object. The classification data frame must have a mutations column and a truth column. The mutations column contains strings of mutations such as "A[C>T]A". The truth column indicates which mutational signature the mutation came from.
The output GRanges object is meant to parallel a VCF file, so additional columns are created in it: info, quality, filter, format, and samples. These arguments are optional, and the corresponding columns remain NA if no values are provided.
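# Assumes a BSgenome object named Hsapiens is already loaded (for example from a BSgenome.Hsapiens.UCSC.* package)
seq <- getSeq(Hsapiens, "chr1")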
classification.gr <- create_gr_from_sample(classification.df, seq, "chr1")
# Adding values to other columns
info <- sample("SOMATIC", 200, replace = TRUE)
quality <- sample(50:100, 200, replace = TRUE)
filter <- sample("PASS", 200, replace = TRUE)
format <- sample("GT:GQ", 200, replace = TRUE)
samples <- list(sample(paste("0/0:", 1:100, sep = ""), 200, replace = TRUE), sample(paste("0/0:", 1:100, sep = ""), 200, replace = TRUE))
sample.names <- c("SAMPLE1", "SAMPLE2")
classification.gr <- create_gr_from_sample(classification.df, seq, "chr1", info, quality, filter, format, samples, sample.names)
The function `write_grange_to_vcf` takes a GRanges object and writes a VCF file containing its values.
vcf.filename <- "new_vcf.file"
write_grange_to_vcf(classification.gr, vcf.filename)
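To make the target file layout concrete, here is a minimal base R sketch that writes a few made-up variants as VCF text. It only illustrates the eight fixed VCF columns and is not how `write_grange_to_vcf` is implemented; the positions, alleles, and FILTER description are hypothetical.

```r
# Hypothetical variants; FILTER carries the classification label.
variants <- data.frame(
  chrom  = c("chr1", "chr1"),
  pos    = c(10583L, 13302L),
  ref    = c("C", "C"),
  alt    = c("T", "A"),
  filter = c("FFPE", "PASS"),
  stringsAsFactors = FALSE
)

# Meta-information lines start with "##"; the column header starts with "#".
header <- c(
  "##fileformat=VCFv4.2",
  '##FILTER=<ID=FFPE,Description="Likely FFPE artifact (hypothetical description)">',
  paste(c("#CHROM", "POS", "ID", "REF", "ALT", "QUAL", "FILTER", "INFO"),
        collapse = "\t")
)

# One tab-separated record per variant; "." marks a missing value.
records <- paste(variants$chrom, variants$pos, ".", variants$ref,
                 variants$alt, ".", variants$filter, ".", sep = "\t")

vcf.path <- tempfile(fileext = ".vcf")
writeLines(c(header, records), vcf.path)
```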
`signature_cosine_similarity` determines how similar a sample vector of mutations is to a mutational signature.
# Simulate sample
cosmic.sig4 <- as.matrix(get_known_signatures()[,4])
sample.sig4 <- create_signature_sample_vector(cosmic.sig4)
signature_cosine_similarity(sample.sig4, cosmic.sig4)
`extract_all_prob` outputs two probabilities for one of the 96 mutation types, given two signatures and the contribution of each signature.
cosmic.sigs <- get_known_signatures()
# Get signatures
signatures <- matrix(nrow = 96, ncol = 2)
signatures[,1] <- cosmic.sigs[,4]
signatures[,2] <- get_ffpe_signature()
# Get contributions
contribution <- matrix(nrow = 2, ncol = 1)
contribution[,1] <- c(0.5, 0.5)
# Naming columns and rows
colnames(signatures) <- c("SBS4", "FFPE")
rownames(signatures) <- get_mutation_types()
rownames(contribution) <- c("SBS4", "FFPE")
mutation <- "A[C>T]A"
extract_all_prob(mutation, signatures, contribution)
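The calculation behind such per-signature probabilities can be sketched in base R. The example assumes that each signature's relative contribution acts as the prior in Bayes' formula; the toy signatures have three rows instead of 96, contributions are passed as a named vector rather than a one-column matrix, and `posterior_by_signature` is a hypothetical helper, not the package's function.

```r
# Toy signatures over three mutation types; matrix() fills column by column.
signatures <- matrix(
  c(0.7, 0.2, 0.1,   # hypothetical SBS4 probabilities
    0.1, 0.1, 0.8),  # hypothetical FFPE probabilities
  nrow = 3,
  dimnames = list(c("A[C>A]A", "A[C>G]A", "A[C>T]A"), c("SBS4", "FFPE"))
)

# Relative contribution of each signature to the sample (used as the prior).
contribution <- c(SBS4 = 0.6, FFPE = 0.4)

# Posterior probability of each signature given the observed mutation type.
posterior_by_signature <- function(mutation, signatures, contribution) {
  joint <- signatures[mutation, ] * contribution[colnames(signatures)]
  joint / sum(joint)
}

post <- posterior_by_signature("A[C>T]A", signatures, contribution)
```

Here the FFPE posterior is 0.32 / (0.06 + 0.32): the mutation type is rare under SBS4 but common under FFPE, so FFPE dominates despite its smaller contribution.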
`find_signature_name` takes a signature matrix, searches the COSMIC signatures for the most similar one, and outputs the name of that signature.
cosmic.sigs <- get_known_signatures()
cosmic.sig4 <- as.matrix(cosmic.sigs[,4])
find_signature_name(cosmic.sig4)
`get_mutational_vector` takes a VCF file and creates a vector of its mutations in string form.
# Load file for testing
file <- system.file("extdata", "SIMULATED_SAMPLE_SBS4_1.vcf", package = "excerno")
vcf.vector <- get_mutational_vector(file)
`write_classification_to_vcf` adds the classification information from a classification data frame to its original VCF file. It takes the original VCF file and a data frame with the classifications.
vcf.file <- system.file("extdata", "SIMULATED_SAMPLE_SBS4_1.vcf", package = "excerno")
# Load in signatures
cosmic.sigs <- get_known_signatures()
# Get signatures
signatures <- matrix(nrow = 96, ncol = 2)
signatures[,1] <- cosmic.sigs[,4]
signatures[,2] <- get_ffpe_signature()
rownames(signatures) <- get_mutation_types()
colnames(signatures) <- c("SBS4", "FFPE")
# Get contributions
contribution <- matrix(nrow = 2, ncol = 1)
contribution[,1] <- c(0.5, 0.5)
rownames(contribution) <- c("SBS4", "FFPE")
classification.df <- get_classification(signatures, contribution)
write_classification_to_vcf(vcf.file, classification.df)
Code for generating the samples included with this package.
set.seed(10)
sample.ffpe <- create_signature_sample_vector(ffpe.sig, 500)
sample.sig4 <- create_signature_sample_vector(cosmic.sig4, 500)
samples <- list(sample.sig4, sample.ffpe)
signatures <- list(cosmic.sig4, ffpe.sig)
classify.df <- classify_simulated_samples(samples, signatures)
info <- sample("SOMATIC", 1000, replace = TRUE)
quality <- sample(50:100, 1000, replace = TRUE)
filter <- sample("PASS", 1000, replace = TRUE)
format <- sample("GT:GQ", 1000, replace = TRUE)
samples <- list(sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE), sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE))
sample.names <- c("SAMPLE1", "SAMPLE2")
classify.gr <- create_gr_from_sample(classify.df, seq, "chr1", info, quality, filter, format, samples, sample.names)
file.name <- "SIMULATED_SAMPLE_SBS4_1.vcf"
write_grange_to_vcf(classify.gr, file.name)
set.seed(20)
sample.ffpe <- create_signature_sample_vector(ffpe.sig, 800)
sample.sig4 <- create_signature_sample_vector(cosmic.sig4, 200)
samples <- list(sample.sig4, sample.ffpe)
signatures <- list(cosmic.sig4, ffpe.sig)
classify.df <- classify_simulated_samples(samples, signatures)
info <- sample("SOMATIC", 1000, replace = TRUE)
quality <- sample(50:100, 1000, replace = TRUE)
filter <- sample("PASS", 1000, replace = TRUE)
format <- sample("GT:GQ", 1000, replace = TRUE)
samples <- list(sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE), sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE))
sample.names <- c("SAMPLE1", "SAMPLE2")
classify.gr <- create_gr_from_sample(classify.df, seq, "chr1", info, quality, filter, format, samples, sample.names)
file.name <- "SIMULATED_SAMPLE_SBS4_2.vcf"
write_grange_to_vcf(classify.gr, file.name)
set.seed(30)
sample.ffpe <- create_signature_sample_vector(ffpe.sig, 400)
sample.sig4 <- create_signature_sample_vector(cosmic.sig4, 600)
samples <- list(sample.sig4, sample.ffpe)
signatures <- list(cosmic.sig4, ffpe.sig)
classify.df <- classify_simulated_samples(samples, signatures)
info <- sample("SOMATIC", 1000, replace = TRUE)
quality <- sample(50:100, 1000, replace = TRUE)
filter <- sample("PASS", 1000, replace = TRUE)
format <- sample("GT:GQ", 1000, replace = TRUE)
samples <- list(sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE), sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE))
sample.names <- c("SAMPLE1", "SAMPLE2")
classify.gr <- create_gr_from_sample(classify.df, seq, "chr1", info, quality, filter, format, samples, sample.names)
file.name <- "SIMULATED_SAMPLE_SBS4_3.vcf"
write_grange_to_vcf(classify.gr, file.name)
set.seed(40)
sample.ffpe <- create_signature_sample_vector(ffpe.sig, 100)
sample.sig4 <- create_signature_sample_vector(cosmic.sig4, 900)
samples <- list(sample.sig4, sample.ffpe)
signatures <- list(cosmic.sig4, ffpe.sig)
classify.df <- classify_simulated_samples(samples, signatures)
info <- sample("SOMATIC", 1000, replace = TRUE)
quality <- sample(50:100, 1000, replace = TRUE)
filter <- sample("PASS", 1000, replace = TRUE)
format <- sample("GT:GQ", 1000, replace = TRUE)
samples <- list(sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE), sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE))
sample.names <- c("SAMPLE1", "SAMPLE2")
classify.gr <- create_gr_from_sample(classify.df, seq, "chr1", info, quality, filter, format, samples, sample.names)
file.name <- "SIMULATED_SAMPLE_SBS4_4.vcf"
write_grange_to_vcf(classify.gr, file.name)