vignettes/excerno-intro.md

---
title: "Introduction to Excerno"
author: "Audrey Mitchell, Marco Ruiz, Soua Yang"
date: "2022-08-17"
output:
  rmarkdown::html_vignette:
    keep_md: true
vignette: >
  %\VignetteIndexEntry{excerno-intro}
  %\VignetteEncoding{UTF-8}
  %\VignetteEngine{knitr::rmarkdown}
---

# library(excerno)

Introduction

Formalin-Fixation Paraffin-Embedding (FFPE) is a preservation technique for cancer tissue samples that introduces novel, artifactual mutations. Leveraging the known mutational signature of FFPE and the mutational signatures from the Catalogue of Somatic Mutations in Cancer (COSMIC) library, we set out to classify and filter FFPE artifacts. Our method uses non-negative matrix factorization (via the MutationalPatterns R package) and Bayes’ formula to calculate the probability that each mutation in a sample was caused by FFPE. Our methods are implemented in this package, excerno.
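The Bayes step described above can be written out as follows. The notation is illustrative and not taken from the package: $c_k$ is the estimated contribution of signature $S_k$ to the sample (normalized so the contributions sum to 1), and $P(m \mid S_k)$ is the probability of mutation type $m$ under signature $S_k$.

```latex
P(\mathrm{FFPE} \mid m)
  = \frac{c_{\mathrm{FFPE}} \, P(m \mid S_{\mathrm{FFPE}})}
         {\sum_{k} c_k \, P(m \mid S_k)}
```

A variant whose mutation type is much more likely under the FFPE signature than under the other signatures present in the sample therefore receives a high FFPE probability.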

excerno provides functions that help classify single nucleotide variants by their most likely signature of origin.

Using excerno_vcf() on VCF files

excerno_vcf takes VCF files and classifies each variant as "PASS" or "FFPE". It also reports, for each variant, the probability that it was generated by each mutational signature.

Inputs

VCF file

excerno_vcf takes VCF files as its main input source. Here is an example of loading in VCF files (included in package).

vcf.files <- list.files(system.file("extdata", package = "excerno"), pattern = "SIMULATED_SAMPLE_SBS4_\\d.vcf", full.names = TRUE)

Method

excerno_vcf offers two methods for calculating the contribution of signatures in a sample: "nmf" or "linear". Each method has different input requirements.

Number of Signatures

If using method NMF to calculate the contribution of the signatures present in a sample, the number of signatures is required.

Target Signatures

If using method linear to calculate the contribution of the signatures present in a sample, the target signatures need to be provided as a matrix with the mutation types as row names and the signature names as column names. Here is an example of creating a signature matrix for target.sigs.

target.sigs <- matrix(nrow = 96, ncol = 2)
target.sigs[,1] <- cosmic.sig4
target.sigs[,2] <- ffpe.sig
rownames(target.sigs) <- get_mutation_types()
colnames(target.sigs) <- c("SBS4", "FFPE")
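When the signatures are known in advance, one common way to estimate their contributions is a non-negative least-squares fit of the sample's mutation counts to the signature matrix. The base R sketch below illustrates that idea on a toy matrix with 4 mutation types instead of 96. Whether excerno's "linear" method works exactly this way is an assumption, and all names and values here are illustrative.

```r
# Toy signature matrix: 4 mutation types, 2 signatures (illustrative values only).
toy.sigs <- cbind(
  SIG_A = c(0.7, 0.1, 0.1, 0.1),
  SIG_B = c(0.1, 0.1, 0.1, 0.7)
)

# Observed counts per mutation type: exactly 70 mutations from SIG_A and 30 from SIG_B.
profile <- as.vector(toy.sigs %*% c(70, 30))

# Squared reconstruction error and its gradient for a vector of contributions x.
sq.error <- function(x) sum((profile - toy.sigs %*% x)^2)
sq.error.grad <- function(x) {
  as.vector(-2 * t(toy.sigs) %*% (profile - toy.sigs %*% x))
}

# Box-constrained optimisation keeps every contribution non-negative.
fit <- optim(
  par = c(1, 1), fn = sq.error, gr = sq.error.grad,
  method = "L-BFGS-B", lower = c(0, 0)
)

# Store the result in the same shape as a contribution matrix: one row per signature.
contribution <- matrix(
  fit$par, ncol = 1,
  dimnames = list(colnames(toy.sigs), "sample")
)
round(contribution / sum(contribution), 3)
```

The fitted contributions should land close to the 70/30 split used to build the toy profile.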

Example

library(excerno)
library(MutationalPatterns)  # provides get_known_signatures()

# Load in signatures
cosmic.sigs <- get_known_signatures()
cosmic.sig4 <- as.matrix(cosmic.sigs[,4])
ffpe.sig <- get_ffpe_signature()

# Load in vcf files
vcf.files <- list.files(system.file("extdata", package = "excerno"), pattern = "SIMULATED_SAMPLE_SBS4_\\d.vcf", full.names = TRUE)
vcf.file <- "SIMULATED_SAMPLE_SBS4_1_classified.vcf"

# Use the "linear" method, which takes the target signatures as input
method <- "linear"
num.signatures <- 2  # only needed when method is "nmf"
target.sigs <- matrix(nrow = 96, ncol = 2)
target.sigs[,1] <- cosmic.sig4
target.sigs[,2] <- ffpe.sig
rownames(target.sigs) <- get_mutation_types()
colnames(target.sigs) <- c("SBS4", "FFPE")

excerno_vcf(vcf.files, method, target.sigs = target.sigs)

Simulations

We tested the performance of our method by simulating mutations to match particular distributions and running a modified version of our classifier function that allows for the comparison of the true source of simulated mutations and the source predicted by our method. Functions for simulating mutations and classifying simulated mutations are included in the package.

Creating simulated samples

Load the mutational signatures from COSMIC version 3 using MutationalPatterns.

library(MutationalPatterns)
cosmic.sigs <- get_known_signatures()

Extract COSMIC Signature 4 from the matrix of all COSMIC signatures. Create a vector of the 96 single base substitution mutation types using get_mutation_types and assign it to the row names of the Signature 4 matrix for compatibility with MutationalPatterns plotting functions. Use plot_96_profile from MutationalPatterns to visualize the distribution of Signature 4.

cosmic.sig4 <- as.matrix(cosmic.sigs[,4])
mutations <- get_mutation_types()
rownames(cosmic.sig4) <- mutations
plot_96_profile(cosmic.sig4)
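For readers unfamiliar with the 96 mutation types, the base R sketch below builds them from first principles: 6 substitution classes times 4 possible 5' bases times 4 possible 3' bases. The ordering used here is an assumption and is only meant to illustrate the structure of the strings; in practice the vector returned by get_mutation_types should be used.

```r
# Six substitution classes, written from the pyrimidine base of the pair.
substitutions <- c("C>A", "C>G", "C>T", "T>A", "T>C", "T>G")
bases <- c("A", "C", "G", "T")

# expand.grid varies its first argument fastest, so the 3' base cycles
# fastest, then the 5' base, then the substitution class.
grid <- expand.grid(
  three.prime = bases,
  five.prime = bases,
  substitution = substitutions,
  stringsAsFactors = FALSE
)

# Assemble strings of the form 5'base[ref>alt]3'base, e.g. "A[C>T]A".
mutation.types <- paste0(
  grid$five.prime, "[", grid$substitution, "]", grid$three.prime
)

length(mutation.types)
head(mutation.types, 4)
```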

Load the FFPE Signature using get_ffpe_signature.

ffpe.sig <- get_ffpe_signature()

Create vectors of 100 mutations matching the distributions of COSMIC Signature 4 and the FFPE Signature using create_signature_sample_vector.

sample.sig4 <- create_signature_sample_vector(cosmic.sig4, 100)
sample.ffpe <- create_signature_sample_vector(ffpe.sig, 100)
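Conceptually, drawing mutations that match a signature amounts to sampling mutation types with the signature's probabilities as weights. The base R sketch below shows this on a toy signature with 4 types. It is not the implementation of create_signature_sample_vector, and the names and values are illustrative.

```r
set.seed(1)  # for a reproducible illustration

# Toy signature over 4 mutation types (a real signature has 96 types).
toy.types <- c("A[C>A]A", "A[C>G]A", "A[C>T]A", "A[T>C]A")
toy.sig <- c(0.1, 0.2, 0.6, 0.1)  # probabilities, summing to 1

# Draw 100 mutations; each type is chosen with the probability the signature gives it.
toy.sample <- sample(toy.types, size = 100, replace = TRUE, prob = toy.sig)

# The observed proportions should roughly follow toy.sig.
table(factor(toy.sample, levels = toy.types)) / length(toy.sample)
```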

Use signature_cosine_similarity to calculate the cosine similarity between the simulated vectors above and the original signatures.

signature_cosine_similarity(sample.sig4, cosmic.sig4)
signature_cosine_similarity(sample.ffpe, ffpe.sig)
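Cosine similarity compares the direction of two vectors and ignores their scale, which is why raw mutation counts can be compared directly with signature probabilities. The base R sketch below shows the calculation on toy data. The helper cosine_similarity is hypothetical and is not the package's signature_cosine_similarity.

```r
# Hypothetical helper: cosine of the angle between two numeric vectors.
cosine_similarity <- function(x, y) {
  sum(x * y) / (sqrt(sum(x^2)) * sqrt(sum(y^2)))
}

toy.types <- c("A[C>A]A", "A[C>G]A", "A[C>T]A", "A[T>C]A")
toy.sig <- c(0.1, 0.2, 0.6, 0.1)

# A sample is a vector of mutation strings; count how often each type occurs.
toy.sample <- c(
  rep("A[C>A]A", 1), rep("A[C>G]A", 2),
  rep("A[C>T]A", 6), rep("A[T>C]A", 1)
)
toy.counts <- as.vector(table(factor(toy.sample, levels = toy.types)))

# The counts are proportional to the signature, so the similarity is 1
# (up to floating-point error).
cosine_similarity(toy.counts, toy.sig)
```

Values close to 1 indicate that the simulated sample follows the signature closely; small samples will typically score lower because of sampling noise.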

Creating a classification data frame with simulated samples

Combine the Signature 4 and FFPE sample vectors and run the Bayesian classifier on the combined sample using classify_simulated_samples.

# Turn into a list for input into classify_simulated_samples()
samples <- list(sample.sig4, sample.ffpe)
signatures <- list(cosmic.sig4, ffpe.sig)

classification.df <- classify_simulated_samples(samples, signatures)
classification.df

Converting a simulated classification data frame to a GRanges object

The function create_gr_from_sample converts a classification data frame into a GRanges object. The classification data frame must have a mutations column and a truth column. The mutations column contains mutation strings such as "A[C>T]A". The truth column indicates which mutational signature each mutation came from.
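A mutation string such as "A[C>T]A" encodes the 5' neighbouring base, the reference and alternate alleles, and the 3' neighbouring base. The base R sketch below pulls those parts apart, which is the kind of information needed to place a mutation on a reference sequence. The helper parse_mutation is hypothetical and not part of the package.

```r
# Hypothetical helper: split a string like "A[C>T]A" into its components.
parse_mutation <- function(mutation) {
  pattern <- "^([ACGT])\\[([ACGT])>([ACGT])\\]([ACGT])$"
  parts <- regmatches(mutation, regexec(pattern, mutation))[[1]]
  # A successful match returns the full match plus four capture groups.
  if (length(parts) != 5) stop("Unrecognised mutation string: ", mutation)
  list(
    context = paste0(parts[2], parts[3], parts[5]),  # reference trinucleotide
    ref = parts[3],                                  # reference allele
    alt = parts[4]                                   # alternate allele
  )
}

parse_mutation("A[C>T]A")
```

For "A[C>T]A" the reference trinucleotide is "ACA", the reference allele is "C", and the alternate allele is "T".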

The output GRanges object is meant to parallel a VCF file, so additional columns are created in it: info, quality, filter, format, and samples (if values are provided). These parameters are optional and remain NA if no values are provided.

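# Hsapiens must come from a loaded BSgenome human genome package; the genome build is not stated here
seq <- getSeq(Hsapiens, "chr1")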
classification.gr <- create_gr_from_sample(classification.df, seq, "chr1")

# Adding values to other columns
info <- sample("SOMATIC", 200, replace = TRUE)
quality <- sample(50:100, 200, replace = TRUE)
filter <- sample("PASS", 200, replace = TRUE)
format <- sample("GT:GQ", 200, replace = TRUE)
samples <- list(sample(paste("0/0:", 1:100, sep = ""), 200, replace = TRUE), sample(paste("0/0:", 1:100, sep = ""), 200, replace = TRUE))
sample.names <- c("SAMPLE1", "SAMPLE2")

classification.gr <- create_gr_from_sample(classification.df, seq, "chr1", info, quality, filter, format, samples, sample.names)

Writing a classification data frame to a VCF file

The function write_grange_to_vcf takes a GRanges object and writes its values to a VCF file.

vcf.filename <- "new_vcf.file"

write_grange_to_vcf(classification.gr, vcf.filename)
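To show the layout of the file that such a conversion produces, the base R sketch below writes a toy table with the eight fixed VCF columns to a temporary file. It is a minimal illustration and not the implementation of write_grange_to_vcf: it omits the meta-information lines (for example INFO and FILTER declarations) that a strict VCF parser may require, and all values are made up.

```r
# Toy variant table with the eight fixed VCF columns.
variants <- data.frame(
  CHROM = "chr1",
  POS = c(1000L, 2000L),
  ID = ".",
  REF = c("C", "T"),
  ALT = c("T", "C"),
  QUAL = c(60L, 75L),
  FILTER = c("PASS", "FFPE"),
  INFO = "SOMATIC",
  stringsAsFactors = FALSE
)

out.file <- tempfile(fileext = ".vcf")

# File format line, then the tab-separated column header starting with "#".
header <- c(
  "##fileformat=VCFv4.2",
  paste0("#", paste(names(variants), collapse = "\t"))
)

# One tab-separated line per variant.
body <- do.call(paste, c(variants, sep = "\t"))

writeLines(c(header, body), out.file)
readLines(out.file)
```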

Other Functions Not Mentioned

signature_cosine_similarity

Determines how similar a mutational vector sample is to a mutational signature.

# Simulate sample
cosmic.sig4 <- as.matrix(get_known_signatures()[,4])
sample.sig4 <- create_signature_sample_vector(cosmic.sig4)

signature_cosine_similarity(sample.sig4, cosmic.sig4)

extract_all_prob

Given one of the 96 mutation types, two signatures, and the contribution of each signature, outputs the probability that the mutation was generated by each of the two signatures.

cosmic.sigs <- get_known_signatures()

# Get signatures
signatures <- matrix(nrow = 96, ncol = 2)
signatures[,1] <- cosmic.sigs[,4]
signatures[,2] <- get_ffpe_signature()

# Get contributions
contribution <- matrix(nrow = 2, ncol = 1)
contribution[,1] <- c(0.5, 0.5)

# Naming columns and rows
colnames(signatures) <- c("SBS4", "FFPE")
rownames(signatures) <- get_mutation_types()

rownames(contribution) <- c("SBS4", "FFPE")

mutation <- "A[C>T]A"
extract_all_prob(mutation, signatures, contribution)
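As a worked illustration of this kind of calculation, the base R sketch below weights each signature's probability for a mutation type by that signature's contribution and normalizes the result. It uses a toy matrix with 3 mutation types and made-up values. The helper posterior_by_signature is hypothetical, and the sketch is not the implementation of extract_all_prob.

```r
# Toy signatures over 3 mutation types (real signatures have 96 rows).
signatures <- matrix(
  c(0.2, 0.6,
    0.5, 0.3,
    0.3, 0.1),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("A[C>T]A", "A[C>A]A", "A[T>C]A"), c("SBS4", "FFPE"))
)
contribution <- c(SBS4 = 0.5, FFPE = 0.5)

# Hypothetical helper: probability of each signature given a mutation type.
posterior_by_signature <- function(mutation, signatures, contribution) {
  # Probability of the mutation type under each signature, weighted by contribution.
  joint <- signatures[mutation, ] * contribution[colnames(signatures)]
  joint / sum(joint)
}

posterior_by_signature("A[C>T]A", signatures, contribution)
# SBS4 = 0.25, FFPE = 0.75
```

With equal contributions, the result depends only on how likely the mutation type is under each signature: 0.6 against 0.2 gives a 3-to-1 split in favour of FFPE.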

find_signature_name

Given a signature matrix, search the COSMIC database for the most similar signature and output the name of that signature.

cosmic.sigs <- get_known_signatures()
cosmic.sig4 <- as.matrix(cosmic.sigs[,4])

find_signature_name(cosmic.sig4)

get_mutational_vector

Given a VCF file, create a vector of mutations in string form.

# Load file for testing
file <- system.file("extdata", "SIMULATED_SAMPLE_SBS4_1.vcf", package = "excerno")

vcf.vector <- get_mutational_vector(file)

write_classification_to_vcf

Adds the classification information from a classification data frame to its original VCF file. Takes the original VCF file and a data frame with the classifications.

vcf.file <- system.file("extdata", "SIMULATED_SAMPLE_SBS4_1.vcf", package = "excerno")

# Load in signatures
cosmic.sigs <- get_known_signatures()

# Get signatures
signatures <- matrix(nrow = 96, ncol = 2)
signatures[,1] <- cosmic.sigs[,4]
signatures[,2] <- get_ffpe_signature()
rownames(signatures) <- get_mutation_types()
colnames(signatures) <- c("SBS4", "FFPE")

# Get contributions
contribution <- matrix(nrow = 2, ncol = 1)
contribution[,1] <- c(0.5, 0.5)
rownames(contribution) <- c("SBS4", "FFPE")

classification.df <- get_classification(signatures, contribution)
write_classification_to_vcf(vcf.file, classification.df)

Data

Code for generating the samples included with this package.

SAMPLE 1: FFPE at 50% with 1000 mutations

set.seed(10)
sample.ffpe <- create_signature_sample_vector(ffpe.sig, 500)
sample.sig4 <- create_signature_sample_vector(cosmic.sig4, 500)

samples <- list(sample.sig4, sample.ffpe)
signatures <- list(cosmic.sig4, ffpe.sig)
classify.df <- classify_simulated_samples(samples, signatures)

info <- sample("SOMATIC", 1000, replace = TRUE)
quality <- sample(50:100, 1000, replace = TRUE)
filter <- sample("PASS", 1000, replace = TRUE)
format <- sample("GT:GQ", 1000, replace = TRUE)
samples <- list(sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE), sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE))
sample.names <- c("SAMPLE1", "SAMPLE2")
classify.gr <- create_gr_from_sample(classify.df, seq, "chr1", info, quality, filter, format, samples, sample.names)

file.name <- "SIMULATED_SAMPLE_SBS4_1.vcf"
write_grange_to_vcf(classify.gr, file.name)

SAMPLE 2: FFPE at 80% with 1000 mutations

set.seed(20)
sample.ffpe <- create_signature_sample_vector(ffpe.sig, 800)
sample.sig4 <- create_signature_sample_vector(cosmic.sig4, 200)

samples <- list(sample.sig4, sample.ffpe)
signatures <- list(cosmic.sig4, ffpe.sig)
classify.df <- classify_simulated_samples(samples, signatures)

info <- sample("SOMATIC", 1000, replace = TRUE)
quality <- sample(50:100, 1000, replace = TRUE)
filter <- sample("PASS", 1000, replace = TRUE)
format <- sample("GT:GQ", 1000, replace = TRUE)
samples <- list(sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE), sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE))
sample.names <- c("SAMPLE1", "SAMPLE2")
classify.gr <- create_gr_from_sample(classify.df, seq, "chr1", info, quality, filter, format, samples, sample.names)

file.name <- "SIMULATED_SAMPLE_SBS4_2.vcf"
write_grange_to_vcf(classify.gr, file.name)

SAMPLE 3: FFPE at 40% with 1000 mutations

set.seed(30)
sample.ffpe <- create_signature_sample_vector(ffpe.sig, 400)
sample.sig4 <- create_signature_sample_vector(cosmic.sig4, 600)

samples <- list(sample.sig4, sample.ffpe)
signatures <- list(cosmic.sig4, ffpe.sig)
classify.df <- classify_simulated_samples(samples, signatures)

info <- sample("SOMATIC", 1000, replace = TRUE)
quality <- sample(50:100, 1000, replace = TRUE)
filter <- sample("PASS", 1000, replace = TRUE)
format <- sample("GT:GQ", 1000, replace = TRUE)
samples <- list(sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE), sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE))
sample.names <- c("SAMPLE1", "SAMPLE2")
classify.gr <- create_gr_from_sample(classify.df, seq, "chr1", info, quality, filter, format, samples, sample.names)

file.name <- "SIMULATED_SAMPLE_SBS4_3.vcf"
write_grange_to_vcf(classify.gr, file.name)

SAMPLE 4: FFPE at 10% with 1000 mutations

set.seed(40)
sample.ffpe <- create_signature_sample_vector(ffpe.sig, 100)
sample.sig4 <- create_signature_sample_vector(cosmic.sig4, 900)

samples <- list(sample.sig4, sample.ffpe)
signatures <- list(cosmic.sig4, ffpe.sig)
classify.df <- classify_simulated_samples(samples, signatures)

info <- sample("SOMATIC", 1000, replace = TRUE)
quality <- sample(50:100, 1000, replace = TRUE)
filter <- sample("PASS", 1000, replace = TRUE)
format <- sample("GT:GQ", 1000, replace = TRUE)
samples <- list(sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE), sample(paste("0/0:", 1:100, sep = ""), 1000, replace = TRUE))
sample.names <- c("SAMPLE1", "SAMPLE2")
classify.gr <- create_gr_from_sample(classify.df, seq, "chr1", info, quality, filter, format, samples, sample.names)

file.name <- "SIMULATED_SAMPLE_SBS4_4.vcf"
write_grange_to_vcf(classify.gr, file.name)


popopo19/excerno documentation built on Aug. 28, 2022, 1:23 a.m.