determine_regional_similarity: Determine regional mutation pattern similarity

View source: R/determine_regional_similarity.R

determine_regional_similarityR Documentation

Determine regional mutation pattern similarity

Description

Calculate the cosine similarities between the global mutation profile and the mutation profile of smaller genomic windows, using a sliding window approach. Regions with a very different mutation profile can be identified in this way. This function generally requires many mutations (~100,000) to work properly.

Usage

determine_regional_similarity(
  vcf,
  ref_genome,
  chromosomes,
  window_size = 100,
  stepsize = 25,
  extension = 1,
  oligo_correction = FALSE,
  exclude_self_mut_mat = TRUE,
  max_window_size_gen = 2e+07,
  verbose = FALSE
)

Arguments

vcf

GRanges object

ref_genome

BSgenome reference genome object

chromosomes

Vector of chromosome/contig names of the reference genome to be plotted.

window_size

The number of mutations in a window. (Default: 100)

stepsize

The number of mutations that a window slides in each step. (Default: 25)

extension

The number of bases, that's extracted upstream and downstream of the base substitutions, to create the mutation matrices. (Default: 1).

oligo_correction

Boolean describing whether oligonucleotide frequency correction should be applied. (Default: FALSE)

exclude_self_mut_mat

Boolean describing whether the mutations in a window should be subtracted from the global mutation matrix. (Default: TRUE)

max_window_size_gen

The maximum size of a window before it is removed. (Default: 20,000,000)

verbose

Boolean determining the verbosity of the function. (Default: FALSE)

Details

First a global mutation matrix is calculated using all mutations. Next, a sliding window is used. This means that we create a window containing the first x mutations. The cosine similarity, between the mutation profiles of this window and the global mutation matrix, is then calculated. The window then slides y mutations to the right and the cosine similarity is again calculated. This process is repeated until the final mutation on a chromosome is reached. This process is performed separately per chromosome. Windows that span a too large region of the genome are removed, because they are unlikely to contain biologically relevant information.

The number of mutations that the window slides to the right in each step is called the stepsize. The best stepsize depends on the window size. In general, we recommend setting the stepsize between 25 window size.

The analysis can be performed for trinucleotides contexts, for a larger context, or for just the base substitutions. A smaller context might miss detailed differences in mutation profiles, but is also less noisy. We recommend using a smaller extension when analyzing small datasets.

It's possible to correct for the oligonucleotide frequency of the windows. This is done by calculating the cosine similarity of the oligonucleotide frequency between each window and the genome. The cosine similarity of the mutation profiles is then divided by the oligonucleotide similarity. This ensures that regions with an abnormal oligonucleotide frequency don't show up as having a very different profile. The oligonucleotide frequency correction slows down the function, so we advise the user to keep it off for exploratory analyses and to only turn it on to validate interesting results.

By default the mutations in a window are subtracted from the global mutation matrix, before calculating the cosine similarity. This increases sensitivity, but could also decrease specificity. This subtraction can be turned of with the 'exclude_self_mut_mat' argument.

Value

A "region_cossim" object containing both the cosine similarities and the settings used in this analysis.

See Also

plot_regional_similarity

Other regional_similarity: plot_regional_similarity()

Examples


## See the 'read_vcfs_as_granges()' example for how we obtained the
## following data:
grl <- readRDS(system.file("states/read_vcfs_as_granges_output.rds",
  package = "MutationalPatterns"
))

## We pool all the variants together, because the function doesn't work well
## with a limited number of mutations. Still, in practice we recommend to use
## more mutations that in this example.
gr = unlist(grl)

## Specifiy the chromosomes of interest.
chromosomes <- names(genome(gr)[1:3])

 ## Load the corresponding reference genome.
ref_genome <- "BSgenome.Hsapiens.UCSC.hg19"
library(ref_genome, character.only = TRUE)

## Determine the regional similarities. Here we use a small window size to make the function work.
## In practice, we recommend a larger window size.
regional_sims = determine_regional_similarity(gr,
  ref_genome, 
  chromosomes,
  window_size = 40,
  stepsize = 10,
  max_window_size_gen = 40000000
)

## Here we use an extensiof of 0 to reduce noise.
## We also turned verbosity on, so you can see at what step the function is.
## This can be useful on large datasets.
regional_sims_0_extension = determine_regional_similarity(gr,
  ref_genome, 
  chromosomes,
  window_size = 40,
  stepsize = 10,
  extension = 0,
  max_window_size_gen = 40000000,
  verbose = TRUE
)

UMCUGenetics/MutationalPatterns documentation built on Nov. 24, 2022, 4:31 a.m.