View source: R/determine_regional_similarity.R
determine_regional_similarity | R Documentation |
Calculate the cosine similarities between the global mutation profile and the mutation profile of smaller genomic windows, using a sliding window approach. Regions with a very different mutation profile can be identified in this way. This function generally requires many mutations (~100,000) to work properly.
determine_regional_similarity( vcf, ref_genome, chromosomes, window_size = 100, stepsize = 25, extension = 1, oligo_correction = FALSE, exclude_self_mut_mat = TRUE, max_window_size_gen = 2e+07, verbose = FALSE )
vcf |
GRanges object |
ref_genome |
BSgenome reference genome object |
chromosomes |
Vector of chromosome/contig names of the reference genome to be plotted. |
window_size |
The number of mutations in a window. (Default: 100) |
stepsize |
The number of mutations that a window slides in each step. (Default: 25) |
extension |
The number of bases, that's extracted upstream and downstream of the base substitutions, to create the mutation matrices. (Default: 1). |
oligo_correction |
Boolean describing whether oligonucleotide frequency correction should be applied. (Default: FALSE) |
exclude_self_mut_mat |
Boolean describing whether the mutations in a window should be subtracted from the global mutation matrix. (Default: TRUE) |
max_window_size_gen |
The maximum size of a window before it is removed. (Default: 20,000,000) |
verbose |
Boolean determining the verbosity of the function. (Default: FALSE) |
First a global mutation matrix is calculated using all mutations. Next, a sliding window is used. This means that we create a window containing the first x mutations. The cosine similarity, between the mutation profiles of this window and the global mutation matrix, is then calculated. The window then slides y mutations to the right and the cosine similarity is again calculated. This process is repeated until the final mutation on a chromosome is reached. This process is performed separately per chromosome. Windows that span a too large region of the genome are removed, because they are unlikely to contain biologically relevant information.
The number of mutations that the window slides to the right in each step is called the stepsize. The best stepsize depends on the window size. In general, we recommend setting the stepsize between 25 window size.
The analysis can be performed for trinucleotides contexts, for a larger context, or for just the base substitutions. A smaller context might miss detailed differences in mutation profiles, but is also less noisy. We recommend using a smaller extension when analyzing small datasets.
It's possible to correct for the oligonucleotide frequency of the windows. This is done by calculating the cosine similarity of the oligonucleotide frequency between each window and the genome. The cosine similarity of the mutation profiles is then divided by the oligonucleotide similarity. This ensures that regions with an abnormal oligonucleotide frequency don't show up as having a very different profile. The oligonucleotide frequency correction slows down the function, so we advise the user to keep it off for exploratory analyses and to only turn it on to validate interesting results.
By default the mutations in a window are subtracted from the global mutation matrix, before calculating the cosine similarity. This increases sensitivity, but could also decrease specificity. This subtraction can be turned of with the 'exclude_self_mut_mat' argument.
A "region_cossim" object containing both the cosine similarities and the settings used in this analysis.
plot_regional_similarity
Other regional_similarity:
plot_regional_similarity()
## See the 'read_vcfs_as_granges()' example for how we obtained the ## following data: grl <- readRDS(system.file("states/read_vcfs_as_granges_output.rds", package = "MutationalPatterns" )) ## We pool all the variants together, because the function doesn't work well ## with a limited number of mutations. Still, in practice we recommend to use ## more mutations that in this example. gr = unlist(grl) ## Specifiy the chromosomes of interest. chromosomes <- names(genome(gr)[1:3]) ## Load the corresponding reference genome. ref_genome <- "BSgenome.Hsapiens.UCSC.hg19" library(ref_genome, character.only = TRUE) ## Determine the regional similarities. Here we use a small window size to make the function work. ## In practice, we recommend a larger window size. regional_sims = determine_regional_similarity(gr, ref_genome, chromosomes, window_size = 40, stepsize = 10, max_window_size_gen = 40000000 ) ## Here we use an extensiof of 0 to reduce noise. ## We also turned verbosity on, so you can see at what step the function is. ## This can be useful on large datasets. regional_sims_0_extension = determine_regional_similarity(gr, ref_genome, chromosomes, window_size = 40, stepsize = 10, extension = 0, max_window_size_gen = 40000000, verbose = TRUE )
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.