Contains three main functions for analysing microhomology sequences in CRISPR/Cas9 deletions:
mhq
Quantifies microhomologies in CRISPR/Cas9 deletions. Accepts targeted amplicon next-generation sequencing data analysed using CRISPResso (1.0.x, Pinello et al., 2016), or any other sequencing data (e.g. Sanger sequencing) processed using a local alignment tool (e.g. MUSCLE).
gcq
Quantifies the GC content of microhomologies of different lengths. Performs a statistical test of whether GC bases are enriched relative to an expected background GC content.
amh
Counts alternative microhomologies in Sanger sequencing data of CRISPR/Cas9 deletions analysed using mhq. Note: when analysing CRISPResso data, alternative microhomology count is automatically calculated by the mhq function and this function does not need to be run separately.
Results <- mhq(input="~/exampleData/CRISPResso/", CRSISPresso=TRUE)
Output is a dataframe (or list of dataframes) containing tab-separated columns:
- MutantSequence (already determined by CRISPResso)
- ReferenceSequence (already determined by CRISPResso)
- SizeOfDeletion (already determined by CRISPResso)
- NumberOfReads (already determined by CRISPResso)
- MH_amount (microhomology amount found, if any, or "No_MH")
- MH_sequence (microhomology sequence found, if any, or "No_MH")
- altMH_count (alternative microhomologies found within the deleted sequence)
Writes the analysed data files to /path/to/directory/MHQuant_out/MHQuant_out_Allele_Frequency_table.txt
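As a rough illustration of working with this output, the sketch below computes the fraction of reads whose deletion carries a microhomology. The `results` data frame is a hand-built stand-in with invented values, not real mhq output, and it assumes that MH_amount is stored as text with "No_MH" marking alleles without microhomology. If mhq returns a list of dataframes, the same steps would apply to each element.

```r
# Hypothetical stand-in for one mhq() output dataframe.
# All values are invented for illustration only.
results <- data.frame(
  SizeOfDeletion = c(12, 7, 30),
  NumberOfReads  = c(100, 50, 50),
  MH_amount      = c("3", "No_MH", "2"),
  stringsAsFactors = FALSE
)

# Alleles with a microhomology are those not labelled "No_MH"
has_mh <- results$MH_amount != "No_MH"

# Weight each allele by its read count
frac_reads_with_mh <- sum(results$NumberOfReads[has_mh]) /
  sum(results$NumberOfReads)

print(frac_reads_with_mh)
```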
mhq(input="~/exampleData/Sanger/sequenceDataFile.txt", CRSISPresso=FALSE)
sequenceDataFile.txt should contain sequences of deletion alleles that have already been mapped and aligned using a local alignment tool such as MUSCLE. It needs five tab-separated columns: a sequence name followed by four DNA sequences around the breakpoints to be analysed. The sequence length may vary between rows, but must be the same across the sequence columns within a row. For example:
Example1 CGTGGCGAGG GCTGAGCTAT TGTTAGCACA GCTTCTCCA
Example2 CGTGGCGAGGCGTGG GCTGAGCTATGCTAT TGTTAGCACAGCACA GCTTCTCCACTCCA
where:
- column1 = Sequence name
- column2 = 5' sequence not included in the deletion
- column3 = 5' sequence included in the deletion
- column4 = 3' sequence included in the deletion
- column5 = 3' sequence not included in the deletion
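Before passing a file to mhq, it can help to confirm that it parses as five tab-separated columns of DNA sequence. The sketch below is one way to do that in base R. `check_sanger_input` and its column names are hypothetical and not part of mhscanR; the two rows are the example rows shown above.

```r
# Hypothetical helper (not part of mhscanR): basic sanity check of a
# five-column, tab-separated Sanger input file.
check_sanger_input <- function(path) {
  tab <- read.delim(
    path, header = FALSE, colClasses = "character",
    col.names = c("name", "five_kept", "five_deleted",
                  "three_deleted", "three_kept")
  )
  seq_cols <- tab[, -1, drop = FALSE]
  # Each sequence column should contain only DNA letters
  is_dna <- vapply(
    seq_cols,
    function(x) all(grepl("^[ACGTNacgtn]+$", x)),
    logical(1)
  )
  list(n_rows = nrow(tab), n_cols = ncol(tab), all_dna = all(is_dna))
}

# Write the README's example rows to a temporary file and check them
rows <- c(
  paste("Example1", "CGTGGCGAGG", "GCTGAGCTAT", "TGTTAGCACA",
        "GCTTCTCCA", sep = "\t"),
  paste("Example2", "CGTGGCGAGGCGTGG", "GCTGAGCTATGCTAT",
        "TGTTAGCACAGCACA", "GCTTCTCCACTCCA", sep = "\t")
)
tmp <- tempfile(fileext = ".txt")
writeLines(rows, tmp)
res <- check_sanger_input(tmp)
```

If the check passes, the same path can be given to mhq as the input argument.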
Output is a dataframe containing the original five columns as well as these additional columns:
- MH_amount (microhomology amount found, if any, or "No_MH")
- MH_sequence (microhomology sequence found, if any, or "No_MH")
gcq(mhqOutCRISPResso, MH=2, equalTo=FALSE, expected=0.46, CRISPResso=TRUE)
gcq(mhqOutSanger, MH=3, equalTo=TRUE, expected=0.56, CRISPResso=FALSE)
Output is a dataframe with columns:
- baseType = GC bases or AT bases
- baseNum = number of bases of each type (in alleles with the given amount of microhomology)
- baseProb = observed proportion of bases of each type in the microhomologies analysed (0 to 1, i.e. 0 to 100%)
- expectedProb = expected probability (0 to 1) of bases of each type (known background for the region of the deletions, determined by the user)
- pval = chi-squared test p-value comparing the observed and expected probabilities
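To make the meaning of these columns concrete, the sketch below runs a chi-squared goodness-of-fit test in base R on invented GC and AT counts against a background GC fraction of 0.46. It illustrates the kind of test described here. Whether gcq calls chisq.test in exactly this way is an assumption, and the counts are hypothetical.

```r
# Hypothetical base counts within microhomologies (invented values)
gc_count <- 60
at_count <- 40

# Background GC fraction supplied by the user (cf. the expected argument)
expected_gc <- 0.46

counts <- c(GC = gc_count, AT = at_count)

# Observed proportions, analogous to baseProb
observed <- counts / sum(counts)

# Goodness-of-fit test of observed counts against the expected proportions
test <- chisq.test(x = counts, p = c(expected_gc, 1 - expected_gc))

print(observed)
print(test$p.value)
```

A small p-value suggests that the base composition of the microhomologies differs from the stated background.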
library(BSgenome.Mmusculus.UCSC.mm9)
genome <- BSgenome.Mmusculus.UCSC.mm9
amh(input="~/exampleData/Sanger/sequenceDataFile_altMH.txt", genome)
Input text file columns required:
- column1 = Sequence name (same as input/output of mhq function)
- column2 = 5' sequence not included in the deletion (same as input/output of mhq function)
- column3 = 5' sequence included in the deletion (same as input/output of mhq function)
- column4 = 3' sequence included in the deletion (same as input/output of mhq function)
- column5 = 3' sequence not included in the deletion (same as input/output of mhq function)
- column6 = MH_amount (microhomology amount found, if any, or "No_MH") (same as output of mhq function)
- column7 = MH_sequence (microhomology sequence found, if any, or "No_MH") (same as output of mhq function)
- column8 = LD_chr - Additional: chromosome of deletion start
- column9 = LD_start - Additional: bed coordinate start of deletion span
- column10 = LD_stop - Additional: bed coordinate end of deletion span
- column11 = strand - Additional: strand that deletion was mapped to (or strand that microhomology is reported for)
- column12 = sg_start - Additional: bed coordinate start of sgRNAs span (if pairs of sgRNAs used, then start coordinate of 5' most sgRNA)
- column13 = sg_stop - Additional: bed coordinate end of sgRNAs span (if pairs of sgRNAs used, then end coordinate of 3' most sgRNA)
To gather the bed format information for the additional columns, a tool like UCSC BLAT can be used.
Output is a dataframe containing the original data and two additional columns:
- altMH_count - Number of alternative microhomologies found
- add_delSize - Additional size of deletion beyond sgRNAs (same as deletion size when using one sgRNA)
if (!requireNamespace("devtools", quietly = TRUE))
install.packages("devtools")
devtools::install_github("d0minicO/mhscanR")
library(mhscanR)
Requires Biostrings, stringi, stringr, BSgenome, GenomicRanges and tidyverse. If the mhscanR installation fails, see http://bioconductor.org/ and https://www.tidyverse.org/ for help installing these packages first.
Queries, bugs, or discussions welcome: dominic.owens@utoronto.ca
MIT