View source: R/compare2Sequences.R
compare2Sequences | R Documentation |
Generate all possible guide RNAs (gRNAs) for two input sequences, or two sets of sequences, and generate scores for potential off-targets in the other sequence.
compare2Sequences(
inputFile1Path,
inputFile2Path,
inputNames = c("Seq1", "Seq2"),
format = c("fasta", "fasta"),
header = FALSE,
findgRNAsWithREcutOnly = FALSE,
searchDirection = c("both", "1to2", "2to1"),
BSgenomeName,
baseEditing = FALSE,
targetBase = "C",
editingWindow = 4:8,
editingWindow.offtargets = 4:8,
REpatternFile = system.file("extdata", "NEBenzymes.fa", package = "CRISPRseek"),
minREpatternSize = 6,
findgRNAs = c(TRUE, TRUE),
removegRNADetails = c(FALSE, FALSE),
exportAllgRNAs = c("no", "all", "fasta", "genbank"),
annotatePaired = FALSE,
overlap.gRNA.positions = c(17, 18),
findPairedgRNAOnly = FALSE,
min.gap = 0,
max.gap = 20,
gRNA.name.prefix = "_gR",
PAM.size = 3,
gRNA.size = 20,
PAM = "NGG",
PAM.pattern = "NNG$|NGN$",
allowed.mismatch.PAM = 1,
max.mismatch = 3,
outputDir,
upstream = 0,
downstream = 0,
weights = c(0, 0, 0.014, 0, 0, 0.395, 0.317, 0, 0.389, 0.079, 0.445, 0.508, 0.613,
0.851, 0.732, 0.828, 0.615, 0.804, 0.685, 0.583),
overwrite = FALSE,
baseBeforegRNA = 4,
baseAfterPAM = 3,
featureWeightMatrixFile = system.file("extdata", "DoenchNBT2014.csv", package =
"CRISPRseek"),
foldgRNAs = FALSE,
gRNA.backbone = "GUUUUAGAGCUAGAAAUAGCAAGUUAAAAUAAGGCUAGUCCGUUAUCAACUUGAAAAAGUGGCACCGAGUCGGUGCUUUUUU",
temperature = 37,
scoring.method = c("Hsu-Zhang", "CFDscore"),
subPAM.activity = hash(AA = 0, AC = 0, AG = 0.259259259, AT = 0, CA = 0, CC = 0, CG =
0.107142857, CT = 0, GA = 0.069444444, GC = 0.022222222, GG = 1, GT = 0.016129032, TA
= 0, TC = 0, TG = 0.038961039, TT = 0),
subPAM.position = c(22, 23),
PAM.location = "3prime",
rule.set = c("Root_RuleSet1_2014", "Root_RuleSet2_2016", "CRISPRscan", "DeepCpf1"),
mismatch.activity.file = system.file("extdata",
"NatureBiot2016SuppTable19DoenchRoot.csv", package = "CRISPRseek")
)
inputFile1Path |
Sequence input file 1 path that contains one of the two sequences to be searched for potential gRNAs. It can also be a DNAStringSet object with names field set. Please see examples below. |
inputFile2Path |
Sequence input file 2 path that contains one of the two sequences to be searched for potential gRNAs. It can also be a DNAStringSet object with names field set. Please see examples below. |
inputNames |
Name of the input sequences when inputFile1Path and inputFile2Path are DNAStringSet instead of file path |
format |
Format of the input files, fasta, fastq and bed format are supported, default fasta |
header |
Indicate whether the input file contains header, default FALSE, only applies to bed format |
findgRNAsWithREcutOnly |
Indicate whether to find gRNAs overlap with restriction enzyme recognition pattern |
searchDirection |
Indicate whether perfrom gRNA in both sequences and off-target search against each other (both) or search gRNA in input1 and off-target analysis in input2 (1to2), or vice versa (2to1) |
BSgenomeName |
BSgenome object. Please refer to available.genomes in BSgenome package. For example, BSgenome.Hsapiens.UCSC.hg19 for hg19, BSgenome.Mmusculus.UCSC.mm10 for mm10, BSgenome.Celegans.UCSC.ce6 for ce6, BSgenome.Rnorvegicus.UCSC.rn5 for rn5, BSgenome.Drerio.UCSC.danRer7 for Zv9, and BSgenome.Dmelanogaster.UCSC.dm3 for dm3 |
baseEditing |
Indicate whether to design gRNAs for base editing. Default to FALSE If TRUE, please set baseEditing = TRUE, targetBase and editingWidow accordingly. |
targetBase |
Applicable only when baseEditing is set to TRUE. It is used to indicate the target base for base editing systems, default to C for converting C to T in the CBE system. Please change it to A if you intend to use the ABE system. |
editingWindow |
Applicable only when baseEditing is set to TRUE. It is used to indicate the effective editing window, default to 4 to 8 which is for the original CBE system. Please change it accordingly if the system you use have a different editing window. |
editingWindow.offtargets |
Applicable only when baseEditing is set to TRUE. It is used to indicate the effective editing window to consider for the offtargets search only, default to 4 to 8 which is for the original CBE system. Please change it accordingly if the system you use have a different editing window, or you would like to include offtargets with the target base in a larger editing window. |
REpatternFile |
File path containing restriction enzyme cut patters |
minREpatternSize |
Minimum restriction enzyme recognition pattern length required for the enzyme pattern to be searched for, default 6 |
findgRNAs |
Indicate whether to find gRNAs from the sequences in the input file or skip the step of finding gRNAs, default TRUE for both input sequences. Set it to FALSE if the input file contains user selected gRNAs plus PAM already. |
removegRNADetails |
Indicate whether to remove the detailed gRNA information such as efficacy file and restriction enzyme cut sites, default false for both input sequences. Set it to TRUE if the input file contains the user selected gRNAs plus PAM already. |
exportAllgRNAs |
Indicate whether to output all potential gRNAs to a file in fasta format, genbank format or both. Default to no. |
annotatePaired |
Indicate whether to output paired information, default to FALSE |
overlap.gRNA.positions |
The required overlap positions of gRNA and restriction enzyme cut site, default 17 and 18 |
findPairedgRNAOnly |
Choose whether to only search for paired gRNAs in such an orientation that the first one is on minus strand called reverse gRNA and the second one is on plus strand called forward gRNA. TRUE or FALSE, default FALSE |
min.gap |
Minimum distance between two oppositely oriented gRNAs to be valid paired gRNAs. Default 0 |
max.gap |
Maximum distance between two oppositely oriented gRNAs to be valid paired gRNAs. Default 20 |
gRNA.name.prefix |
The prefix used when assign name to found gRNAs, default _gR, short for guided RNA. |
PAM.size |
PAM length, default 3 |
gRNA.size |
The size of the gRNA, default 20 |
PAM |
PAM sequence after the gRNA, default NGG |
PAM.pattern |
Regular expression of PAM, default NNG or NGN for spCas9. For cpf1, ^TTTN since it is a 5 prime PAM sequence |
allowed.mismatch.PAM |
Maximum number of mismatches allowed to the PAM sequence, default to 1 for PAM.pattern NNG or NGN PAM |
max.mismatch |
Maximum mismatch allowed to search the off targets in the other sequence, default 3 |
outputDir |
the directory where the sequence comparison results will be written to |
upstream |
upstream offset from the bed input starts to search for gRNA and/or offtargets, default 0 |
downstream |
downstream offset from the bed input ends to search for gRNA and/or offtargets, default 0 |
weights |
numeric vector size of gRNA length, default c(0, 0, 0.014, 0, 0, 0.395, 0.317, 0, 0.389, 0.079, 0.445, 0.508, 0.613, 0.851, 0.732, 0.828, 0.615, 0.804, 0.685, 0.583) which is used in Hsu et al., 2013 cited in the reference section |
overwrite |
overwrite the existing files in the output directory or not, default TRUE |
baseBeforegRNA |
Number of bases before gRNA used for calculating gRNA efficiency, default 4 Please note, for PAM located on the 5 prime, need to specify the number of bases before the PAM sequence plus PAM size. |
baseAfterPAM |
Number of bases after PAM used for calculating gRNA efficiency, default 3 for spCas9 Please note, for PAM located on the 5 prime, need to include the length of the gRNA plus the extended sequence on the 3 prime |
featureWeightMatrixFile |
Feature weight matrix file used for calculating gRNA efficiency. By default DoenchNBT2014 weight matrix is used. To use alternative weight matrix file, please input a csv file with first column containing significant features and the second column containing the corresponding weights for the features. Please see Doench et al., 2014 for details. |
foldgRNAs |
Default FALSE. If set to TRUE, summary file will contain minimum free energy of the secondary structure of gRNA with gRNA backbone from GeneRfold package provided that GeneRfold package has been installed. |
gRNA.backbone |
gRNA backbone constant region sequence. Default to the sequence in Sp gRNA backbone. |
temperature |
temperature in celsius. Default to 37 celsius. |
scoring.method |
Indicates which method to use for offtarget cleavage rate estimation, currently two methods are supported, Hsu-Zhang and CFDscore |
subPAM.activity |
Applicable only when scoring.method is set to CFDscore A hash to represent the cleavage rate for each alternative sub PAM sequence relative to preferred PAM sequence |
subPAM.position |
Applicable only when scoring.method is set to CFDscore The start and end positions of the sub PAM. Default to 22 and 23 for SP with 20bp gRNA and NGG as preferred PAM |
PAM.location |
PAM location relative to gRNA. For example, spCas9 PAM is located on the 3 prime (3prime) while cpf1 PAM is located on the 5 prime (5prime) |
rule.set |
Specify a rule set scoring system for calculating gRNA efficacy. Please note that Root_RuleSet2_2016 requires the following python packages with specified verion and python 2.7. 1. scikit-learn 0.16.1 2. pickle 3. pandas 4. numpy 5. scipy |
mismatch.activity.file |
Applicable only when scoring.method is set to CFDscore A comma separated (csv) file containing the cleavage rates for all possible types of single nucleotide mismatche at each position of the gRNA. By default, using the supplemental Table 19 from Doench et al., Nature Biotechnology 2016 |
Return a data frame with all potential gRNAs from both sequences. In addition, a tab delimited file scoresFor2InputSequences.xls is also saved in the outputDir, sorted by scoreDiff descending.
name |
name of the gRNA |
gRNAPlusPAM |
gRNA plus PAM sequence |
targetInSeq1 |
target/off-target sequence including PAM in the 1st input sequence file |
targetInSeq2 |
target/off-target sequence incuding PAM in the 2nd input sequence file |
guideAlignment2Offtarget |
alignment of gRNA to the other input sequence (off-target sequence) |
offTargetStrand |
strand of the other sequence (off-target sequence) the gRNA align to |
scoreForSeq1 |
score for the target sequence in the 1st input sequence file |
scoreForSeq2 |
score for the target sequence in the 1st input sequence file |
mismatch.distance2PAM |
distances of mismatch to PAM, e.g., 14 means the mismatch is 14 bp away from PAM |
n.mismatch |
number of mismatches between the off-target and the gRNA |
targetSeqName |
the name of the input sequence where the target sequence is located |
scoreDiff |
scoreForSeq1 - scoreForSeq2 |
bracket.notation |
folded gRNA in bracket notation |
mfe.sgRNA |
minimum free energy of sgRNA |
mfe.diff |
mfe.sgRNA-mfe.backbone |
mfe.backbone |
minimum free energy of the gRNA backbone by itself |
Lihua Julie Zhu
Patrick D Hsu, David A Scott, Joshua A Weinstein, F Ann Ran, Silvana Konermann, Vineeta Agarwala, Yinqing Li, Eli J Fine, Xuebing Wu, Ophir Shalem, Thomas J Cradick, Luciano A Marraffini, Gang Bao & Feng Zhang (2013) DNA targeting specificity of rNA-guided Cas9 nucleases. Nature Biotechnology 31:827-834
CRISPRseek
library(CRISPRseek)
inputFile1Path <- system.file("extdata", "rs362331T.fa",
package = "CRISPRseek")
inputFile2Path <- system.file("extdata", "rs362331C.fa",
package = "CRISPRseek")
REpatternFile <- system.file("extdata", "NEBenzymes.fa",
package = "CRISPRseek")
seqs <- compare2Sequences(inputFile1Path, inputFile2Path,
outputDir = getwd(),
REpatternFile = REpatternFile, overwrite = TRUE)
seqs2 <- compare2Sequences(inputFile1Path, inputFile2Path,
inputNames=c("Seq1", "Seq2"),
scoring.method = "CFDscore",
outputDir = getwd(),
overwrite = TRUE, baseEditing = TRUE)
inputFile1Path <-
DNAStringSet(
"TAATATTTTAAAATCGGTGACGTGGGCCCAAAACGAGTGCAGTTCCAAAGGCACCCACCTGTGGCAG"
)
## when set inputFile1Path to a DNAStringSet object, it is important
## to call names
names(inputFile1Path) <- "seq1"
inputFile2Path <-
DNAStringSet(
"TAATATTTTAAAATCGGTGACGTGGGCCCAAAACGAGTGCAGTTCCAAAGGCACCCACCTGTGGCAG"
)
## when set inputFile2Path to a DNAStringSet object, it is important
## to call names
names(inputFile2Path) <- "seq2"
seqs <- compare2Sequences(inputFile1Path, inputFile2Path,
inputNames=c("Seq1", "Seq2"),
scoring.method = "CFDscore",
outputDir = getwd(),
overwrite = TRUE)
seqs2 <- compare2Sequences(inputFile1Path, inputFile2Path,
inputNames=c("Seq1", "Seq2"),
scoring.method = "CFDscore",
outputDir = getwd(),
overwrite = TRUE, baseEditing = TRUE)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.