GUIDEseqAnalysis: Analysis pipeline for GUIDE-seq dataset
In LihuaJulieZhu/GUIDEseq: GUIDE-seq and PEtag-seq analysis pipeline

GUIDEseqAnalysis

R Documentation

Analysis pipeline for GUIDE-seq dataset

Description

A wrapper function that uses the UMI sequence plus the first few bases of each sequence from R1 reads to estimate the starting sequence library, piles up reads with a user defined window and step size, identify the insertion sites (proxy of cleavage sites), merge insertion sites from plus strand and minus strand, followed by off target analysis of extended regions around the identified insertion sites.

Usage

GUIDEseqAnalysis(
  alignment.inputfile,
  umi.inputfile,
  alignment.format = c("auto", "bam", "bed"),
  umi.header = FALSE,
  read.ID.col = 1L,
  umi.col = 2L,
  umi.sep = "\t",
  BSgenomeName,
  gRNA.file,
  outputDir,
  n.cores.max = 1L,
  keep.chrM = FALSE,
  keep.R1only = TRUE,
  keep.R2only = TRUE,
  concordant.strand = TRUE,
  max.paired.distance = 1000L,
  min.mapping.quality = 30L,
  max.R1.len = 130L,
  max.R2.len = 130L,
  min.umi.count = 1L,
  max.umi.count = 100000L,
  min.read.coverage = 1L,
  apply.both.max.len = FALSE,
  same.chromosome = TRUE,
  distance.inter.chrom = -1L,
  min.R1.mapped = 20L,
  min.R2.mapped = 20L,
  apply.both.min.mapped = FALSE,
  max.duplicate.distance = 0L,
  umi.plus.R1start.unique = TRUE,
  umi.plus.R2start.unique = TRUE,
  window.size = 20L,
  step = 20L,
  bg.window.size = 5000L,
  min.reads = 5L,
  min.reads.per.lib = 1L,
  min.peak.score.1strandOnly = 5L,
  min.SNratio = 2,
  maxP = 0.01,
  stats = c("poisson", "nbinom"),
  p.adjust.methods = c("none", "BH", "holm", "hochberg", "hommel", "bonferroni", "BY",
    "fdr"),
  distance.threshold = 40L,
  max.overlap.plusSig.minusSig = 30L,
  plus.strand.start.gt.minus.strand.end = TRUE,
  keepPeaksInBothStrandsOnly = TRUE,
  gRNA.format = "fasta",
  overlap.gRNA.positions = c(17, 18),
  upstream = 25L,
  downstream = 25L,
  PAM.size = 3L,
  gRNA.size = 20L,
  PAM = "NGG",
  PAM.pattern = "NNN$",
  max.mismatch = 6L,
  allowed.mismatch.PAM = 2L,
  overwrite = TRUE,
  weights = c(0, 0, 0.014, 0, 0, 0.395, 0.317, 0, 0.389, 0.079, 0.445, 0.508, 0.613,
    0.851, 0.732, 0.828, 0.615, 0.804, 0.685, 0.583),
  orderOfftargetsBy = c("peak_score", "predicted_cleavage_score", "n.guide.mismatch"),
  descending = TRUE,
  keepTopOfftargetsOnly = TRUE,
  keepTopOfftargetsBy = c("predicted_cleavage_score", "n.mismatch"),
  scoring.method = c("Hsu-Zhang", "CFDscore"),
  subPAM.activity = hash(AA = 0, AC = 0, AG = 0.259259259, AT = 0, CA = 0, CC = 0, CG =
    0.107142857, CT = 0, GA = 0.069444444, GC = 0.022222222, GG = 1, GT = 0.016129032, TA
    = 0, TC = 0, TG = 0.038961039, TT = 0),
  subPAM.position = c(22, 23),
  PAM.location = "3prime",
  mismatch.activity.file = system.file("extdata",
    "NatureBiot2016SuppTable19DoenchRoot.csv", package = "CRISPRseek"),
  bulge.activity.file = system.file("extdata",
    "NatureBiot2016SuppTable19DoenchRoot.xlsx", package = "GUIDEseq"),
  txdb,
  orgAnn,
  mat,
  includeBulge = FALSE,
  max.n.bulge = 2L,
  min.peak.score.bulge = 60L,
  removeDuplicate = TRUE,
  resume = FALSE,
  ignoreTagmSite = FALSE,
  ignoreUMI = FALSE
)

Arguments

`alignment.inputfile`	The alignment file. Currently supports bam and bed output file with CIGAR information. Suggest run the workflow binReads.sh, which sequentially runs barcode binning, adaptor removal, alignment to genome, alignment quality filtering, and bed file conversion. Please download the workflow function and its helper scripts at http://mccb.umassmed.edu/GUIDE-seq/binReads/
`umi.inputfile`	A text file containing at least two columns, one is the read identifier and the other is the UMI or UMI plus the first few bases of R1 reads. Suggest use getUMI.sh to generate this file. Please download the script and its helper scripts at http://mccb.umassmed.edu/GUIDE-seq/getUMI/
`alignment.format`	The format of the alignment input file. Default bed file format. Currently only bed file format is supported, which is generated from binReads.sh
`umi.header`	Indicates whether the umi input file contains a header line or not. Default to FALSE
`read.ID.col`	The index of the column containing the read identifier in the umi input file, default to 1
`umi.col`	The index of the column containing the umi or umi plus the first few bases of sequence from the R1 reads, default to 2
`umi.sep`	column separator in the umi input file, default to tab
`BSgenomeName`	BSgenome object. Please refer to available.genomes in BSgenome package. For example, BSgenome.Hsapiens.UCSC.hg19 for hg19, BSgenome.Mmusculus.UCSC.mm10 for mm10, BSgenome.Celegans.UCSC.ce6 for ce6, BSgenome.Rnorvegicus.UCSC.rn5 for rn5, BSgenome.Drerio.UCSC.danRer7 for Zv9, and BSgenome.Dmelanogaster.UCSC.dm3 for dm3
`gRNA.file`	gRNA input file path or a DNAStringSet object that contains the target sequence (gRNA plus PAM)
`outputDir`	the directory where the off target analysis and reports will be written to
`n.cores.max`	Indicating maximum number of cores to use in multi core mode, i.e., parallel processing, default 1 to disable multicore processing for small dataset.
`keep.chrM`	Specify whether to include alignment from chrM. Default FALSE
`keep.R1only`	Specify whether to include alignment with only R1 without paired R2. Default TRUE
`keep.R2only`	Specify whether to include alignment with only R2 without paired R1. Default TRUE
`concordant.strand`	Specify whether the R1 and R2 should be aligned to the same strand or opposite strand. Default opposite.strand (TRUE)
`max.paired.distance`	Specify the maximum distance allowed between paired R1 and R2 reads. Default 1000 bp
`min.mapping.quality`	Specify min.mapping.quality of acceptable alignments
`max.R1.len`	The maximum retained R1 length to be considered for downstream analysis, default 130 bp. Please note that default of 130 works well when the read length 150 bp. Please also note that retained R1 length is not necessarily equal to the mapped R1 length
`max.R2.len`	The maximum retained R2 length to be considered for downstream analysis, default 130 bp. Please note that default of 130 works well when the read length 150 bp. Please also note that retained R2 length is not necessarily equal to the mapped R2 length
`min.umi.count`	To specify the minimum total count for a umi at the genome level to be included in the subsequent analysis. For example, with min.umi.count set to 2, if a umi only has 1 read in the entire genome, then that umi will be excluded for the subsequent analysis. Please adjust it to a higher number for deeply sequenced library and vice versa.
`max.umi.count`	To specify the maximum count for a umi to be included in the subsequent analysis. Please adjust it to a higher number for deeply sequenced library and vice versa.
`min.read.coverage`	To specify the minimum coverage for a read UMI combination to be included in the subsequent analysis. Please note that this is different from min.umi.count which is less stringent.
`apply.both.max.len`	Specify whether to apply maximum length requirement to both R1 and R2 reads, default FALSE
`same.chromosome`	Specify whether the paired reads are required to align to the same chromosome, default TRUE
`distance.inter.chrom`	Specify the distance value to assign to the paired reads that are aligned to different chromosome, default -1
`min.R1.mapped`	The minimum mapped R1 length to be considered for downstream analysis, default 30 bp.
`min.R2.mapped`	The minimum mapped R2 length to be considered for downstream analysis, default 30 bp.
`apply.both.min.mapped`	Specify whether to apply minimum mapped length requirement to both R1 and R2 reads, default FALSE
`max.duplicate.distance`	Specify the maximum distance apart for two reads to be considered as duplicates, default 0. Currently only 0 is supported
`umi.plus.R1start.unique`	To specify whether two mapped reads are considered as unique if both containing the same UMI and same alignment start for R1 read, default TRUE.
`umi.plus.R2start.unique`	To specify whether two mapped reads are considered as unique if both containing the same UMI and same alignment start for R2 read, default TRUE.
`window.size`	window size to calculate coverage
`step`	step size to calculate coverage
`bg.window.size`	window size to calculate local background
`min.reads`	minimum number of reads to be considered as a peak
`min.reads.per.lib`	minimum number of reads in each library (usually two libraries) to be considered as a peak
`min.peak.score.1strandOnly`	Specify the minimum number of reads for a one-strand only peak to be included in the output. Applicable when set keepPeaksInBothStrandsOnly to FALSE and there is only one library per sample
`min.SNratio`	Specify the minimum signal noise ratio to be called as peaks, which is the coverage normalized by local background.
`maxP`	Specify the maximum p-value to be considered as significant
`stats`	Statistical test, currently only poisson is implemented
`p.adjust.methods`	Adjustment method for multiple comparisons, default none
`distance.threshold`	Specify the maximum gap allowed between the plus strand and the negative strand peak, default 40. Suggest set it to twice of window.size used for peak calling.
`max.overlap.plusSig.minusSig`	Specify the cushion distance to allow sequence error and inprecise integration Default to 30 to allow at most 10 (30-window.size 20) bp (half window) of minus-strand peaks on the right side of plus-strand peaks. Only applicable if plus.strand.start.gt.minus.strand.end is set to TRUE.
`plus.strand.start.gt.minus.strand.end`	Specify whether plus strand peak start greater than the paired negative strand peak end. Default to TRUE
`keepPeaksInBothStrandsOnly`	Indicate whether only keep peaks present in both strands as specified by plus.strand.start.gt.minus.strand.end, max.overlap.plusSig.minusSig and distance.threshold.
`gRNA.format`	Format of the gRNA input file. Currently, fasta is supported
`overlap.gRNA.positions`	The required overlap positions of gRNA and restriction enzyme cut site, default 17 and 18 for SpCas9.
`upstream`	upstream offset from the peak start to search for off targets, default 25 suggest set it to window size
`downstream`	downstream offset from the peak end to search for off targets, default 25 suggest set it to window size
`PAM.size`	PAM length, default 3
`gRNA.size`	The size of the gRNA, default 20
`PAM`	PAM sequence after the gRNA, default NGG
`PAM.pattern`	Regular expression of protospacer-adjacent motif (PAM), default NNN$. Alternatively set it to (NAG\|NGG\|NGA)$ for off target search
`max.mismatch`	Maximum mismatch to the gRNA (not including mismatch to the PAM) allowed in off target search, default 6
`allowed.mismatch.PAM`	Maximum number of mismatches allowed for the PAM sequence plus the number of degenerate sequence in the PAM sequence, default to 2 for NGG PAM
`overwrite`	overwrite the existing files in the output directory or not, default FALSE
`weights`	a numeric vector size of gRNA length, default c(0, 0, 0.014, 0, 0, 0.395, 0.317, 0, 0.389, 0.079, 0.445, 0.508, 0.613, 0.851, 0.732, 0.828, 0.615, 0.804, 0.685, 0.583) for SPcas9 system, which is used in Hsu et al., 2013 cited in the reference section. Please make sure that the number of elements in this vector is the same as the gRNA.size, e.g., pad 0s at the beginning of the vector.
`orderOfftargetsBy`	Criteria to order the offtargets, which works together with the descending parameter
`descending`	Indicate the output order of the offtargets, i.e., in the descending or ascending order.
`keepTopOfftargetsOnly`	Output all offtargets or the top offtarget using the keepOfftargetsBy criteria, default to the top offtarget
`keepTopOfftargetsBy`	Output the top offtarget for each called peak using the keepTopOfftargetsBy criteria, If set to predicted_cleavage_score, then the offtargets with the highest predicted cleavage score will be retained If set to n.mismatch, then the offtarget with the lowest number of mismatch to the target sequence will be retained
`scoring.method`	Indicates which method to use for offtarget cleavage rate estimation, currently two methods are supported, Hsu-Zhang and CFDscore
`subPAM.activity`	Applicable only when scoring.method is set to CFDscore A hash to represent the cleavage rate for each alternative sub PAM sequence relative to preferred PAM sequence
`subPAM.position`	Applicable only when scoring.method is set to CFDscore The start and end positions of the sub PAM. Default to 22 and 23 for SP with 20bp gRNA and NGG as preferred PAM
`PAM.location`	PAM location relative to gRNA. For example, default to 3prime for spCas9 PAM. Please set to 5prime for cpf1 PAM since it's PAM is located on the 5 prime end
`mismatch.activity.file`	Applicable only when scoring.method is set to CFDscore A comma separated (csv) file containing the cleavage rates for all possible types of single nucleotide mismatche at each position of the gRNA. By default, use the supplemental Table 19 from Doench et al., Nature Biotechnology 2016
`bulge.activity.file`	Used for predicting indel effect on offtarget cleavage score. An excel file with the second sheet for deletion activity and the third sheet for Insertion. By default, use the supplemental Table 19 from Doench et al., Nature Biotechnology 2016
`txdb`	TxDb object, for creating and using TxDb object, please refer to GenomicFeatures package. For a list of existing TxDb object, please search for annotation package starting with Txdb at http://www.bioconductor.org/packages/release/BiocViews.html#___AnnotationData, such as TxDb.Rnorvegicus.UCSC.rn5.refGene for rat, TxDb.Mmusculus.UCSC.mm10.knownGene for mouse, TxDb.Hsapiens.UCSC.hg19.knownGene for human, TxDb.Dmelanogaster.UCSC.dm3.ensGene for Drosophila and TxDb.Celegans.UCSC.ce6.ensGene for C.elegans
`orgAnn`	organism annotation mapping such as org.Hs.egSYMBOL in org.Hs.eg.db package for human
`mat`	nucleotide substitution matrix. Function nucleotideSubstitutionMatrix can be used for creating customized nucleotide substitution matrix. By default, match = 1, mismatch = -1, and baseOnly = TRUE Only applicalbe with includeBulge set to TRUE
`includeBulge`	indicates whether including offtargets with indels default to FALSE
`max.n.bulge`	offtargets with at most this number of indels to be included in the offtarget list. Only applicalbe with includeBulge set to TRUE
`min.peak.score.bulge`	default to 60. Set it to a higher number to speed up the alignment with bulges. Any peaks with peak.score less than min.peak.score.bulge will not be included in the offtarget analysis with bulges. However, all peaks will be included in the offtarget analysis with mismatches.
`removeDuplicate`	default to TRUE. Set it to FALSE if PCR duplicates not to be removed for testing purpose
`resume`	default to FALSE to restart the analysis. set it TRUE to resume an analysis.
`ignoreTagmSite`	default to FALSE. To collapse reads with the same integration site and UMI but with different tagmentation site, set the option to TRUE.
`ignoreUMI`	default to FALSE. To collapse reads with the same integration and tagmentation site but with different UMIs, set the option to TRUE and retain the UMI that appears most frequently for each combination of integration and tagmentation site. In case of ties, randomly select one UMI.

Value

`offTargets`	a data frame, containing all input peaks with potential gRNA binding sites, mismatch number and positions, alignment to the input gRNA and predicted cleavage score.
`merged.peaks`	merged peaks as GRanges with count (peak height), bg (local background), SNratio (signal noise ratio), p-value, and option adjusted p-value
`peaks`	GRanges with count (peak height), bg (local background), SNratio (signal noise ratio), p-value, and option adjusted p-value
`uniqueCleavages`	Cleavage sites with one site per UMI as GRanges with metadata column total set to 1 for each range
`read.summary`	One table per input mapping file that contains the number of reads for each chromosome location
`sequence.depth`	sequence depth in the input alignment files

Author(s)

Lihua Julie Zhu

References

Lihua Julie Zhu, Michael Lawrence, Ankit Gupta, Herve Pages, Alper Ku- cukural, Manuel Garber and Scot A. Wolfe. GUIDEseq: a bioconductor package to analyze GUIDE-Seq datasets for CRISPR-Cas nucleases. BMC Genomics. 2017. 18:379

Examples


if(interactive())
    {
        library("BSgenome.Hsapiens.UCSC.hg19")
        umiFile <- system.file("extdata", "UMI-HEK293_site4_chr13.txt",
           package = "GUIDEseq")
        alignFile <- system.file("extdata","bowtie2.HEK293_site4_chr13.sort.bam" ,
            package = "GUIDEseq")
        gRNA.file <- system.file("extdata","gRNA.fa", package = "GUIDEseq")
        guideSeqRes <- GUIDEseqAnalysis(
            alignment.inputfile = alignFile,
            umi.inputfile = umiFile, gRNA.file = gRNA.file,
            orderOfftargetsBy = "peak_score",
            descending = TRUE,
            keepTopOfftargetsBy = "predicted_cleavage_score",
            scoring.method = "CFDscore",
            BSgenomeName = Hsapiens, min.reads = 80, n.cores.max = 1)
        guideSeqRes$offTargets
        names(guideSeqRes)
   }

LihuaJulieZhu/GUIDEseq documentation built on March 27, 2024, 9:42 p.m.