exomePeak2: Peak Calling and Peak Statistics Quantification on MeRIP-seq...
In exomePeak2: Bias Awared Peak Calling and Quantification for MeRIP-Seq

Description Usage Arguments Details Value See Also Examples

exomePeak2 conducts peak calling and peak statistics calculation from BAM files of a MeRIP-seq experiment. The function integrates the following steps of a standard MeRIP-seq data analysis pipeline.

Check and index the BAM files with scanMeripBAM.
Call modification peaks on exons with exomePeakCalling.
Calculate offset factors of GC content biases with normalizeGC.
Calculate (differential) modification statistics with the generalized linear model (GLM) using glmM or glmDM.
Export the peaks/sites statistics with user defined format by exportResults.

See the help pages of the corresponding functions for the complete documentation.

exomePeak2(
  bam_ip = NULL,
  bam_input = NULL,
  bam_treated_ip = NULL,
  bam_treated_input = NULL,
  txdb = NULL,
  bsgenome = NULL,
  genome = NA,
  gff_dir = NULL,
  mod_annot = NULL,
  paired_end = FALSE,
  library_type = c("unstranded", "1st_strand", "2nd_strand"),
  fragment_length = 100,
  binding_length = 25,
  step_length = binding_length,
  peak_width = fragment_length/2,
  pc_count_cutoff = 5,
  bg_count_cutoff = 50,
  p_cutoff = 1e-05,
  p_adj_cutoff = NULL,
  log2FC_cutoff = 1,
  consistent_peak = FALSE,
  consistent_log2FC_cutoff = 1,
  consistent_fdr_cutoff = 0.05,
  alpha = 0.05,
  p0 = 0.8,
  parallel = FALSE,
  background_method = c("all", "Gaussian_mixture", "m6Aseq_prior", "manual"),
  manual_background = NULL,
  correct_GC_bg = TRUE,
  qtnorm = FALSE,
  glm_type = c("DESeq2", "Poisson", "NB"),
  LFC_shrinkage = c("apeglm", "ashr", "Gaussian", "none"),
  export_results = TRUE,
  export_format = c("CSV", "BED", "RDS"),
  table_style = c("bed", "granges"),
  save_plot_GC = TRUE,
  save_plot_analysis = FALSE,
  save_plot_name = "",
  save_dir = "exomePeak2_output",
  peak_calling_mode = c("exon", "full_tx", "whole_genome")
)

`bam_ip`	a `character` vector for the BAM file directories of the (control) IP samples.
`bam_input`	a `character` vector for the BAM file directories of the (control) input samples.
`bam_treated_ip`	a `character` vector for the BAM file directories of the treated IP samples.
`bam_treated_input`	a `character` vector for the BAM file directories of the treated input samples. If the bam files do not contain treatment group, user should only fill the arguments of `BAM_ip` and `BAM_input`.
`txdb`	a `TxDb` object for the transcript annotation. If the `TxDb` object is not available, it could be a `character` string of the UCSC genome name which is acceptable by `makeTxDbFromUCSC`. For example: `"hg19"`.
`bsgenome`	a `BSgenome` object for the genome sequence information. If the `BSgenome` object is not available, it could be a `character` string of the UCSC genome name which is acceptable by `getBSgenome`. For example: `"hg19"`.
`genome`	a `character` string of the UCSC genome name which is acceptable by `getBSgenome` or/and `makeTxDbFromUCSC`. For example: `"hg19"`. By default, the argument = NA, it should be provided when the `BSgenome` or/and the `TxDb` object are not available.
`gff_dir`	optional, a `character` which specifies the directory toward a gene annotation GFF/GTF file, it is applied when the `TxDb` object is not available, default `= NULL`.
`mod_annot`	a `GRanges` object for user provided single based RNA modification annotation. If user provides the single based RNA modification annotation, exomePeak2 will perform reads count and (differential) modification quantification on the provided annotation. The single base annotation will be flanked by length = floor(fragment_length - binding_length/2) to account for the fragment length of the sequencing library.
`paired_end`	a `logical` of whether the data comes from the Paired-End Library, `TRUE` if the data is Paired-End sequencing; default `FALSE`.
`library_type`	a `character` specifying the protocal type of the RNA-seq library, can be one in `c("unstranded", "1st_strand", "2nd_strand")`; default `= "unstranded"`. unstranded The randomly primed RNA-seq library type, i.e. both the strands generated during the first and the second strand sythesis are sequenced; example: Standard Illumina. 1st_strand The first strand-specific RNA-seq library, only the strand generated during the first strand sythesis is sequenced; examples: dUTP, NSR, NNSR. 2nd_strand The second strand-specific RNA-seq library, only the strand generated during the second strand sythesis is sequenced; examples: Ligation, Standard SOLiD.
`fragment_length`	a positive integer number for the expected fragment length in nucleotides; default `= 100`.
`binding_length`	a positive integer number for the expected binding length of the anti-modification antibody in IP samples; default `= 25`.
`step_length`	a positive integer number for the shift distances of the sliding window; default `= binding_length`.
`peak_width`	a `numeric` value for the minimum width of the merged peaks; default `= fragment_length` .
`pc_count_cutoff`	a `numeric` value for the cutoff on average window's reads count in peak calling; default `= 5`.
`bg_count_cutoff`	a `numeric` value for the cutoff on average window's reads count in background identification; default `= 50`.
`p_cutoff`	a `numeric` value for the cutoff on p values in peak calling; default `= 1e-05`.
`p_adj_cutoff`	a `numeric` value for the cutoff on Benjamini Hochberg adjusted p values in peak calling; default `= NULL`.
`log2FC_cutoff`	a `numeric` value for the cutoff on log2 IP over input fold changes in peak calling; default `= 1`.
`consistent_peak`	a `logical` of whether the positive peaks returned should be consistent among all the replicates; default `= FALSE`.
`consistent_log2FC_cutoff`	a `numeric` for the modification log2 fold changes cutoff in the peak consisency calculation; default = 1.
`consistent_fdr_cutoff`	a `numeric` for the BH adjusted C-test p values cutoff in the peak consistency calculation; default = 0.05. Check `ctest`.
`alpha`	a `numeric` for the binomial quantile used in the consitent peak filter; default `= 0.05`.
`p0`	a `numeric` for the binomial proportion parameter used in the consistent peak filter; default `= 0.8`. For a peak to be consistently methylated, the minimum number of significant enriched replicate pairs is defined as the 1 - alpha quantile of a binomial distribution with p = p0 and N = number of possible pairs between replicates. The consistency defined in this way is equivalent to the rejection of an exact binomial test with null hypothesis of p < p0 and N = replicates number of IP * replicates number of input.
`parallel`	a `logical` value indicating whether to use parallel computation, it will require > 16GB memory if `parallel = TRUE`; default `= FALSE`.
`background_method`	a `character` specifies the method of finding background regions for peak detection, i.e. to identify the windows without modification signal. It could be one of "Gaussian_mixture", "m6Aseq_prior", "manual", and "all"; default `= "all"`. In order to accurately account for the technical variations, it is often important to estimate the sequencing depth and GC content linear effects on windows without modification signals. The following methods are supported in `ExomePeak2` to differentiate the no modification background windows from the modification containig windows. `all` Use all windows as the background. This choise assumes no differences in the effects of technical features between the background and the modification regions. `Gaussian_mixture` The background is identified by Multivariate Gaussian Mixture Model (MGMM) with 2 mixing components on the vectors containing methylation level estimates and GC content, the background regions are predicted by the Bayes Classifier on the learned GMM. `m6Aseq_prior` The background is identified by the prior knowledge of m6A topology, the windows that are not overlapped with long exons (exon length >= 400bp) and 5'UTR are treated as the background windows. This type of background should not be used if the MeRIP-seq data is not targetting on m6A methylation. `manual` The background regions are defined by the user manually at the argument `manual_background`.
`manual_background`	a `GRanges` object for the user provided unmodified background; default `= NULL`.
`correct_GC_bg`	a `logical` value of whether to estimate the GC content linear effect on background regions during modification level quantification; default `= TRUE`. If `= TRUE`, it could lead to a more accurate estimation of GC content bias for the RNA modifications that are highly biologically related to GC content.
`qtnorm`	a `logical` of whether to perform subset quantile normalization after the GC content linear effect correction; default `= FALSE`. If `qtnorm = TRUE`, subset quantile normalization will be applied within the IP and input samples seperately to account for the inherent differences between the marginal distributions of IP and input samples.
`glm_type`	a `character` speciefies the type of Generalized Linear Model (GLM) fitted for the purpose of statistical inference during peak calling, which can be one of the `c("DESeq2", "NB", "Poisson")`. `DESeq2` Fit the GLM defined in the function `DESeq`, which is the NB GLM with regulated estimation of the overdispersion parameters. `NB` Fit the ordinary Negative Binomial (NB) GLM. `Poisson` Fit the Poisson GLM. By default, the DESeq2 GLMs are fitted on the data set with > 1 biological replicates for both the IP and input samples, the Poisson GLM will be fitted otherwise.
`LFC_shrinkage`	a `character` for the method of emperical bayes shrinkage on log2FC, could be one of `c("apeglm", "ashr", "Gaussian", "none")`; Default `= "apeglm"`. see `lfcShrink` for more details; if "none" is selected, only the MLE will be returned.
`export_results`	a `logical` of whether to save the results on disk; default `= TRUE`.
`export_format`	a `character` vector for the format(s) of the result being exported, could be the subset of `c("CSV","BED","RDS")`; Default `= c("CSV","BED","RDS")`.
`table_style`	a `character` for the style of the table being exported, could be one of `c("bed", "granges")`; Default `= "bed"`.
`save_plot_GC`	a `logical` of whether to generate the plots for GC content bias assessment; default `= TRUE`.
`save_plot_analysis`	a `logical` of whether to generate the plots for genomic analysis on modification sites; default `= FALSE`.
`save_plot_name`	a `character` for the name of the plots being saved; Default `= "Plot"`.
`save_dir`	a `character` for the name of the directory being saved; Default `= "exomePeak2_output"`.
`peak_calling_mode`	a `character` specifies the scope of peak calling on genome, can be one of `c("exon", "full_transcript", "whole_genome")`; Default `= "exon"`. `exon` generate sliding windows on exon regions. `full_transcript` generate sliding windows on the full transcripts (include both introns and exons). `whole_genome` generate sliding windows on the whole genome (include introns, exons, and the intergenic regions). P.S. The full transcript mode and the whole genome mode demand big memory size (> 4GB) for large genomes.

exomePeak2 call RNA modification peaks and calculate peak statistics from BAM files of a MeRIP-seq experiment.

The transcript annotation (from either the TxDb object or the GFF file) should be provided to perform analysis on exons.

The BSgenome object is also required to perform the GC content bias adjustment. If the bsgenome and the genome arguments are not provided (= NULL), the downstream analysis will proceed without GC content bias corrections.

If the BAM files in treated samples are provided at the arguments bam_treated_ip and bam_treated_input, the statistics of differential modification analysis on peaks/sites will be reported.

Under default setting, exomePeak2 will save the results of (differential) modification analysis under a folder named by 'exomePeak2_output'. The results generated include a BED file and a CSV table that store the locations and statistics of the (differential) modified peaks/sites.

a SummarizedExomePeak object.

exomePeakCalling, glmM, glmDM, normalizeGC, exportResults, plotLfcGC

#Specify File Directories

GENE_ANNO_GTF = system.file("extdata", "example.gtf", package="exomePeak2")

f1 = system.file("extdata", "IP1.bam", package="exomePeak2")
f2 = system.file("extdata", "IP2.bam", package="exomePeak2")
f3 = system.file("extdata", "IP3.bam", package="exomePeak2")
f4 = system.file("extdata", "IP4.bam", package="exomePeak2")
IP_BAM = c(f1,f2,f3,f4)
f1 = system.file("extdata", "Input1.bam", package="exomePeak2")
f2 = system.file("extdata", "Input2.bam", package="exomePeak2")
f3 = system.file("extdata", "Input3.bam", package="exomePeak2")
INPUT_BAM = c(f1,f2,f3)

# Peak Calling

sep <- exomePeak2(bam_ip = IP_BAM,
                  bam_input = INPUT_BAM,
                  gff_dir = GENE_ANNO_GTF,
                  genome = "hg19",
                  paired_end = FALSE)
sep

# Differential Modification Analysis on Modification Peaks (Comparison of Two Conditions)

f1 = system.file("extdata", "treated_IP1.bam", package="exomePeak2")
TREATED_IP_BAM = c(f1)
f1 = system.file("extdata", "treated_Input1.bam", package="exomePeak2")
TREATED_INPUT_BAM = c(f1)

sep <- exomePeak2(bam_ip = IP_BAM,
                  bam_input = INPUT_BAM,
                  bam_treated_input = TREATED_INPUT_BAM,
                  bam_treated_ip = TREATED_IP_BAM,
                  gff_dir = GENE_ANNO_GTF,
                  genome = "hg19",
                  paired_end = FALSE)
sep

# Modification Level Quantification on Single Based Modification Annotation

f2 = system.file("extdata", "mod_annot.rds", package="exomePeak2")

MOD_ANNO_GRANGE <- readRDS(f2)

sep <- exomePeak2(bam_ip = IP_BAM,
                  bam_input = INPUT_BAM,
                  gff_dir = GENE_ANNO_GTF,
                  genome = "hg19",
                  paired_end = FALSE,
                  mod_annot = MOD_ANNO_GRANGE)
sep

# Differential Modification Analysis on Single Based Modification Annotation

sep <- exomePeak2(bam_ip = IP_BAM,
                  bam_input = INPUT_BAM,
                  bam_treated_input = TREATED_INPUT_BAM,
                  bam_treated_ip = TREATED_IP_BAM,
                  gff_dir = GENE_ANNO_GTF,
                  genome = "hg19",
                  paired_end = FALSE,
                  mod_annot = MOD_ANNO_GRANGE)
sep