exomePeak2: Peak Calling and Peak Statistics Quantification on MeRIP-seq...

Description Usage Arguments Details Value See Also Examples

View source: R/exomePeak2.R

Description

exomePeak2 conducts peak calling and peak statistics calculation from BAM files of a MeRIP-seq experiment. The function integrates the following steps of a standard MeRIP-seq data analysis pipeline.

  1. Check and index the BAM files with scanMeripBAM.

  2. Call modification peaks on exons with exomePeakCalling.

  3. Calculate offset factors of GC content biases with normalizeGC.

  4. Calculate (differential) modification statistics with the generalized linear model (GLM) using glmM or glmDM.

  5. Export the peaks/sites statistics with user defined format by exportResults.

See the help pages of the corresponding functions for the complete documentation.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
exomePeak2(
  bam_ip = NULL,
  bam_input = NULL,
  bam_treated_ip = NULL,
  bam_treated_input = NULL,
  txdb = NULL,
  bsgenome = NULL,
  genome = NA,
  gff_dir = NULL,
  mod_annot = NULL,
  paired_end = FALSE,
  library_type = c("unstranded", "1st_strand", "2nd_strand"),
  fragment_length = 100,
  binding_length = 25,
  step_length = binding_length,
  peak_width = fragment_length/2,
  pc_count_cutoff = 5,
  bg_count_cutoff = 50,
  p_cutoff = 1e-05,
  p_adj_cutoff = NULL,
  log2FC_cutoff = 1,
  consistent_peak = FALSE,
  consistent_log2FC_cutoff = 1,
  consistent_fdr_cutoff = 0.05,
  alpha = 0.05,
  p0 = 0.8,
  parallel = FALSE,
  background_method = c("all", "Gaussian_mixture", "m6Aseq_prior", "manual"),
  manual_background = NULL,
  correct_GC_bg = TRUE,
  qtnorm = FALSE,
  glm_type = c("DESeq2", "Poisson", "NB"),
  LFC_shrinkage = c("apeglm", "ashr", "Gaussian", "none"),
  export_results = TRUE,
  export_format = c("CSV", "BED", "RDS"),
  table_style = c("bed", "granges"),
  save_plot_GC = TRUE,
  save_plot_analysis = FALSE,
  save_plot_name = "",
  save_dir = "exomePeak2_output",
  peak_calling_mode = c("exon", "full_tx", "whole_genome")
)

Arguments

bam_ip

a character vector for the BAM file directories of the (control) IP samples.

bam_input

a character vector for the BAM file directories of the (control) input samples.

bam_treated_ip

a character vector for the BAM file directories of the treated IP samples.

bam_treated_input

a character vector for the BAM file directories of the treated input samples.

If the bam files do not contain treatment group, user should only fill the arguments of BAM_ip and BAM_input.

txdb

a TxDb object for the transcript annotation.

If the TxDb object is not available, it could be a character string of the UCSC genome name which is acceptable by makeTxDbFromUCSC. For example: "hg19".

bsgenome

a BSgenome object for the genome sequence information.

If the BSgenome object is not available, it could be a character string of the UCSC genome name which is acceptable by getBSgenome. For example: "hg19".

genome

a character string of the UCSC genome name which is acceptable by getBSgenome or/and makeTxDbFromUCSC. For example: "hg19".

By default, the argument = NA, it should be provided when the BSgenome or/and the TxDb object are not available.

gff_dir

optional, a character which specifies the directory toward a gene annotation GFF/GTF file, it is applied when the TxDb object is not available, default = NULL.

mod_annot

a GRanges object for user provided single based RNA modification annotation.

If user provides the single based RNA modification annotation, exomePeak2 will perform reads count and (differential) modification quantification on the provided annotation.

The single base annotation will be flanked by length = floor(fragment_length - binding_length/2) to account for the fragment length of the sequencing library.

paired_end

a logical of whether the data comes from the Paired-End Library, TRUE if the data is Paired-End sequencing; default FALSE.

library_type

a character specifying the protocal type of the RNA-seq library, can be one in c("unstranded", "1st_strand", "2nd_strand"); default = "unstranded".

unstranded

The randomly primed RNA-seq library type, i.e. both the strands generated during the first and the second strand sythesis are sequenced; example: Standard Illumina.

1st_strand

The first strand-specific RNA-seq library, only the strand generated during the first strand sythesis is sequenced; examples: dUTP, NSR, NNSR.

2nd_strand

The second strand-specific RNA-seq library, only the strand generated during the second strand sythesis is sequenced; examples: Ligation, Standard SOLiD.

fragment_length

a positive integer number for the expected fragment length in nucleotides; default = 100.

binding_length

a positive integer number for the expected binding length of the anti-modification antibody in IP samples; default = 25.

step_length

a positive integer number for the shift distances of the sliding window; default = binding_length.

peak_width

a numeric value for the minimum width of the merged peaks; default = fragment_length .

pc_count_cutoff

a numeric value for the cutoff on average window's reads count in peak calling; default = 5.

bg_count_cutoff

a numeric value for the cutoff on average window's reads count in background identification; default = 50.

p_cutoff

a numeric value for the cutoff on p values in peak calling; default = 1e-05.

p_adj_cutoff

a numeric value for the cutoff on Benjamini Hochberg adjusted p values in peak calling; default = NULL.

log2FC_cutoff

a numeric value for the cutoff on log2 IP over input fold changes in peak calling; default = 1.

consistent_peak

a logical of whether the positive peaks returned should be consistent among all the replicates; default = FALSE.

consistent_log2FC_cutoff

a numeric for the modification log2 fold changes cutoff in the peak consisency calculation; default = 1.

consistent_fdr_cutoff

a numeric for the BH adjusted C-test p values cutoff in the peak consistency calculation; default = 0.05. Check ctest.

alpha

a numeric for the binomial quantile used in the consitent peak filter; default = 0.05.

p0

a numeric for the binomial proportion parameter used in the consistent peak filter; default = 0.8.

For a peak to be consistently methylated, the minimum number of significant enriched replicate pairs is defined as the 1 - alpha quantile of a binomial distribution with p = p0 and N = number of possible pairs between replicates.

The consistency defined in this way is equivalent to the rejection of an exact binomial test with null hypothesis of p < p0 and N = replicates number of IP * replicates number of input.

parallel

a logical value indicating whether to use parallel computation, it will require > 16GB memory if parallel = TRUE; default = FALSE.

background_method

a character specifies the method of finding background regions for peak detection, i.e. to identify the windows without modification signal. It could be one of "Gaussian_mixture", "m6Aseq_prior", "manual", and "all"; default = "all".

In order to accurately account for the technical variations, it is often important to estimate the sequencing depth and GC content linear effects on windows without modification signals.

The following methods are supported in ExomePeak2 to differentiate the no modification background windows from the modification containig windows.

all

Use all windows as the background. This choise assumes no differences in the effects of technical features between the background and the modification regions.

Gaussian_mixture

The background is identified by Multivariate Gaussian Mixture Model (MGMM) with 2 mixing components on the vectors containing methylation level estimates and GC content, the background regions are predicted by the Bayes Classifier on the learned GMM.

m6Aseq_prior

The background is identified by the prior knowledge of m6A topology, the windows that are not overlapped with long exons (exon length >= 400bp) and 5'UTR are treated as the background windows.

This type of background should not be used if the MeRIP-seq data is not targetting on m6A methylation.

manual

The background regions are defined by the user manually at the argument manual_background.

manual_background

a GRanges object for the user provided unmodified background; default = NULL.

correct_GC_bg

a logical value of whether to estimate the GC content linear effect on background regions during modification level quantification; default = TRUE.

If = TRUE, it could lead to a more accurate estimation of GC content bias for the RNA modifications that are highly biologically related to GC content.

qtnorm

a logical of whether to perform subset quantile normalization after the GC content linear effect correction; default = FALSE.

If qtnorm = TRUE, subset quantile normalization will be applied within the IP and input samples seperately to account for the inherent differences between the marginal distributions of IP and input samples.

glm_type

a character speciefies the type of Generalized Linear Model (GLM) fitted for the purpose of statistical inference during peak calling, which can be one of the c("DESeq2", "NB", "Poisson").

DESeq2

Fit the GLM defined in the function DESeq, which is the NB GLM with regulated estimation of the overdispersion parameters.

NB

Fit the ordinary Negative Binomial (NB) GLM.

Poisson

Fit the Poisson GLM.

By default, the DESeq2 GLMs are fitted on the data set with > 1 biological replicates for both the IP and input samples, the Poisson GLM will be fitted otherwise.

LFC_shrinkage

a character for the method of emperical bayes shrinkage on log2FC, could be one of c("apeglm", "ashr", "Gaussian", "none"); Default = "apeglm".

see lfcShrink for more details; if "none" is selected, only the MLE will be returned.

export_results

a logical of whether to save the results on disk; default = TRUE.

export_format

a character vector for the format(s) of the result being exported, could be the subset of c("CSV","BED","RDS"); Default = c("CSV","BED","RDS").

table_style

a character for the style of the table being exported, could be one of c("bed", "granges"); Default = "bed".

save_plot_GC

a logical of whether to generate the plots for GC content bias assessment; default = TRUE.

save_plot_analysis

a logical of whether to generate the plots for genomic analysis on modification sites; default = FALSE.

save_plot_name

a character for the name of the plots being saved; Default = "Plot".

save_dir

a character for the name of the directory being saved; Default = "exomePeak2_output".

peak_calling_mode

a character specifies the scope of peak calling on genome, can be one of c("exon", "full_transcript", "whole_genome"); Default = "exon".

exon

generate sliding windows on exon regions.

full_transcript

generate sliding windows on the full transcripts (include both introns and exons).

whole_genome

generate sliding windows on the whole genome (include introns, exons, and the intergenic regions).

P.S. The full transcript mode and the whole genome mode demand big memory size (> 4GB) for large genomes.

Details

exomePeak2 call RNA modification peaks and calculate peak statistics from BAM files of a MeRIP-seq experiment.

The transcript annotation (from either the TxDb object or the GFF file) should be provided to perform analysis on exons.

The BSgenome object is also required to perform the GC content bias adjustment. If the bsgenome and the genome arguments are not provided (= NULL), the downstream analysis will proceed without GC content bias corrections.

If the BAM files in treated samples are provided at the arguments bam_treated_ip and bam_treated_input, the statistics of differential modification analysis on peaks/sites will be reported.

Under default setting, exomePeak2 will save the results of (differential) modification analysis under a folder named by 'exomePeak2_output'. The results generated include a BED file and a CSV table that store the locations and statistics of the (differential) modified peaks/sites.

Value

a SummarizedExomePeak object.

See Also

exomePeakCalling, glmM, glmDM, normalizeGC, exportResults, plotLfcGC

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
#Specify File Directories

GENE_ANNO_GTF = system.file("extdata", "example.gtf", package="exomePeak2")

f1 = system.file("extdata", "IP1.bam", package="exomePeak2")
f2 = system.file("extdata", "IP2.bam", package="exomePeak2")
f3 = system.file("extdata", "IP3.bam", package="exomePeak2")
f4 = system.file("extdata", "IP4.bam", package="exomePeak2")
IP_BAM = c(f1,f2,f3,f4)
f1 = system.file("extdata", "Input1.bam", package="exomePeak2")
f2 = system.file("extdata", "Input2.bam", package="exomePeak2")
f3 = system.file("extdata", "Input3.bam", package="exomePeak2")
INPUT_BAM = c(f1,f2,f3)

# Peak Calling

sep <- exomePeak2(bam_ip = IP_BAM,
                  bam_input = INPUT_BAM,
                  gff_dir = GENE_ANNO_GTF,
                  genome = "hg19",
                  paired_end = FALSE)
sep

# Differential Modification Analysis on Modification Peaks (Comparison of Two Conditions)

f1 = system.file("extdata", "treated_IP1.bam", package="exomePeak2")
TREATED_IP_BAM = c(f1)
f1 = system.file("extdata", "treated_Input1.bam", package="exomePeak2")
TREATED_INPUT_BAM = c(f1)

sep <- exomePeak2(bam_ip = IP_BAM,
                  bam_input = INPUT_BAM,
                  bam_treated_input = TREATED_INPUT_BAM,
                  bam_treated_ip = TREATED_IP_BAM,
                  gff_dir = GENE_ANNO_GTF,
                  genome = "hg19",
                  paired_end = FALSE)
sep

# Modification Level Quantification on Single Based Modification Annotation

f2 = system.file("extdata", "mod_annot.rds", package="exomePeak2")

MOD_ANNO_GRANGE <- readRDS(f2)

sep <- exomePeak2(bam_ip = IP_BAM,
                  bam_input = INPUT_BAM,
                  gff_dir = GENE_ANNO_GTF,
                  genome = "hg19",
                  paired_end = FALSE,
                  mod_annot = MOD_ANNO_GRANGE)
sep

# Differential Modification Analysis on Single Based Modification Annotation

sep <- exomePeak2(bam_ip = IP_BAM,
                  bam_input = INPUT_BAM,
                  bam_treated_input = TREATED_INPUT_BAM,
                  bam_treated_ip = TREATED_IP_BAM,
                  gff_dir = GENE_ANNO_GTF,
                  genome = "hg19",
                  paired_end = FALSE,
                  mod_annot = MOD_ANNO_GRANGE)
sep

exomePeak2 documentation built on Nov. 8, 2020, 5:27 p.m.