The *exomePeak2* user's guide

knitr::opts_chunk$set(dev="png",fig.show="hold",
               fig.width=8,fig.height=4.5,fig.align="center",
               message=FALSE,collapse=TRUE)

Introduction

exomePeak2 provides bias awared quantification and peak detection on Methylated RNA immunoprecipitation sequencing data (MeRIP-Seq). MeRIP-Seq is a commonly applied sequencing technology to measure the transcriptome-wide location and abundance of RNA modification sites under a given cellular condition. However, the quantification and peak calling in MeRIP-Seq are sensitive to PCR amplification bias which is prevalent in next generation sequencing (NGS) techniques. In addition, the RNA-Seq based count data exhibits biological variation in small reads count. exomePeak2 collectively address these challanges by introducing a rich set of robust data science models tailored for MeRIP-Seq. With exomePeak2, users can perform peak calling, modification site quantification, and differential analysis with a straightforward one-step function. Alternatively, users could customize methods for their own analysis through multi-step functions and diagnostic plots.

Installation

To install exomePeak2 from bioconductor, start R (version >"3.6") and enter:

if(!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("exomePeak2")

For order versions of R, please refer to the appropriate [Bioconductor release](https://www.bioconductor.org/about/release-announcements/}.

Peak Calling

For peak calling of MeRIP-Seq experiment, exomePeak2 demands the reads alignment results in BAM files. Users can specify the biological replicates of the IP and input samples by a character vector of the corresponding BAM directories at the arguments bam_ip and bam_input separately.

In the following example, the transcript annotation is provided using GFF files. Transcript annotation can also be provided by the TxDb object. exomePeak2 will automatically download the TxDb if the genome argument is filled with the corresponding UCSC genome name.

The genome sequence is required to conduct GC content bias correction. If the genome argument is missing ( = NULL ), exomPeak2 will perform peak calling without correcting the GC content bias.

library(exomePeak2)

set.seed(1)

GENE_ANNO_GTF = system.file("extdata", "example.gtf", package="exomePeak2")

f1 = system.file("extdata", "IP1.bam", package="exomePeak2")
f2 = system.file("extdata", "IP2.bam", package="exomePeak2")
f3 = system.file("extdata", "IP3.bam", package="exomePeak2")
f4 = system.file("extdata", "IP4.bam", package="exomePeak2")
IP_BAM = c(f1,f2,f3,f4)

f1 = system.file("extdata", "Input1.bam", package="exomePeak2")
f2 = system.file("extdata", "Input2.bam", package="exomePeak2")
f3 = system.file("extdata", "Input3.bam", package="exomePeak2")
INPUT_BAM = c(f1,f2,f3)

exomePeak2(bam_ip = IP_BAM,
           bam_input = INPUT_BAM,
           gff_dir = GENE_ANNO_GTF,
           genome = "hg19",
           paired_end = FALSE)

exomePeak2 will export the modification peaks in formats of BED file and CSV table, the data will be saved automatically under a folder named by exomePeak2_output.

The modification peak statistics are derived from the ${\beta}_{i,1}$ terms in the following linear regression design.

$$log2(Q_{i,j}) = {\beta}{i,0} + {\beta}{i,1}I(\rho(j)=IP) + t_{i,j}$$

$Q_{i,j}$ is the expected value of reads abundence of modification $i$ under sample $j$. ${\beta}{i,0}$ is the intercept coefficient, ${\beta}{i,1}$ is the coefficient for IP/input log2 fold change, $I(\rho(j)=IP)$ is the regression covariate that is the indicator variable for the sample $j$ being IP sample. $t_{i,j}$ is the regression offset that account for the sequencing depth variation and the GC content biases.

Under the default settings, the linear models fitted are the regularized GLM (Generalized Linear Model) of NB developed by DESeq2. If one of the IP and input group has no biological replicates, Poisson GLMs will be fitted to the modification peaks.

Explaination over the columns of the exported table:

Differential Modification Analysis

The code below could conduct differential modification analysis (Comparison of Two Conditions) on exon regions defined by the transcript annotation.

In differential modification mode, exomePeak2 will first perform Peak calling on exon regions using both the control and treated samples. Then, it will conduct the differential modification analysis on peaks reported from peak calling using an interactive GLM.

f1 = system.file("extdata", "treated_IP1.bam", package="exomePeak2")
TREATED_IP_BAM = c(f1)
f1 = system.file("extdata", "treated_Input1.bam", package="exomePeak2")
TREATED_INPUT_BAM = c(f1)

exomePeak2(bam_ip = IP_BAM,
           bam_input = INPUT_BAM,
           bam_treated_input = TREATED_INPUT_BAM,
           bam_treated_ip = TREATED_IP_BAM,
           gff_dir = GENE_ANNO_GTF,
           genome = "hg19",
           paired_end = FALSE)

In differential modification mode, exomePeak2 will export the differential modification peaks in formats of BED file and CSV table, the data will also be saved automatically under a folder named by exomePeak2_output.

The peak statistics in differential modification setting are derived from the interactive coefficient ${\beta}_{i,3}$ in the following regression design of the NB GLM:

$$log2(Q_{i,j}) = {\beta}{i,0} + {\beta}{i,1}I(\rho(j)=IP) + {\beta}{i,2}I(\rho(j)=Treatment) + {\beta}{i,3}I(\rho(j)=IP\&Treatment) + t_{i,j}$$

Explaination for the additional table columns:

Quantification and Statistical Analysis with Single Based Modification Annotation

exomePeak2 supports the modification quantification and differential modification analysis on single based modification annotation. The modification sites with single based resolution can provide a more accurate mapping of modification locations compared with the peaks called directly from the MeRIP-seq datasets.

Some of the datasets in epitranscriptomics have a single based resolution, e.x. Data generated by the m6A-CLIP-Seq or m6A-miCLIP-Seq techniques. Reads count on the single based modification sites could provide a more accurate and consistent quantification on MeRIP-Seq experiments due to the elimination of the technical variation introduced by the feature lengths.

exomePeak2 will automatically initiate the mode of single based modification quantification by providing a sigle based annotation file under the argument mod_annot.

The single based annotation information should be provided to the exomePeak2 function in the format of a GRanges object.

f2 = system.file("extdata", "mod_annot.rds", package="exomePeak2")

MOD_ANNO_GRANGE <- readRDS(f2)

exomePeak2(bam_ip = IP_BAM,
           bam_input = INPUT_BAM,
           gff_dir = GENE_ANNO_GTF,
           genome = "hg19",
           paired_end = FALSE,
           mod_annot = MOD_ANNO_GRANGE)

In this mode, exomePeak2 will export the analysis result also in formats of BED file and CSV table, while each row of the table corresponds to the sites of the annotation GRanges.

Peak Calling and Visualization in Multiple Steps

The exomePeak2 package can achieve peak calling and peak statistics calulation with multiple functions.

1. Check the bam files of MeRIP-seq data before peak calling.

MeRIP_Seq_Alignment <- scanMeripBAM(
                         bam_ip = IP_BAM,
                         bam_input = INPUT_BAM,
                         paired_end = FALSE
                        )

For MeRIP-seq experiment with interactive design (contain control and treatment groups), use the following code.

MeRIP_Seq_Alignment <- scanMeripBAM(
    bam_ip = IP_BAM,
    bam_input = INPUT_BAM,
    bam_treated_input = TREATED_INPUT_BAM,
    bam_treated_ip = TREATED_IP_BAM,
    paired_end = FALSE
  ) 

2. Conduct peak calling analysis on exons using the provided bam files.

SummarizedExomePeaks <- exomePeakCalling(merip_bams = MeRIP_Seq_Alignment,
                                         gff_dir = GENE_ANNO_GTF,
                                         genome = "hg19") 

Alternatively, use the following code to quantify MeRIP-seq data on single based modification annotation.

SummarizedExomePeaks <- exomePeakCalling(merip_bams = MeRIP_Seq_Alignment,
                                         gff_dir = GENE_ANNO_GTF,
                                         genome = "hg19",
                                         mod_annot = MOD_ANNO_GRANGE) 

3. Estimate size factors that are required for GC content bias correction.

SummarizedExomePeaks <- normalizeGC(SummarizedExomePeaks)

4. Report the statistics of modification peaks using Generalized Linear Model (GLM).

SummarizedExomePeaks <- glmM(SummarizedExomePeaks) 

Alternatively, If the treated IP and input bam files are provided, glmDM function could be used to conduct differential modification analysis on modification Peaks with interactive GLM.

SummarizedExomePeaks <- glmDM(SummarizedExomePeaks)

5. Generate the scatter plot between GC content and log2 Fold Change (LFC).

plotLfcGC(SummarizedExomePeaks) 

6. Generate the bar plot for the sequencing depth size factors.

plotSizeFactors(SummarizedExomePeaks)

7. Export the modification peaks and the peak statistics with user decided format.

exportResults(SummarizedExomePeaks, format = "BED") 

Contact

Please contact the maintainer of exomePeak2 if you have encountered any problems:

ZhenWei : zhen.wei01@xjtlu.edu.cn

Please visit the github page of exomePeak2:

https://github.com/ZhenWei10/exomePeak2

Session Info

sessionInfo()


Try the exomePeak2 package in your browser

Any scripts or data that you put into this service are public.

exomePeak2 documentation built on Nov. 8, 2020, 5:27 p.m.