knitr::opts_chunk$set(echo = TRUE)

1 Introduction to Trumpet

MeRIP-seq can be considered as a combination of ChIP-seq and RNA-seq, which is an enrichment-based approach that profiles the transcriptome-wide distribution of RNA modifications.

In MeRIP-seq, as in all other NGS applications, quality control is essential. However, to our knowledge, no metrics or software can be used to assess the MeRIP-seq experiment data recently. Thus, we developed an R package named Trumpet that come up with some metrics to assess the MeRIP-seq data quality easily. Trumpet, the R package for quality assessment of MeRIP-seq data, can also visualize the assessment report in HTML format.

The quality of MeRIP-seq data is assessed by the Trumpet package from mainly 3 perspectives, including (1) statistics of sequencing reads distribution with respect to different genomic regions; (2) the strength of immunoprecipitation signal evaluated by exome signal extraction scaling (ESES) and other statistical approaches; (3) comparison between different biological replicates and identify possible outliers.

2 Start using Trumpet

2.1 Input the MeRIP-seq data for quality assessment

The MeRIP-seq data for quality assessment should include the IP, Input samples(single condition) or IP, Input, contrast IP and contrast Input samples(different conditions). These samples should be BAM format files with alignment reads. Users also need to provide the transcriptome annotation file with a GTF file, a TxDb object or download the annotation file from UCSC automatically.

2.2 Get sequencing reads from input samples

The following part will extract the reads from input samples for subsequent analysis. It must import the paths of input sample data (e.g. the location of IP_BAM, Input_BAM) and the location of the transcriptome annotation file like gtf or txdb object.

library(Trumpet)
setwd(outparam_dir)
load("parameter.Rdata")
result<-.get_readscount(IP_BAM,Input_BAM,contrast_IP_BAM,contrast_Input_BAM,GENE_ANNO_GTF, GENOME,UCSC_TABLE_NAME,TXDB,sample_size)

3 Basic quality assessment module

3.1 Statistics of sequencing reads and Whole-transcriptome heterogeneity of reads coverage

In this section, we mainly evaluate reads alignment and their distribution, with which, we can inspect the sequencing depth of the input files, the reads alignment mapped to different genomic regions, such as exon, intron, 5'UTR, CDS and 3'UTR. It aims to get global information of samples. In order to show the heterogeneity of read coverage in the entire transcriptome due to mainly different level of gene expression, PCR artifacts and randomness, we use bin-based approach to check the percentage of regions covered different number of reads. The results are shown in the following four tables.

stats_result<-.stats_readscount(result,IP_BAM,Input_BAM,contrast_IP_BAM,contrast_Input_BAM)
transform_table <- stats_result[[1]]
reads_summary <- stats_result[[2]]
p_trans <- stats_result[[3]]
pt_trans <- stats_result[[4]]
p_bin <- stats_result[[5]]
pt_bin <- stats_result[[6]]
exonic_region <- rbind(p_trans, pt_trans)
exonic_bin <- rbind(p_bin, pt_bin)
kable(transform_table,caption='Transform the input files into appropriate labels',align = 'c')
kable(reads_summary, caption='Reads alignment summary',align = 'c')
kable(exonic_region, caption='Exonic regions of different reads coverage',align = 'c')
kable(exonic_bin, caption='The percentage of bins with different reads coverage',align='c')

Note: Column explanation in each table.

1. Transform the input files in appropriate labels

2. Read alignment summary

3. Whole-transcriptome heterogeneity of read coverage

3.2 Visualization of reads distribution

This part shows the reads coverage distribution over coding exons(CDS) region, 5'UTR region and 3'UTR region under different quantiles. It is known that, RNA m6A methylation is enriched near stop codon, and hopefully, consistent signal can be observed on aligned reads in IP samples. The quantiles (25%, 50% and 75%) of the standardized read coverage at different genomic regions (CDS, 5'UTR and 3'UTR) for genes detected are then plotted as shown in the following figures.

read_coverage<-.read_distribute(GENOME, UCSC_TABLE_NAME, GENE_ANNO_GTF, TXDB, result, IP_BAM, Input_BAM, contrast_IP_BAM, contrast_Input_BAM, condition1, condition2)

4 Quality assessment for the enrichment signal strength

4.1 Assessing immunoprecipitation efficiency with ESES

This module evaluates the enrichment of m6A signal in IP samples and identifies potentially failed immunoprecipitation experiments. The Trumpet package uses exome signal extraction scaling (ESES) metrics, which is modified based on the signal extraction scaling (SES) approach that previously used in assessing ChIP-seq data signals. ESES metrics can calculate the percent of region enriched with m6A signal and the scale factor showing the degree of difference between the IP and Input samples. It can also detect abnormal Input control samples. The assessment result will return figures and a table.

# Graphical summary of IP enrichment and the difference between Input samples using SES method
out_SES<-.ESES_evaluate(result,IP_BAM,Input_BAM,contrast_IP_BAM,contrast_Input_BAM,condition1,condition2)
## Tables show enrichment signal strength
kable(out_SES)

Note: The x-axis denotes the fraction of bins with signal reads in the whole transcriptome, and the y-axis denotes the fraction of the normalized cumulative reads count. The black straight line divides IP sample into two parts: the background (the left-hand side) and the region enriched with immunoprecipitation signal (the right-hand side).The fraction of cumulated signal (Faction of Reads) shows the fraction of reads captured in the corresponding regions, and the difference in cumulated signals between the IP and Input samples (or the Scale Factor) is shown by the two cross-points between the two curves (IP curve and Input curve) and the black straight line.

4.2 Assesse the enrichment of m6A signal with C-test

Trumpet package also relies on C-test, which was used in the exomePeak package to predict RNA methylation sites. It compares two Poisson means to detect the regions enriched with m6A signal at different levels. The fraction of bins that enriched in the IP sample with m6A signal at different fold enrichment thresholds are counted and plotted. It is then possible to differ between different samples.

# Show the figures of c-test result
out_chest<-.ctest_evluate(result,IP_BAM,Input_BAM,contrast_IP_BAM,contrast_Input_BAM,condition1,condition2)

5 Comparison between different biological replicates

5.1 Hierarchical clustering and PCA analysis of samples

Hierarchical clustering and PCA analysis are applied to all the samples to identify possible outliers and to assess the relative similarity between samples and groups (if applicable).

hcluster<-.hcluster(result,IP_BAM,Input_BAM,contrast_IP_BAM,contrast_Input_BAM)

5.2 Gene-specific heterogeneity of read coverage

Compared with Input sample, the aligned reads are not evenly distributed on the same gene in IP samples, which may be due to the enrichment signal around the true methylation sites as well as possible bias and artifacts. Thus, the heterogeneity of read coverage is also assessed in Trumpet package with the mean and standard deviation (SD) of read count in each gene of each sample. We then use local regression to fit a curve based on mean and SD for each gene. The value of mean and SD have been logarithm transformed.

out_ms<-.ms_relation(result,IP_BAM,Input_BAM,contrast_IP_BAM,contrast_Input_BAM,condition1,condition2)

6 Sample consistency and reproducibility

This metric is used to assess the degree of difference between multiple biological replicates.In order to eliminate the difference in sequencing depth between theses biological replicates, we should first normalize the read count of the each bin in each sample. Conveniently,the read count can be normalized in this metrics automatically. Assume we have multiple biological replicates obtained from the same experimental condition, the mean and standard deviation of the read count of the same bin across different samples can be calculated respectively. Then, it is possible to fit the two with a local regression curves to show the consistency between multiple samples, or compare the reproducibility of the samples obtained from different conditions.

two_ms<-.two_cond_ms(result,IP_BAM,Input_BAM,contrast_IP_BAM,contrast_Input_BAM,condition1,condition2)

7 Typical metric values obtained on published datasets.

Due to the lack of a gold standard dataset and the variable m6A methylation level in different cell types, tissues and conditions, it is difficult to assert whether a dataset is of reasonable quality or not even provided with those metrics. To solve this dilemma, we collected 61 MeRIP-seq IP samples together with 59 corresponding Input samples from recent high impact studies and calculated three metrics that are the scale factor, enriched region and signal read count as the positive control for reference. So, users can compare their result caculated by these metric to the result of known 61 MeRIP-seq IP samples in the three density plots, with the vetical lines in each plot indicating users' IP samples (the data to be evaluated). It can been shown in the following plots.

enrichmentdist <- .ESES_enrich_DF(result, IP_BAM, Input_BAM, contrast_IP_BAM, contrast_Input_BAM)

Note: In this part, we can give the reasonable range of IP samples' enrichment signal based on known experiment data set from different cell lines under different conditions. The reasonable ranges of the percentage of enriched region, scale factor and percentage of signal read count are 12%~25%, 0.08~0.3 and 87%~95%, respectively.



skyhorsetomoon/Trumpet documentation built on May 30, 2019, 3:04 a.m.