R/exomePeak2.R
In exomePeak2: Bias Awared Peak Calling and Quantification for MeRIP-Seq

Documented in exomePeak2

#' @title Peak Calling and Peak Statistics Quantification on MeRIP-seq Dataset.
#'
#' @description \code{exomePeak2} conducts peak calling and peak statistics calculation from \strong{BAM} files of a MeRIP-seq experiment.
#' The function integrates the following steps of a standard MeRIP-seq data analysis pipeline.
#'
#' \enumerate{
#' \item Check and index the BAM files with \code{\link{scanMeripBAM}}.
#' \item Call modification peaks on exons with \code{\link{exomePeakCalling}}.
#' \item Calculate offset factors of GC content biases with \code{\link{normalizeGC}}.
#' \item Calculate (differential) modification statistics with the generalized linear model (GLM) using \code{\link{glmM}} or \code{\link{glmDM}}.
#' \item Export the peaks/sites statistics with user defined format by \code{\link{exportResults}}.
#' }
#'
#' See the help pages of the corresponding functions for the complete documentation.
#'
#' @details \code{\link{exomePeak2}} call RNA modification peaks and calculate peak statistics from \strong{BAM} files of a MeRIP-seq experiment.
#'
#' The transcript annotation (from either the \code{\link{TxDb}} object or the \strong{GFF} file) should be provided to perform analysis on exons.
#'
#' The \code{\link{BSgenome}} object is also required to perform the GC content bias adjustment.
#' If the \code{bsgenome} and the \code{genome} arguments are not provided (\code{= NULL}), the downstream analysis will proceed without GC content bias corrections.
#'
#' If the \strong{BAM} files in treated samples are provided at the arguments \code{bam_treated_ip} and \code{bam_treated_input}, the statistics of differential modification analysis on peaks/sites will be reported.
#'
#' Under the default setting, \code{\link{exomePeak2}} will save the results of (differential) modification analysis under a folder named \code{'exomePeak2_output'}.
#' The results generated include a \strong{BED} file and a \strong{CSV} table that stores the locations and statistics of the (differential) modified peaks/sites.
#'
#' @param bam_ip a \code{character} vector for the BAM file directories of the (control) IP samples.
#' @param bam_input a \code{character} vector for the BAM file directories of the (control) input samples.
#' @param bam_treated_ip a \code{character} vector for the BAM file directories of the treated IP samples.
#' @param bam_treated_input a \code{character} vector for the BAM file directories of the treated input samples.
#'
#' If the bam files do not contain treatment group, user should only fill the arguments of \code{BAM_ip} and \code{BAM_input}.
#'
#' @param paired_end a \code{logical} of whether the data comes from the Paired-End Library, \code{TRUE} if the data is Paired-End sequencing; default \code{FALSE}.
#' @param library_type a \code{character} specifying the protocal type of the RNA-seq library, can be one in \code{c("unstranded", "1st_strand", "2nd_strand")}; default \code{= "unstranded"}.
#'
#' \describe{
#' \item{\strong{unstranded}}{The randomly primed RNA-seq library type, i.e. both the strands generated during the first and the second strand sythesis are sequenced; example: Standard Illumina.}
#' \item{\strong{1st_strand}}{The first strand-specific RNA-seq library, only the strand generated during the first strand sythesis is sequenced; examples: dUTP, NSR, NNSR.}
#' \item{\strong{2nd_strand}}{The second strand-specific RNA-seq library, only the strand generated during the second strand sythesis is sequenced; examples: Ligation, Standard SOLiD.}
#' }
#'
#' @param txdb a \code{\link{TxDb}} object for the transcript annotation.
#'
#' If the \code{TxDb} object is not available, it could be a \code{character} string of the UCSC genome name which is acceptable by \code{\link{makeTxDbFromUCSC}}. For example: \code{"hg19"}.
#'
#' @param bsgenome a \code{\link{BSgenome}} object for the genome sequence information.
#'
#' If the \code{BSgenome} object is not available, it could be a \code{character} string of the UCSC genome name which is acceptable by \code{\link{getBSgenome}}. For example: \code{"hg19"}.
#'
#' @param genome a \code{character} string of the UCSC genome name which is acceptable by \code{\link{getBSgenome}} or/and \code{\link{makeTxDbFromUCSC}}. For example: \code{"hg19"}.
#'
#' By default, the argument = NA, it should be provided when the \code{BSgenome} or/and the \code{TxDb} object are not available.
#'
#' @param gff_dir optional, a \code{character} which specifies the directory toward a gene annotation GFF/GTF file, it is applied when the \code{TxDb} object is not available, default \code{= NULL}.
#'
#' @param fragment_length a positive integer number for the expected fragment length in nucleotides; default \code{= 100}.
#'
#' @param binding_length a positive integer number for the expected binding length of the anti-modification antibody in IP samples; default \code{= 25}.
#'
#' @param step_length a positive integer number for the shift distances of the sliding window; default \code{= binding_length}.
#'
#' @param peak_width a \code{numeric} value for the minimum width of the merged peaks; default \code{= fragment_length} .
#'
#' @param glm_type a \code{character} speciefies the type of Generalized Linear Model (GLM) fitted for the purpose of statistical inference during peak calling, which can be one of the \code{c("DESeq2", "NB", "Poisson")}.
#'
#' \describe{
#' \item{\strong{\code{DESeq2}}}{Fit the GLM defined in the function \code{\link{DESeq}}, which is the NB GLM with regulated estimation of the overdispersion parameters.}
#'
#' \item{\strong{\code{NB}}}{Fit the ordinary Negative Binomial (NB) GLM.}
#'
#' \item{\strong{\code{Poisson}}}{Fit the Poisson GLM.}
#' }
#'
#' By default, the DESeq2 GLMs are fitted on the data set with > 1 biological replicates for both the IP and input samples, the Poisson GLM will be fitted otherwise.
#'
#' @param pc_count_cutoff a \code{numeric} value for the cutoff on average window's reads count in peak calling; default \code{= 5}.
#'
#' @param bg_count_cutoff a \code{numeric} value for the cutoff on average window's reads count in background identification; default \code{= 50}.
#'
#' @param p_cutoff a \code{numeric} value for the cutoff on p values in peak calling; default \code{= 1e-05}.
#'
#' @param p_adj_cutoff a \code{numeric} value for the cutoff on Benjamini Hochberg adjusted p values in peak calling; default \code{= NULL}.
#'
#' @param log2FC_cutoff a \code{numeric} value for the cutoff on log2 IP over input fold changes in peak calling; default \code{= 1}.
#'
#' @param consistent_peak a \code{logical} of whether the positive peaks returned should be consistent among all the replicates; default \code{= FALSE}.
#'
#' @param consistent_log2FC_cutoff a \code{numeric} for the modification log2 fold changes cutoff in the peak consisency calculation; default = 1.
#'
#' @param consistent_fdr_cutoff a \code{numeric} for the BH adjusted C-test p values cutoff in the peak consistency calculation; default { = 0.05}. Check \code{\link{ctest}}.
#'
#' @param alpha a \code{numeric} for the binomial quantile used in the consitent peak filter; default\code{ = 0.05}.
#'
#' @param p0 a \code{numeric} for the binomial proportion parameter used in the consistent peak filter; default \code{= 0.8}.
#'
#' For a peak to be consistently methylated, the minimum number of significant enriched replicate pairs is defined as the 1 - alpha quantile of a binomial distribution with p = p0 and N = number of possible pairs between replicates.
#'
#' The consistency defined in this way is equivalent to the rejection of an exact binomial test with null hypothesis of p < p0 and N = replicates number of IP * replicates number of input.
#'
#' @param parallel a \code{logical} value indicating whether to use parallel computation, it will require > 16GB memory if \code{parallel = TRUE}; default \code{= FALSE}.
#'
#' @param mod_annot a \code{\link{GRanges}} object for user provided single based RNA modification annotation.
#'
#' If user provides the single based RNA modification annotation, exomePeak2 will perform reads count and (differential) modification quantification on the provided annotation.
#'
#' The single base annotation will be flanked by length = floor(fragment_length - binding_length/2) to account for the fragment length of the sequencing library.
#'
#' @param manual_background  a \code{\link{GRanges}} object for the user provided unmodified background; default \code{= NULL}.
#'
#' @param correct_GC_bg a \code{logical} value of whether to estimate the GC content linear effect on background regions during modification level quantification; default \code{= TRUE}.
#'
#' If \code{= TRUE}, it could lead to a more accurate estimation of GC content bias for the RNA modifications that are highly biologically related to GC content.
#'
#' @param qtnorm a \code{logical} of whether to perform subset quantile normalization after the GC content linear effect correction; default \code{= FALSE}.
#'
#' If \code{qtnorm = TRUE}, subset quantile normalization will be applied within the IP and input samples seperately to account for the inherent differences between the marginal distributions of IP and input samples.
#'
#' @param background_method a \code{character} specifies the method of finding background regions for peak detection, i.e. to identify the windows without modification signal. It could be one of "Gaussian_mixture", "m6Aseq_prior", "manual", and "all";  default \code{= "all"}.
#'
#' In order to accurately account for the technical variations, it is often important to estimate the sequencing depth and GC content linear effects on windows without modification signals.
#'
#' The following methods are supported in \code{ExomePeak2} to differentiate the no modification background windows from the modification containig windows.
#'
#' \describe{
#'
#'  \item{\strong{\code{all}}}{Use all windows as the background.
#'  This choise assumes no differences in the effects of technical features between the background and the modification regions.}
#'
#'  \item{\strong{\code{Gaussian_mixture}}}{The background is identified by Multivariate Gaussian Mixture Model (MGMM) with 2 mixing components on the vectors containing methylation level estimates and GC content, the background regions are predicted by the Bayes Classifier on the learned GMM.}
#'
#'  \item{\strong{\code{m6Aseq_prior}}}{The background is identified by the prior knowledge of m6A topology, the windows that are not overlapped with long exons (exon length >= 400bp) and 5'UTR are treated as the background windows.
#'
#'  This type of background should not be used if the MeRIP-seq data is not targetting on m6A methylation.
#'
#'  }
#'
#'  \item{\strong{\code{manual}}}{The background regions are defined by the user manually at the argument \code{manual_background}.}
#'
#'  }
#'
#'
#' @param LFC_shrinkage a \code{character} for the method of emperical bayes shrinkage on log2FC, could be one of \code{c("apeglm", "ashr", "Gaussian", "none")}; Default \code{= "apeglm"}.
#'
#' see \code{\link{lfcShrink}} for more details; if "none" is selected, only the MLE will be returned.
#'
#' @param table_style a \code{character} for the style of the table being exported, could be one of \code{c("bed", "granges")}; Default \code{= "bed"}.
#'
#' @param export_results a \code{logical} of whether to save the results on disk; default \code{= TRUE}.
#'
#' @param export_format a \code{character} vector for the format(s) of the result being exported, could be the subset of \code{c("CSV","BED","RDS")}; Default \code{= c("CSV","BED","RDS")}.
#'
#' @param save_plot_GC a \code{logical} of whether to generate the plots for GC content bias assessment; default \code{= TRUE}.
#'
#' @param save_plot_analysis a \code{logical} of whether to generate the plots for genomic analysis on modification sites; default \code{= FALSE}.
#'
#' @param save_plot_name a \code{character} for the name of the plots being saved; Default \code{= "Plot"}.
#'
#' @param save_dir a \code{character} for the name of the directory being saved; Default \code{= "exomePeak2_output"}.
#'
#' @param peak_calling_mode a \code{character} specifies the scope of peak calling on genome, can be one of \code{c("exon", "full_transcript", "whole_genome")}; Default \code{= "exon"}.
#'
#' \describe{
#' \item{\strong{\code{exon}}}{generate sliding windows on exon regions.}
#'
#' \item{\strong{\code{full_transcript}}}{generate sliding windows on the full transcripts (include both introns and exons).}
#'
#' \item{\strong{\code{whole_genome}}}{generate sliding windows on the whole genome (include introns, exons, and the intergenic regions).}
#' }
#'
#' P.S. The full transcript mode and the whole genome mode demand big memory size (> 4GB) for large genomes.
#'
#' @return
#' a \code{\link{SummarizedExomePeak}} object.
#'
#' @examples
#'
#' #Specify File Directories
#'
#' GENE_ANNO_GTF = system.file("extdata", "example.gtf", package="exomePeak2")
#'
#' f1 = system.file("extdata", "IP1.bam", package="exomePeak2")
#' f2 = system.file("extdata", "IP2.bam", package="exomePeak2")
#' f3 = system.file("extdata", "IP3.bam", package="exomePeak2")
#' f4 = system.file("extdata", "IP4.bam", package="exomePeak2")
#' IP_BAM = c(f1,f2,f3,f4)
#' f1 = system.file("extdata", "Input1.bam", package="exomePeak2")
#' f2 = system.file("extdata", "Input2.bam", package="exomePeak2")
#' f3 = system.file("extdata", "Input3.bam", package="exomePeak2")
#' INPUT_BAM = c(f1,f2,f3)
#'
#' # Peak Calling
#'
#' sep <- exomePeak2(bam_ip = IP_BAM,
#'                   bam_input = INPUT_BAM,
#'                   gff_dir = GENE_ANNO_GTF,
#'                   genome = "hg19",
#'                   paired_end = FALSE)
#' sep
#'
#' # Differential Modification Analysis on Modification Peaks (Comparison of Two Conditions)
#'
#' f1 = system.file("extdata", "treated_IP1.bam", package="exomePeak2")
#' TREATED_IP_BAM = c(f1)
#' f1 = system.file("extdata", "treated_Input1.bam", package="exomePeak2")
#' TREATED_INPUT_BAM = c(f1)
#'
#' sep <- exomePeak2(bam_ip = IP_BAM,
#'                   bam_input = INPUT_BAM,
#'                   bam_treated_input = TREATED_INPUT_BAM,
#'                   bam_treated_ip = TREATED_IP_BAM,
#'                   gff_dir = GENE_ANNO_GTF,
#'                   genome = "hg19",
#'                   paired_end = FALSE)
#' sep
#'
#' # Modification Level Quantification on Single Based Modification Annotation
#'
#' f2 = system.file("extdata", "mod_annot.rds", package="exomePeak2")
#'
#' MOD_ANNO_GRANGE <- readRDS(f2)
#'
#' sep <- exomePeak2(bam_ip = IP_BAM,
#'                   bam_input = INPUT_BAM,
#'                   gff_dir = GENE_ANNO_GTF,
#'                   genome = "hg19",
#'                   paired_end = FALSE,
#'                   mod_annot = MOD_ANNO_GRANGE)
#' sep
#'
#' # Differential Modification Analysis on Single Based Modification Annotation
#'
#' sep <- exomePeak2(bam_ip = IP_BAM,
#'                   bam_input = INPUT_BAM,
#'                   bam_treated_input = TREATED_INPUT_BAM,
#'                   bam_treated_ip = TREATED_IP_BAM,
#'                   gff_dir = GENE_ANNO_GTF,
#'                   genome = "hg19",
#'                   paired_end = FALSE,
#'                   mod_annot = MOD_ANNO_GRANGE)
#' sep
#'
#'
#' @seealso \code{\link{exomePeakCalling}}, \code{\link{glmM}}, \code{\link{glmDM}}, \code{\link{normalizeGC}}, \code{\link{exportResults}}, \code{\link{plotLfcGC}}
#' @importFrom GenomicAlignments summarizeOverlaps
#' @importFrom Rsamtools asMates
#' @importFrom utils capture.output
#' @import GenomicRanges
#' @import SummarizedExperiment
#'
#' @docType methods
#'
#' @name exomePeak2
#'
#' @rdname exomePeak2
#'
#' @export
#'
exomePeak2 <- function(bam_ip = NULL,
                       bam_input = NULL,
                       bam_treated_ip = NULL,
                       bam_treated_input = NULL,
                       txdb = NULL,
                       bsgenome = NULL,
                       genome = NA,
                       gff_dir = NULL,
                       mod_annot = NULL,
                       paired_end = FALSE,
                       library_type = c("unstranded",
                                        "1st_strand",
                                        "2nd_strand"),
                       fragment_length = 100,
                       binding_length = 25,
                       step_length = binding_length,
                       peak_width = fragment_length/2,
                       pc_count_cutoff = 5,
                       bg_count_cutoff = 50,
                       p_cutoff = 1e-05,
                       p_adj_cutoff = NULL,
                       log2FC_cutoff = 1,
                       consistent_peak = FALSE,
                       consistent_log2FC_cutoff = 1,
                       consistent_fdr_cutoff = 0.05,
                       alpha = 0.05,
                       p0 = 0.8,
                       parallel = FALSE,
                       background_method = c("all",
                                             "Gaussian_mixture",
                                             "m6Aseq_prior",
                                             "manual"),
                       manual_background = NULL,
                       correct_GC_bg = TRUE,
                       qtnorm = FALSE,
                       glm_type = c("DESeq2",
                                    "Poisson",
                                    "NB"),
                       LFC_shrinkage = c("apeglm",
                                         "ashr",
                                         "Gaussian",
                                         "none"),
                       export_results = TRUE,
                       export_format = c("CSV",
                                         "BED",
                                         "RDS"),
                       table_style = c("bed",
                                       "granges"),
                       save_plot_GC = TRUE,
                       save_plot_analysis = FALSE,
                       save_plot_name = "",
                       save_dir = "exomePeak2_output",
                       peak_calling_mode = c("exon",
                                             "full_tx",
                                             "whole_genome")
                      ) {

LFC_shrinkage <- match.arg(LFC_shrinkage)

stopifnot(all(export_format %in% c("CSV", "BED", "RDS")))

table_style <- match.arg(table_style)

background_method <- match.arg(background_method)

glm_type <- match.arg(glm_type)

peak_calling_mode <- match.arg(peak_calling_mode)

if(!is.null(bam_treated_ip) & LFC_shrinkage == "Gaussian"){
  stop("Gaussian prior is not applicable for differential modification analysis.")
}

stopifnot(fragment_length > 0)

stopifnot(step_length > 0)

stopifnot(peak_width > 0)

stopifnot(log2FC_cutoff >= 0)

stopifnot(pc_count_cutoff >= 0)

if(!is.na(genome)) {
   if(!is(bsgenome,"BSgenome")) bsgenome = genome
   if(!is(txdb,"TxDb") & is.null(gff_dir)) txdb = genome
}

if(!is.null(bsgenome)) {
  bsgenome <- getBSgenome(bsgenome)
}

if(!is.null(gff_dir) & is.null(txdb)) {
  message("Make the TxDb object ... ", appendLF = FALSE)
  txdb <- suppressMessages( makeTxDbFromGFF(gff_dir) )
  message("OK")
} else {
  if (is.null(txdb)) {
    stop("Require argument of txdb or gff_dir for transcript annotation.")
  }

  if (!is(txdb, "TxDb")) {

    message("Make the TxDb object ... ", appendLF = FALSE)

    txdb <- suppressMessages( makeTxDbFromUCSC(txdb) )

    message("OK")

  }
}

if( peak_calling_mode != "exon") {
   txdb <- convertTxDb(txdb, type = peak_calling_mode)
}

argg <- as.list(environment()) #Get arguments information

if(length(argg$library_type) > 1) argg$library_type = argg$library_type[1]

if(length(argg$export_format) > 1) argg$export_format = argg$export_format[1]

argg <- lapply(argg, capture.output)

#Check the completeness of the genome annotation

merip_bam_lst <- scanMeripBAM(
bam_ip = bam_ip,
bam_input = bam_input,
bam_treated_ip = bam_treated_ip,
bam_treated_input = bam_treated_input,
paired_end = paired_end,
library_type = library_type
)

sep <- exomePeakCalling(merip_bams = merip_bam_lst,
                        txdb = txdb,
                        bsgenome = bsgenome,
                        gff_dir = gff_dir,
                        fragment_length = fragment_length,
                        binding_length = binding_length,
                        step_length = binding_length,
                        peak_width = fragment_length/2,
                        pc_count_cutoff = pc_count_cutoff,
                        bg_count_cutoff = bg_count_cutoff,
                        p_cutoff = p_cutoff,
                        p_adj_cutoff = p_adj_cutoff,
                        log2FC_cutoff = log2FC_cutoff,
                        consistent_peak = consistent_peak,
                        consistent_log2FC_cutoff = consistent_log2FC_cutoff,
                        consistent_fdr_cutoff = consistent_fdr_cutoff,
                        alpha = alpha,
                        p0 = p0,
                        parallel = parallel,
                        mod_annot = mod_annot,
                        background_method = background_method,
                        manual_background = manual_background,
                        correct_GC_bg = correct_GC_bg,
                        glm_type = glm_type,
                        qtnorm = qtnorm
)

sep <- estimateSeqDepth(sep)

if(!is.null(bsgenome)) {

  message("Estimate offsets of GC content biases on modification peaks/sites ... ", appendLF = FALSE)

  sep <- normalizeGC(sep,
                     feature = ifelse(correct_GC_bg,"Background","All"),
                     qtnorm = qtnorm)

  message("OK")

}

if(any(sep$design_Treatment)){
  message("Differential modification analysis with interactive GLM ... ", appendLF = FALSE)
  sep <- suppressMessages( glmDM(sep, LFC_shrinkage = LFC_shrinkage, glm_type = glm_type) )
  message("OK")
} else {
  sep <- glmM(sep, LFC_shrinkage = LFC_shrinkage, glm_type = glm_type)
}

if( export_results ) {

  exportResults(sep,
                format = export_format,
                inhibit_filter = (!is.null( mod_annot ))|(any(grepl("Diff",colnames(exomePeak2Results( sep ))))),
                table_style = table_style,
                save_dir = save_dir)

  writeLines( format_argg(argg) , file.path(save_dir, "RunInfo.txt") )

}

if( !is.null(bsgenome) & save_plot_GC ) {

suppressMessages(
plotLfcGC(sep = sep,
          save_pdf_prefix = save_plot_name,
          save_dir = save_dir)
)

}


if (save_plot_analysis) {

  # The function of Guitar plot is currently inhibited in exomePeak2, it will be recovered when the Guitar package has fixed some critical issues.
  #
  # if (!require(Guitar)) {
  #   warning(
  #     "the 'Guitar' package is not installed, skipping the distribution plot on travis coordinate."
  #   )
  # } else {
  #   message("Generating the distribution plot on travis coordinate ...")
  #   plotGuitar(
  #     sep,
  #     txdb = txdb,
  #     save_pdf_prefix = save_plot_name,
  #     save_dir = save_dir
  #   )
  # }

  plotExonLength(sep,
                 txdb = txdb,
                 save_pdf_prefix = save_plot_name,
                 save_dir = save_dir)

  plotReadsGC(sep = sep,
              save_pdf_prefix = save_plot_name,
              save_dir = save_dir)
}

return(sep)

}