R/ChipDataSet-generics.R

####------------------- predictTssOverlap -------------------####
#' predictTssOverlap
#'
#' The function classifies ChIP-seq peaks on gene associated and background using
#' classification model based on a logistic regression.
#'
#' @name predictTssOverlap
#' @docType methods
#'
#' @param object A \code{\link{ChipDataSet}} object.
#' @param feature \code{Character}. Specify feature(s) that are be used as
#'     predictors in the classification model. By default, all the features
#'     stored in the \code{\link{ChipDataSet}} object will be used.
#' @param p \code{Numeric}. The proportion of data that is used to train the
#'     model. Default: 0.8.
#'
#' @details
#'     In order to improve the accuracy of the model the data is internally
#'     partitioned into a training and testing data sets. A repeated 10-Fold
#'     Cross-Validation is used to calculate performance measures on the
#'     training data set and to prevent over-fitting.
#'
#'     The model fit and validation is internally acomplised by the functions
#'     implemented in the \href{http://topepo.github.io/caret/index.html}{caret}
#'     package.
#'
#' @return
#'     The slot \code{tssOverlapPrediction} of the provided
#'     \code{TranscriptionDataSet} object will be updated by the the following
#'     elements: 'predicted.tssOverlap', 'confusionMatrix', 'logFitSummary' and
#'     'roc'.
#'
#' @seealso
#'     \code{\link{ChipDataSet}}
#'     \code{\link{constructCDS}}
#'     \href{http://topepo.github.io/caret/index.html}{caret}
#'
#' @examples
#' ### Load ChipDataSet object
#' data(cds)
#'
#' predictTssOverlap(object = cds, feature = "pileup", p = 0.75)
#'
#' ### View a short summary of the gene associated peaks prediction
#' cds
#'
#' ### View peaks and associated prediction
#' getPeaks(cds)
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @rdname predictTssOverlap-methods
setGeneric(name = "predictTssOverlap",
           def = function(object, feature, p = 0.8)
                   standardGeneric("predictTssOverlap")
)


####------------------- predictStrand -------------------####
#' predictStrand
#'
#' The function evaluates transcription initiation within a peak region by
#' comparing RNA-seq read densities upstream and downstream of an empirically
#' determined transcription start sites. Putative transcription of both forward
#' and reverse genomic strands is tested and the results are stored with each
#' ChIP-seq peak.
#'
#' @name predictStrand
#' @docType methods
#'
#' @param cdsObj A \code{\link{ChipDataSet}} object.
#' @param tdsObj A \code{\link{TranscriptionDataSet}} object.
#' @param coverage.cutoff \code{Numeric}. A cutoff value to discard regions with
#'     the low fragments coverage, representing expression noise. By default,
#'     the value stored in the \code{coverageCutoff} slot of the supplied
#'     \code{TranscriptionDataSet} object is used. The optimal cutoff value can
#'     be calculated by \code{\link{estimateBackground}} function call.
#' @param quant.cutoff \code{Numeric}. A cutoff value for the cumulative
#'     distribution of the RNA-seq signal along the ChIP-seq peak region. Must
#'     be in a range (0, 1). For the details, see step 1 in the "Details"
#'     section below. Default: 0.1.
#' @param win.size \code{Numeric}. The size of the q1 and q2 regions
#'     flanking transcription start position at the 5' and 3', respectively.
#'     For the details, see step 2 in the "Details"  section below.
#'     Default: 2500.
#' @param prob.cutoff \code{Numeric}. A cutoff value for the probability of reads
#'     to be sampled from the q2 flanking region. If not supplied, the value
#'     estimated from the data will be used. Must be in a range (0, 1). For the
#'     details, see step 6 in the "Details" section below.
#'
#' @details
#'     RNA-seq data is incorporated to find direct evidence of active
#'     transcription from every putatively gene associated peak. In order to do
#'     this, we determine the 'strandedness' of the ChIP-seq peaks, using strand
#'      specific RNA-seq data. The following assumptions are made in order to
#'      retrieve the peak 'strandedness':
#'     \itemize{
#'         \item The putatively gene associated ChIP-seq peaks are commonly
#'             associated with transcription initiation.
#'
#'         \item This transcription initiation occurs within the ChIP peak region.
#'
#'         \item When a ChIP peak is associated with a transcription initiation
#'             event, we expect to see a strand-specific increase in RNA-seq
#'             fragment count downstream the transcription initiation site.
#'     }
#'
#'    Each peak in the data set is tested for association with transcription
#'    initiation on both strands of DNA. Steps 1-5 are performed for both
#'    forward and reverse DNA strand separately and step 6 combines the data
#'    from both strands. If the peak is identified as associated with the
#'    transcription on both strands, than it is considered to be a bidirectional.
#'
#'    ChIP peak 'strandedness' prediction steps:
#'     \enumerate{
#'         \item Identify a location within the ChIP-seq peak near the
#'         transcription start site. This is accomplished by calculating the
#'         cumulative distribution of RNA-seq fragments within a peak region.
#'         The position is determined where 100\% - 'quant.cutoff' * 100\% of
#'         RNA-seq fragments are located downstream. This approach performs well
#'         on both gene-poor and gene-dense regions where transcripts may overlap.
#'
#'         \item Two equally sized regions are defined (q1 and q2), flanking the
#'         position identified in (1) on both sides. RNA-seq fragments are
#'         counted in each region.
#'
#'         \item ChIP peaks with an RNA-seq fragment coverage below an estimated
#'         threshold are discarded from the analysis.
#'
#'         \item The probability is calculated for RNA-seq fragments to be
#'         sampled from either q1 or q2. Based on the assumptions we stated
#'         above, a ChIP peak that is associated with transcription initiation
#'         should have more reads in q2 (downstream of the transcription start
#'         position) compared to q1, and subsequently, the probability of a
#'         fragment being sampled from q2 would be higher.
#'
#'         \item ChIP-seq peaks are divided into gene associated and background
#'         based on the prediction.
#'
#'         \item Iteratively, the optimal P(q2) threshold is identified, which
#'         balances out the False Discovery Rate (FDR) and False Negative Rate
#'         (FNR). Peaks with the P(q2) exceeding the estimated threshold are
#'         considered to be associated with the transcription initiation event.
#'     }
#'
#' @return The slot \code{strandPrediction} of the provided
#'     \code{\link{ChipDataSet}} object will be updated by the the following
#'     elements: 'predicted.strand', 'probability.cutoff', 'results.plus' and
#'     'results.minus'.
#'
#' @seealso \code{\link{ChipDataSet}} \code{\link{constructCDS}}
#'
#' @examples
#' ### Load TranscriptionDataSet object
#' data(tds)
#'
#' ### Load ChipDataSet object
#' data(cds)
#'
#' ### Classify peaks on gene associated and background
#' predictTssOverlap(object = cds, feature = "pileup", p = 0.75)
#'
#' ### Predict peak 'strand'
#' predictStrand(cdsObj = cds, tdsObj = tds, coverage.cutoff = 5,
#' quant.cutoff = 0.1, win.size = 2500)
#'
#' ### View a short summary of the 'strand' prediction
#' cds
#'
#' ### View 'strand' prediction
#' getPeaks(cds)
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @rdname predictStrand-methods
setGeneric(name = "predictStrand",
           def = function(cdsObj, tdsObj, coverage.cutoff,
                          quant.cutoff = 0.1, win.size = 2500, prob.cutoff)
                   standardGeneric("predictStrand")
)


####------------------- addFeature -------------------####
#' addFeature
#'
#' Add feature(s) to the \code{\link{ChipDataSet}} object.
#'
#' @name addFeature
#' @docType methods
#'
#' @param object A \code{\link{ChipDataSet}} object.
#' @param feature \code{List}. User specified characteristics of the peaks.
#'
#' @return
#'     The user specified feature(s) will be added to the slot \code{features}
#'     of the provided  \code{\link{ChipDataSet}} object.
#'
#' @seealso \code{\link{constructCDS}} \code{\link{predictTssOverlap}}
#'
#' @examples
#' ### Load ChipDataSet object
#' data(cds)
#'
#' ### As an example create some fake data
#' N <- length(getPeaks(cds))
#' fake.data <- rnorm(n = N)
#'
#' addFeature(object = cds, feature = list(fake = fake.data))
#'
#' ### View newly added feature
#' getPeaks(cds)
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @rdname addFeature-methods
setGeneric(name = "addFeature",
           def = function(object, feature)
                   standardGeneric("addFeature")
)


####------------------- getPeaks -------------------####
#' getPeaks
#'
#' Retrieve ChIP-seq peak information from the \code{\link{ChipDataSet}} object.
#'
#' @name getPeaks
#' @docType methods
#'
#' @param object A \code{\link{ChipDataSet}} object.
#'
#' @return A \code{\link[GenomicRanges]{GRanges}} object.
#'
#' @seealso
#'     \code{\link{constructCDS}}
#'     \code{\link{predictTssOverlap}}
#'     \code{\link{predictStrand}}
#'
#' @examples
#' ### Load ChipDataSet object
#' data(cds)
#'
#' getPeaks(cds)
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @rdname getPeaks-methods
setGeneric(name = "getPeaks",
           def = function(object)
                   standardGeneric("getPeaks")
)


####------------------- getGenomicAnnot -------------------####
#' getGenomicAnnot
#'
#' Retrieve genomic distribution of ChIP-seq peaks at distinct genomic features
#' (exons, introns, TSSs, intergenic regions)
#'
#' @name getGenomicAnnot
#' @docType methods
#'
#' @param object A \code{\link{ChipDataSet}} object.
#'
#' @return
#'     A four column \code{Data.Frame}, storing information about observed and
#'     expected number of peaks at distinct genomic features.
#'
#' @details
#'     A simple quality check of the supplied ChIP-seq peaks can be performed by
#'     investigating their genomic distribution. Ideally, these peaks should
#'     demonstrate substantial enrichment at TSS regions. Enrichment of the
#'     peaks at a given genomic feature (e.g. TSS) is defined as the ratio
#'     between the observed and expected number of peaks. The expected number of
#'     peaks is calculated from the proportion of the genome covered by the
#'     given genomic feature.
#'
#' @seealso \code{\link{constructCDS}}
#'
#' @examples
#' ### Load ChipDataSet object
#' data(cds)
#'
#' getGenomicAnnot(cds)
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @rdname getGenomicAnnot-methods
setGeneric(name = "getGenomicAnnot",
           def = function(object)
                   standardGeneric("getGenomicAnnot")
)


####------------------- getConfusionMatrix -------------------####
#' getConfusionMatrix
#'
#' Retrieve a cross-tabulation of observed and predicted classes (prediction of
#' gene associated peaks) with associated statistics.
#'
#' @name getConfusionMatrix
#' @docType methods
#'
#' @param object A \code{\link{ChipDataSet}} object.
#'
#' @return
#'     An object of \code{Confusion Matrix} class. For the details see
#'     \href{http://topepo.github.io/caret/index.html}{caret} package.
#'
#' @seealso
#'     \code{\link{predictTssOverlap}} \code{\link[caret]{confusionMatrix}}
#'
#' @examples
#' ### Load ChipDataSet object
#' data(cds)
#'
#' getConfusionMatrix(cds)
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @rdname getConfusionMatrix-methods
setGeneric(name = "getConfusionMatrix",
           def = function(object)
                   standardGeneric("getConfusionMatrix")
)


####------------------- getPredictorSignificance -------------------####
#' getPredictorSignificance
#'
#' Retrieve significance of each predictor used in the classification model fit
#' (prediction of gene associated peaks).
#'
#' @name getPredictorSignificance
#' @docType methods
#'
#' @param object A \code{\link{ChipDataSet}} object.
#'
#' @return A vector of p-values.
#'
#' @seealso \code{\link{predictTssOverlap}}
#'
#' @examples
#' ### Load ChipDataSet object
#' data(cds)
#'
#' getPredictorSignificance(cds)
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @rdname getPredictorSignificance-methods
setGeneric(name = "getPredictorSignificance",
           def = function(object)
                   standardGeneric("getPredictorSignificance")
)


####------------------- getQuadProb -------------------####
#' getQuadProb
#'
#' Retrieve all internal calculations performed by \code{\link{predictStrand}}
#' function.
#'
#' @name getQuadProb
#' @docType methods
#'
#' @param object A \code{\link{ChipDataSet}} object.
#' @param strand \code{Character}. The strand to extract calculations for.
#'     One of ["+", "-"].
#'
#' @return
#'     A nine column \code{Data.Frame}, where each row corresponds to
#'     a ChIP-seq peak and each column keeps one of the intermediate calculations:
#'     \itemize{
#'         \item \code{max.cov} - maximum coverage of the RNA-seq fragments
#'             inside the peak region.
#'
#'         \item \code{pass.cov.treshold} - whether the \code{max.cov} exceeds
#'             the \code{coverage.cutoff}, either user defined or estimated from
#'             RNA-seq data by \code{\link{estimateBackground}} function call and
#'             stored in \code{\link{TranscriptionDataSet}} object.
#'
#'         \item \code{q1q2.sepline.coord} - genomic coordinate corresponding to
#'             the transcription start position inside the peak region.
#'
#'         \item \code{q1.coord} - genomic coordinates of q1.
#'
#'         \item \code{q2.coord} - genomic coordinates of q2.
#'
#'         \item \code{q1.count} - total number of fragments in q1.
#'
#'         \item \code{q2.count} - total number of fragments in q2.
#'
#'         \item \code{q1.prob} - probability of a fragment being sampled from
#'             the q1.
#'
#'         \item \code{q2.prob} - probability of a fragment being sampled from
#'            the q2.
#'     }
#'
#' @seealso \code{\link{predictStrand}}
#'
#' @examples
#' ### Load ChipDataSet object
#' data(cds)
#'
#' ### Load TranscriptionDataSet object
#' data(tds)
#'
#' head(getQuadProb(cds, strand = "+"))
#' head(getQuadProb(cds, strand = "-"))
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @rdname getQuadProb-methods
setGeneric(name = "getQuadProb",
           def = function(object, strand)
                   standardGeneric("getQuadProb")
)


####------------------- getProbTreshold -------------------####
#' getProbTreshold
#'
#' Retrieve estimated P(q2) threshold, used to select peaks with a putative
#' transcription initiation event.
#'
#' @name getProbTreshold
#' @docType methods
#'
#' @param object A \code{\link{ChipDataSet}} object.
#'
#' @return Estimated P(q2) threshold.
#'
#' @seealso \code{\link{predictStrand}}
#'
#' @examples
#' ### Load ChipDataSet object
#' data(cds)
#'
#' getProbTreshold(cds)
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @rdname getProbTreshold-methods
setGeneric(name = "getProbTreshold",
           def = function(object)
                   standardGeneric("getProbTreshold")
)


####------------------- plotGenomicAnnot -------------------####
#' plotGenomicAnnot
#'
#' Visualize genomic distribution of ChIP-seq peaks.
#'
#' @name plotGenomicAnnot
#' @docType methods
#'
#' @param object A \code{\link{ChipDataSet}} object.
#' @param plot.type \code{Character}. One of ("distrib", "enrich").
#'     Default: "enrich".
#' @param xlab \code{Character}. Title of the x-axis.
#' @param ylab \code{Character}. Title of the y-axis.
#' @param color A character vector of length four, specifying colors for
#'     distinct genomic features (TSSs, exons, introns, intergenic regions).
#'
#' @details
#'     Genomic distribution of the peaks can be visualized in two ways, either
#'     by observing the total number of peaks overlapping given genomic feature
#'     or by looking at the enrichment levels.
#'
#' @return ggplot2 object.
#'
#' @seealso \code{\link{constructCDS}}
#'
#' @examples
#' ### Load ChipDataSet object
#' data(cds)
#'
#' ### Plot the total number of peaks overlapping distinct genomic features
#' plotGenomicAnnot(object = cds, plot.type = "distrib")
#'
#' ### Plot enrichment of the peaks at a given genomic feature (e.g. TSS)
#' plotGenomicAnnot(object = cds, plot.type = "enrich")
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @import ggplot2
#' @rdname plotGenomicAnnot-methods
setGeneric(name = "plotGenomicAnnot",
           def = function(object, plot.type = c("distrib", "enrich"), xlab, ylab, color)
                   standardGeneric("plotGenomicAnnot")
)


####------------------- plotFeatures -------------------####
#' plotFeatures
#'
#' Visualize the relations between predictors and response variable
#' ('tssOverlap').
#'
#' @name plotFeatures
#' @docType methods
#'
#' @param object A \code{\link{ChipDataSet}} object.
#' @param plot.type One of ["box", "density"]. Default: "box"
#' @param feature Feature to plot. By default, all the features are plotted.
#' @param ncol \code{Numeric}. Arrange individual plots in columns. By default,
#'     the number of columns correspond to the number of features used for
#'     plotting.
#' @param xlab \code{Character}. Title of the x-axis
#' @param ylab \code{Character}. Title of the y-axis.
#' @param color A character vector of length two. Default: ["#E41A1C","#377EB8"].
#' @param alpha Color transparency. In a range [0, 1]. Default: 1.
#'
#' @details
#'    In order to discriminate between functional or gene associated peaks and
#'    non-functional or background peaks, each peak in the data set is
#'    characterized by several features. Moreover, the user might supply her/his
#'    own list of features with the \code{\link{addFeature}}. Prior to fitting
#'    the logistic model, the relations between predictors and response variable
#'    (tssOverlap) can be explored with \code{plotFeatures}. Based on the plots,
#'    poor predictors can be excluded from the analysis to improve the model
#'    fit.
#'
#' @return ggplot2 object.
#'
#' @seealso \code{\link{constructCDS}}
#'
#' @examples
#' ### Load ChipDataSet object
#' data(cds)
#'
#' ### The data can be plotted in two ways
#' ### As a boxplot
#' plotFeatures(object = cds, plot.type = "box")
#'
#' ### Or as a density plot
#' plotFeatures(object = cds, plot.type = "density")
#'
#' ### Additionally, only the subset of features can be shown
#' plotFeatures(object = cds, plot.type = "box", feature = c("pileup", "length"))
#'
#' ### The position of the graphs on the plot, can be adjusted by 'ncol' argument
#' plotFeatures(object = cds, plot.type = "box", ncol = 2)
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @rdname plotFeatures-methods
setGeneric(name = "plotFeatures",
           def = function(object,
                          plot.type = c("box", "density"),
                          feature,
                          ncol,
                          xlab,
                          ylab,
                          color = c("#E41A1C","#377EB8"),
                          alpha = 1)
                   standardGeneric("plotFeatures")
)


####------------------- plotROC -------------------####
#' plotROC
#'
#' Visualize the performance of the classification model fit (prediction of the
#' gene associated peaks).
#'
#' @name plotROC
#' @docType methods
#'
#' @param object A \code{\link{ChipDataSet}} object.
#' @param ... Further arguments passed to plot.
#'
#' @details
#'     The \code{plotROC} is a simple wrapper for the \code{plot} function
#'     implemented in \code{\link{pROC}} package.
#'
#'     The \href{https://en.wikipedia.org/wiki/Receiver_operating_characteristic}{ROC}
#'     curve is created by plotting the true positive rate (sensitivity) against
#'     the false positive rate (1 - specificity) at various threshold settings.
#'     The closer the curve follows the left-hand border and then the top border
#'     of the ROC space, the more accurate the test. The area under the curve
#'     (AUC) is a measure of accuracy.
#'
#'
#' @return ROC plot.
#'
#' @seealso \code{\link{predictTssOverlap}}
#'
#' @examples
#' ### Load ChipDataSet object
#' data(cds)
#'
#' ### Classify peaks on gene associated and background
#' predictTssOverlap(object = cds, feature = "pileup", p = 0.75)
#'
#' ### Visualize the performance of the model fit
#' plotROC(object = cds, col = "red3", grid = TRUE, auc.polygon = TRUE)
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @rdname plotROC-methods
setGeneric(name = "plotROC",
           def = function(object, ...)
                   standardGeneric("plotROC")
)


####------------------- peaksToBed -------------------####
#' peaksToBed
#'
#' A convenient way to explore output of the predictions made on the ChIP peaks
#' is to visualize them in the
#' \href{https://genome.ucsc.edu/}{UCSC genome browser}. The peaksToBed function
#' returns a file in
#' \href{https://genome.ucsc.edu/FAQ/FAQformat.html#format1}{BED} format, which
#' can be uploaded directly to the genome browser. To improve the visual
#' perception, peaks are color-coded by the predicted strand.
#'
#' @name peaksToBed
#' @docType methods
#'
#' @param object A \code{\link{ChipDataSet}} object.
#' @param file \code{Character}. A file name.
#' @param strand.pred.color \code{Character}. A vector of length four, specifying
#'     colors of the predicted peak strand. The order of colors corresponds
#'     to 1) plus strand, 2) minus strand, 3) bideractional, 4) strand not
#'     determined. Default: c("blue", "red", "green4", "black").
#' @param gene.associated.peaks \code{Logical}. Whether to return gene
#'     associated peaks (based on the prediction) only. Default: TRUE
#'
#' @return A file in the BED format.
#'
#' @seealso
#'     \code{\link{constructCDS}}
#'     \code{\link{predictTssOverlap}}
#'     \code{\link{predictStrand}}
#'     \href{https://genome.ucsc.edu/}{UCSC genome browser}
#'     \href{https://genome.ucsc.edu/FAQ/FAQformat.html#format1}{BED}
#'
#' @examples
#' ### Load ChipDataSet object
#' data(cds)
#'
#' ### Load TranscriptionDataSet object
#' data(tds)
#'
#' ### Classify peaks on gene associated and background
#' predictTssOverlap(object = cds, feature = "pileup", p = 0.75)
#'
#' ### Predict peak 'strandedness'
#' predictStrand(cdsObj = cds, tdsObj = tds, coverage.cutoff = 5,
#' quant.cutoff = 0.1, win.size = 2500)
#'
#' # peaksToBed(object = cds, file = "peaks.bed", gene.associated.peaks = TRUE)
#'
#' @author Armen R. Karapetyan
#'
#' @export
#' @rdname peaksToBed-methods
setGeneric(name = "peaksToBed",
           def = function(object, file,
                          strand.pred.color = c("blue", "red", "green4", "black"),
                          gene.associated.peaks = TRUE)
                   standardGeneric("peaksToBed")
)

Try the transcriptR package in your browser

Any scripts or data that you put into this service are public.

transcriptR documentation built on Nov. 8, 2020, 8:12 p.m.