R/metagene.R

#' A class to manage metagene analysis.
#'
#' This metagene2 class encapsulates all of the steps necessary to perform
#' metagene analyses, which are aggregations of coverages over multiple regions 
#' (genes) to reveal patterns that might not be apparent from looking at
#' individual regions. It will allow to load, convert and normalize bam 
#' alignments and regions files/data. Once the data is ready, the user can then
#' choose to produce metagene plots on the data or some subset of it.
#'
#' Most metagene analyses are a two-step affair:
#' \enumerate{
#'  \item Initialize the object using \code{mg = metagene2$new()}, specifying 
#'        which regions and bam files should be used for the analysis.
#'  \item Generate a metagene plot using \code{mg$produce_metagene}, specifying 
#'        any additional parameter (Number of bins, facetting variables, etc).
#' }
#'
#' The \code{metagene2} object will then internally chain all 6 required  
#' processing steps, updating its internal caches along the way:
#' \enumerate{
#'  \item Coverages are inferred from bam files (\code{metagene2$new}).
#'  \item Coverages from multiple bam files are grouped and normalized 
#'       (\code{mg$group_coverages}).
#'  \item Coverages are binned together (\code{mg$bin_coverages})
#'  \item Binned coverages are split according to the type of region they belong 
#'        to (\code{mg$split_coverages_by_regions}).
#'  \item Coverage means and confidence intervals are calculated for each 
#'        region * group combination (\code{mg$calculate_ci}).
#'  \item Metadata is added to the calculated coverages ( 
#'        \code{mg$add_metadata}).
#'  \item The metagene is plotted (\code{mg$plot}).
#' }
#'
#' Each of these steps has an associated function, which takes as input certain 
#' parameters of the metagene analysis and returns an intermediary structure
#' of interest (coverages, binned coverages, long-form data frame of confidence 
#' intervals, etc). Those are described below, in the "Processing methods" 
#' section.
#'
#' All processing methods automatically call previous processing steps if
#' those have not already been run. For example, there is no need to call
#' \code{mg$group_coverages()} before calling \code{mg$bin_coverages()}: the
#' metagene2 object will automatically detect that certain prerequisite steps 
#' have not yet been performed, and run them.
#' 
#' Additionally, when calling produce_metagene a second time to change certain
#' analysis parameters after generating an initial metagene plot, only the 
#' required caches are reset: all non-impacted aspects of the analysis are left 
#' untouched, decreasing processing time.
#'
#' For further examples, see the metagene2 vignette.
#'
#' @section Constructor:
#'
#' \strong{Usage:}
#'
#'  \code{mg <- metagene2$new(regions, bam_files, padding_size = 0,
#'                            cores = SerialParam(), verbose = FALSE,
#'                            force_seqlevels = FALSE, paired_end = FALSE,
#'                            assay = 'chipseq', strand_specific=FALSE,
#'                            paired_end_strand_mode=2,
#'                            region_mode="auto", region_metadata=NULL, 
#'                            extend_reads=0, invert_strand=FALSE, ...)}
#'
#' \strong{Description:}
#'
#' This method returns a new \code{metagene2} object. Upon initialization, a  
#' \code{metagene2} object calculates coverages over all given regions in the 
#' provided bam files. Any and all parameter associated with any of the 
#' processing steps can be initialized upon object construction. All analysis  
#' parameters that are not explicitly specified in the constructor call are  
#' initialized to sensible defaults.
#'
#' \strong{Parameters:}
#' \describe{
#'    \item{regions}{A description of all regions over which metagenes will be calculated.
#'
#'                   When \code{region_mode} is "separate", those should be provided using
#'                   a \code{GRanges} object representing all individual, contiguous regions
#'                   to be examined.
#'
#'                   When \code{region_mode} is "stitch", those should be provided using a 
#'                   \code{GRangesList} object where each individual \code{GRanges} element
#'                   represents a set of regions to be stitched together.
#'
#'                   As a convenience, in "separate" mode, \code{metagene2} will convert any
#'                   passed in \code{GRangesList} into an unlisted \code{GRanges} with an 
#'                   additional \code{region_name} metadata column containing 
#'                   the name of the \code{GRangesList} element it was extracted 
#'                   from.
#'
#'                   Also as a convenience, \code{regions} can also be a \code{character} 
#'                   \code{vector} of filenames, which are then imported into a GRangesList.
#'                   Supported file formats are BED, narrowPeak, broadPeak, gff and gtf.}
#'    \item{bam_files}{A \code{vector} of BAM filenames. The BAM files must be
#'                    indexed. i.e.: if a file is named file.bam, there must
#'                    be a file named file.bam.bai or file.bai in the same 
#'                    directory. If \code{bam_files} is a named vector, then the provided names
#'                    can be used downstream to refer to those bam files. If no
#'                    names are provided, \code{metagene2} will try to infer appropriate ones.}
#'    \item{assay}{\code{'chipseq'}, \code{'rnaseq'} or NULL. If non-NULL, metagene will
#'                 set other parameters, such as region_mode and strand_specific, to logical
#'                 values for the given assay. Default: \code{'chipseq'}}
#'    \item{region_mode}{Set the way the \code{regions} parameter is interpreted. Can be 
#'                       \code{'separate'}, \code{'stitch'} or \code{'auto'}. In separate mode,
#'                       \code{regions} is expected to be a GRanges defining individual, contiguous regions. In
#'                       \code{'stitch'} mode, \code{regions} is expected to be a GRangesList 
#'                       where each \code{GRanges} element represents a set of regions to be 
#'                       stitched together and treated as a single logical region. If \code{'auto'}
#'                       then a logical value is inferred from the \code{assay} parameter.
#'                       Default: \code{'auto'}}
#'    \item{region_metadata}{A data-frame of metadata to be associated with the elements of \code{regions}.
#'                           It must contain has many rows as there are elements in \code{regions}. If 
#'                           \code{region_metadata} is NULL but \code{regions} has an mcols element,
#'                           then it is used.}
#'    \item{padding_size}{The provided \code{regions} will be extended on each side by the
#'                        value of this parameter. The padding_size must be a
#'                        non-negative integer. Default = 0.}
#'    \item{cores}{The number of cores available to parallelize the analysis.
#'                Either a positive integer or a \code{BiocParallelParam}.
#'                Default: \code{SerialParam()}.}
#'    \item{verbose}{Print progression of the analysis. A logical constant.
#'                    Default: \code{FALSE}.}
#'    \item{force_seqlevels}{If \code{TRUE}, remove regions that are not found
#'                in bam file header. Default: \code{FALSE}. \code{TRUE} and \code{FALSE}
#'                respectively correspond to pruning.mode = "coarse"
#'                and "error" in ?seqinfo.}
#'    \item{paired_end}{Set this to \code{TRUE} if the provided bam files
#'                      describe paired-end reads. If \code{FALSE}, single-ended
#'                      data are expected. Default: \code{FALSE}}
#'    \item{strand_specific}{If \code{TRUE}, only reads which align to the same 
#'                           strand as those specified in \code{regions} will
#'                           count toward coverage for that region. Useful for RNA-seq
#'                           profiles generated from strand-specific libraries, such
#'                           as Illumina TruSeq. Default: \code{'FALSE'}}
#'    \item{paired_end_strand_mode}{\code{'1'} or \code{'2'}. In paired-end mode,
#'                                  indicates which read in a pair sets the pair's strand.
#'                                  If \code{1}, this is the first read (This should be used
#'                                  with directional protocols such as Directional Illumina 
#'                                  (Ligation) or Standard SOLiD).
#'                                  If \code{2}, this is the second read (This should be used
#'                                  with directional protocols such as dUTP, NSR, NNSR, 
#'                                  or Illumina stranded TruSeq PE).
#'                                  Ignored if either paired_end or strand_specific is FALSE.
#'                                  Default: \code{'2'}}
#'    \item{extend_reads}{Extend individual reads to have a minimum length equal to this parameter.
#'                        When set to 0, no read extension occurs. This is useful for single-end
#'                        chip-seq experiments, where the length of the captured fragment is usually
#'                        longer than the sequenced read.}
#'    \item{invert_strand}{If \code{TRUE}, coverages for the given regions will be inferred 
#'                         from the coverage on the strand opposite theirs. Useful
#'                         for single-end stranded experiments which use cDNA.
#'                         This parameter is ignored if strand-specific is \code{FALSE}.}
#'    \item{...}{Additional parameters for the metagene analysis. See \code{produce_metagene}
#'               for a list of possible parameters.}
#' }
#'
#'    \code{metagene2$new} returns a \code{metagene2} object that contains the
#'        coverages for every BAM files in the regions from the \code{regions}
#'        parameter.
#'
#' @return
#' \code{metagene2$new} returns a \code{metagene2} object which contains the
#' normalized coverage values for every regions in all specified BAM files.
#'
#' @section produce_metagene():
#'
#' \strong{Usage:}
#'
#' \code{mg$produce_metagene(...)}
#'
#' \strong{Description:}
#'
#' \code{produce_metagene} is the workhorse method of the metagene2 object.
#' This method performs all of the necessary analysis steps for the production
#' of the metagene plot, and returns that plot. Any and all parameters of the 
#' metagene analysis, as documented in the individual processing steps, can be 
#' passed to \code{produce_metagene}. The metagene2 object will then determines
#' which intermediate caches would be affected by changes to those parameters,
#' invalidate them, and rerun all steps up to the plotting. This makes 
#' \code{produce_metagene} ideal for fast, iterative takes on the data.
#'
#' Below we present those parameters and a brief description of their usage.
#' Please refer to the affected processing step for a more in-depth explanation
#' of each parameter.
#'
#' \strong{Parameters:}
#' \describe{
#'    \item{design}{A \code{data.frame} that describes the grouping of the bam files
#'            into design groups. By default, each bam file is its own design group.
#'            See \code{group_coverages}.}
#'    \item{normalization}{The algorithm to use to normalize coverages,
#'                        \code{NULL} (no normalization) or "RPM". By default,
#'                        no normalization occurs. See \code{group_coverages}.}
#'    \item{design_filter}{Indices indicating which subset of design groups should 
#'                         be included in the analysis. By default, all design  
#'                         groups/bam files are included. See 
#'                         \code{group_coverages}.}
#'    \item{bin_count}{The number of bins regions should be split into. Defaults
#'                     to 100. See \code{bin_coverages}.}
#'    \item{region_filter}{The subset of regions to be kept for the analysis.
#'                         By default, all regions are kept. See \code{bin_coverages}}
#'    \item{split_by}{Which metadata columns should we use to split the set of 
#'                    regions into subset of interests? Defaults to "region_name",
#'                    an automatically added column. 
#'                    See \code{split_coverages_by_regions}.}
#'    \item{alpha}{The alpha level of the confidence interval estimates.
#'                 Defaults to 0.05. See \code{calculate_ci}.}
#'    \item{sample_count}{The number of draws to perform in the bootstrap
#'                        calculations used to calculate the confidence inteval.  
#'                        Defaults to 1000. See \code{calculate_ci}}
#'    \item{resampling_strategy}{The resampling strategy to be used when performing the
#'                               bootstrap analysis, which can be either \code{'profile'}
#'                               or \code{'bin'}. Defaults to \code{'bin'}. See
#'                               \code{calculate_ci}.}
#'    \item{design_metadata}{A data-frame containing metadata for the design groups.
#'                           By default, no metadata is associated. See 
#'                           \code{add_metadata}.}
#'    \item{title}{A title to add to the graph. See \code{plot}.}
#'    \item{x_label}{X-axis label for the metagene plot. See \code{plot}.}
#'    \item{facet_by}{A formula to be used for facetting the metagene plot. 
#'                    By default, no facetting occurs. See \code{plot}.}
#'    \item{group_by}{The metadata column used to build the color scale. By 
#'                    default, the combination of design and region name is 
#'                    used. See \code{plot}.}
#' }
#'
#' @section Processing methods:
#' 
#' Each of the following methods perform one step of metagene processing.
#' Most do not need to be called explicitly. Instead, you can simply call 
#' \code{produce_metagene}. However, you can use them to access intermediary
#' results: grouped coverages, binned coverages, split coverages, and long-form
#' data-frame of coverages with confidence intervals.
#'
#' @section group_coverages:
#'
#' \strong{Usage:}
#'
#'  \code{mg$group_coverages(design=NA, normalization=NA, 
#'            design_filter=NA, simplify=FALSE)}
#'
#' \strong{Description:}
#'
#'  This method normalizes genome-wide coverages, then groups
#'  them according to the specified design groups. It returns
#'  a list of possible read orientations (+, -, *), each element
#'  of which is either NULL (depending on the value of the 
#'  strand_specific parameter) or a list of possible design groups.
#'  In turn, the lists of design groups contain lists of \code{Rle}
#'  objects representing coverage over a specific chromosome or sequence.
#'
#' \strong{Parameters:}
#' \describe{
#'    \item{design}{A \code{data.frame} that describes the grouping of the bam files
#'            into design groups. The first column of the design should contain the 
#'            names of bam_files passed on initialization. Each subsequent columns
#'            represents a design group, that is to say a combination of bam files
#'            whose coverages should be grouped together into a logical unit.
#'            These columns should contain integer values indicating whether the 
#'            bam files on that row should be excluded (0), included as an`
#'            "input" (1) or included as a "control" (2) within the specified 
#'            design group. Control samples are used for "log2_ratio"
#'            normalization, but are ignored for no or "RPM" normalization.
#'            \code{NA} can be used keep previous design value. Default: \code{NA}.}
#'    \item{normalization}{The algorithm to use to normalize coverages. Possible
#'                        values are \code{NULL} (no normalization), "RPM" and "log2_ratio".
#'                        "RPM" transforms raw counts into Reads-Per-Million.
#'                        "log2_ratio" uses the formula log2((input RPM + 1) / (control RPM + 1))
#'                        to calculate a log-ratio between input and control. \code{NA} can 
#'                        be used keep the previous value. Default: \code{NA}}
#'    \item{design_filter}{A logical vector specifying which of the design groups specified
#'                         within the \code{design} parameter should be included in the metagene.
#'                         Useful for quickly reprocessing a subset of samples.
#'                         \code{NA} can be used keep previous design value. Default: \code{NA}}
#'    \item{simplify}{In single strand mode, set \code{simplify} to \code{TRUE} to return 
#'                    only the '*' coverage and omit the empty '+' and '-' components.
#'                    Default: \code{FALSE}}
#' }
#' @section bin_coverages:
#'
#' \strong{Usage:}
#'
#'  \code{mg$bin_coverages(bin_count=NA, region_filter=NA)}
#'
#' \strong{Description:}
#'
#' This method summarizes the coverage over regions of interests
#' into a specified number of bins. For each design group, it 
#' produces a matrix of binned coverages where each row represents a region,
#' and each column represents a bin. Those are returned in a named list where
#' each element contains the resulting matrix for a specific design group.
#'
#' \strong{Parameters:}
#' \describe{
#'    \item{bin_count}{The number of bins regions should be split into. The specified 
#'                     bin_count must always be equal or higher than the minimum size of
#'                     the specified regions. \code{NA} can be used to keep the previous
#'                     value. Default: \code{NA}.}
#'    \item{region_filter}{This parameter defines the subset of regions within the \code{regions}
#'                         parameter passed on initialization on which the metagene
#'                         should be generated. \code{region_filter} can be (1) a quosure, to be evaluated
#'                         in the context of the \code{region_metadata} data-frame, (2) a character
#'                         vector containing the names of the regions to be used or (3) a logical or numeric
#'                         vector to be used for subsetting. \code{NA} can be used to keep the previous
#'                         value. Default: \code{NA}}
#' }
#' @section split_coverages_by_regions:
#'
#' \strong{Usage:}
#'
#'  \code{mg$split_coverages_by_regions(split_by=NA)}
#'
#' \strong{Description:}
#'
#' This methods splits the matrices generated by mg$bin_coverages
#' into groups of regions where the values of the metadata columns
#' specified by \code{split_by} are homogeneous. It returns a list
#' where each element represents a design group: each of those
#' element is in turn a list representing groups of regions for which
#' all metadata values specified by "split_by" are equal. The leaf elements
#' of this list hierarchy are coverage matrices where each row represents a
#' region, and each column represents a bin.
#'
#' \strong{Parameters:}
#' \describe{
#'    \item{split_by}{A vector of column names from the region_metadata
#'                    parameter, as specified on metagene initialization. The 
#'                    selected columns must allow conversion into a factor.
#'                    By default, this is set to region_name, a metadata column
#'                    which is automatically generated by metagene. \code{NA} can 
#'                    be used to keep the previous value. Default: \code{NA}}
#' }
#' @section calculate_ci:
#'
#' \strong{Usage:}
#'
#'  \code{mg$calculate_ci(alpha = NA, sample_count = NA, resampling_strategy=NA)}
#'
#' \strong{Description:}
#'
#' This method calculates coverage means and confidence intervals for all 
#' design_group * region * bin combination. These are returned as a long-form
#' data-frame.
#'
#' \strong{Parameters:}
#' \describe{
#'    \item{alpha}{The alpha level of the confidence interval estimate.
#'                \code{NA} can be used to keep the previous value. 
#'                Default: \code{NA}}
#'    \item{sample_count}{The number of draws to perform in the bootstrap
#'                        calculations used to calculate the confidence inteval.  
#'                        \code{NA} can be used to keep the previous value. Default: \code{NA}}
#'    \item{resampling_strategy}{The resampling strategy to be used when performing the
#'                               bootstrap analysis, which can be either \code{'profile'}
#'                               or \code{'bin'}. In \code{'profile'} mode, whole profiles
#'                               across all bins are resampled. In \code{'bin'} mode,
#'                               each bin is resampled individually and independantly from
#'                               all others. \code{NA} can be used to keep the previous value.
#'                               Default: \code{NA}}
#' }
#' @section add_metadata:
#'
#' \strong{Usage:}
#'
#'  \code{mg$add_metadata(design_metadata=NA)}
#'
#' \strong{Description:}
#'
#' This method adds design group and region metadata to the data-frame
#' produced by \code{mg$calculate_ci} for easier plotting.
#'
#' \strong{Parameters:}
#' \describe{
#'    \item{design_metadata}{A data-frame containing metadata for the design groups.
#'                           It must contain as many rows as there are design groups,
#'                           and must contain at least one column named 'design'
#'                           which is used to match the rows to design groups.}
#' }
#' @section plot:
#'
#' \strong{Usage:}
#'
#'  \code{mg$plot(region_names = NULL, design_names = NULL,
#'                title = NA, x_label = NA, facet_by=NA, group_by=NA)}
#'
#' \strong{Description:}
#'
#' This method produces a ggplot object giving a graphical representation
#' of the metagene analysis.
#'
#' \strong{Parameters:}
#' \describe{
#'    \item{region_names}{The names of the regions to be plotted. If \code{NULL},
#'                        all the regions are plotted. Default: \code{NULL}.}
#'    \item{design_names}{The names of the design groups to be plotted. If 
#'                        \code{NULL}, all the design groups are
#'                        plotted. Default: \code{NULL}.}
#'    \item{title}{A title to add to the graph. \code{NA} can be used to keep 
#'                 the previous value. Default: \code{NA}}
#'    \item{x_label}{X-axis label for the metagene plot. \code{NA} can be 
#'                   used to keep the previous value. Default: \code{NA}.}
#'    \item{facet_by}{A formula to be used for facetting the metagene plot. This
#'                    formula can include any design metadata, or region_metadata 
#.                    columns that were part of the \code{split_by} argument.
#'                    \code{NA} can be used to keep the previous value.
#'                    Default: \code{NA}.}
#'    \item{group_by}{A string representing a single column from design_metadata or region_metadata
#'                    which will be used to group observations together into lines and which will
#'                    be used to generate the color scale.
#'                    \code{NA} can be used to keep the previous value.
#'                    Default: \code{NA}.}
#' }
#' @section Getter methods:
#' The following methods return various informations about the metagene object.
#'
#' \strong{mg$get_params()}
#' \describe{
#'    \item{}{Returns a list of all parameters used to perform this metagene analysis.}
#' }
#'
#' \strong{mg$get_design()}
#' \describe{
#'    \item{}{Returns the design used to perform this metagene analysis.}
#' }
#'
#' \strong{mg$get_regions()}
#' \describe{
#'    \item{}{Returns the regions used for this metagene analysis.}
#' }
#'
#' \strong{mg$get_data_frame(region_names = NULL, design_names = NULL)}
#' \describe{
#'    \item{}{Returns full data-frame of results.}
#'    \item{region_names}{The names of the regions to extract. If \code{NULL},
#'                        all the regions are returned. Default: \code{NULL}.}
#'    \item{design_names}{The names of the design groups to extract. If \code{NULL},
#'                        design groups are returned. Default: \code{NULL}.}
#' }
#'
#' \strong{mg$get_plot()}
#' \describe{
#'    \item{}{Returns the ggplot object generated by the \code{metagene2$plot} function.}
#' }
#'
#' \strong{mg$get_raw_coverages()}
#' \describe{
#'    \item{}{Returns raw coverages over the regions specified on initialization.}
#'}
#'
#' \strong{mg$get_normalized_coverages()}
#' \describe{
#'    \item{}{Returns normalized coverages over the regions specified on initialization.}
#'}
#'
#' @examples
#' mg <- metagene2$new(regions = get_demo_regions(), bam_files = get_demo_bam_files())
#' \dontrun{
#'    mg$plot()
#' }
#'
#' @import GenomicRanges
#' @import BiocParallel
#' @importFrom dplyr filter
#' @importFrom dplyr mutate
#' @importFrom dplyr pull
#' @importFrom R6 R6Class
#' @importFrom data.table data.table
#' @importFrom tools file_path_sans_ext
#' @importFrom rtracklayer import
#' @importFrom purrr map
#' @importFrom magrittr "%>%"
#' @export
#' @format A metagene experiment manager
metagene2 <- R6Class("metagene2",
    public = list(
    # Methods
        initialize = function(regions, bam_files, padding_size = 0,
                              cores = SerialParam(), verbose = FALSE,
                              force_seqlevels = FALSE, paired_end = FALSE,
                              assay = 'chipseq', strand_specific=FALSE,
                              paired_end_strand_mode=2,
                              region_mode="auto", region_metadata=NULL, 
                              extend_reads=0, invert_strand=FALSE, ...) {

            # Validate the format of bam_files, since it is used to preprocess certain
            # parameters before initialization.
            validate_bam_files_format(bam_files)
            validate_assay(assay)
            
            if(region_mode=="auto") {
                region_mode = ifelse(assay=='rnaseq', "stitch", "separate")
            }
            
            if(assay=='rnaseq') {
                strand_specific=TRUE
            }
            
            # Define default design.
            default_design = private$get_complete_design(bam_files)
            design_metadata = data.frame(design=as.character(default_design[,1]), stringsAsFactors=FALSE)
            
            # Initialize parameter handler.
            private$ph <- parameter_manager$new(
                param_values=list(
                    design=default_design,
                    design_metadata=design_metadata,
                    bam_files=private$name_from_path(bam_files),
                    padding_size=padding_size,
                    verbose=verbose,
                    force_seqlevels=force_seqlevels,
                    paired_end=paired_end,
                    assay=tolower(assay),
                    strand_specific=strand_specific,
                    paired_end_strand_mode=paired_end_strand_mode,
                    normalization=NULL,
                    avoid_gaps=FALSE,
                    gaps_threshold=0,
                    bin_count=100,
                    alpha=0.05,
                    sample_count=1000,
                    region_mode=region_mode,
                    split_by="region_name",
                    region_filter=TRUE,
                    design_filter=TRUE,
                    resampling_strategy="bin",
                    facet_by=NULL,
                    group_by=NULL,
                    title=NULL,
                    x_label=NULL,
                    extend_reads=extend_reads,
                    invert_strand=invert_strand),
                param_validations=list(
                    design=private$validate_design,
                    bam_files=validate_bam_files,
                    padding_size=validate_padding_size,
                    verbose=validate_verbose,
                    force_seqlevels=validate_force_seqlevels,
                    assay=validate_assay,
                    normalization=validate_normalization,
                    avoid_gaps=validate_avoid_gaps,
                    gaps_threshold=validate_gaps_threshold,
                    bin_count=validate_bin_count,
                    alpha=validate_alpha,
                    sample_count=validate_sample_count,
                    region_mode=validate_region_mode,
                    extend_reads=validate_extend_reads,
                    split_by=validate_split_by),
                overall_validation=validate_combination)
                
            # Update any other parameter passed as a ... argument.
            # Since the parameter manager is locked, any non-existant
            # parameter will cause an error.
            private$ph$update_params(...)
                
            # Prepare objects for parralel processing.
            self$set_cores(cores)
            
            # Prepare bam files
            bm = private$start_bm("Prepare bam files")
            private$bam_handler <- Bam_Handler$new(private$ph$get("bam_files") , cores = cores,
                                        paired_end = paired_end,
                                        strand_specific=strand_specific,
                                        paired_end_strand_mode=paired_end_strand_mode,
                                        extend_reads=private$ph$get("extend_reads"),
                                        invert_strand=private$ph$get("invert_strand"))
            private$stop_bm(bm)

            # Prepare regions
            private$print_verbose("Prepare regions...")
            private$regions <- private$prepare_regions(regions, private$ph$get("region_mode"), region_metadata)

            # Parse bam files
            bm = private$start_bm("Producing coverage")
            private$coverages <- private$produce_coverages()
            private$stop_bm(bm)            
        },
        get_bam_count = function(filename) {
            # Parameters validation are done by Bam_Handler object
            private$bam_handler$get_aligned_count(filename)
        },
        get_params = function() {
            private$ph$get_all()
        },
        get_design = function() {
            private$ph$get("design")
        },
        get_design_group_names = function() {
            private$get_design_names_internal(private$ph$get('design'))
        },        
        get_regions = function() {
            return(private$regions)
        },
        get_regions_metadata = function() {
            return(private$region_metadata)
        },
        get_split_regions = function() {
            return(split_regions(private$regions,
                                 private$region_metadata, 
                                 private$ph$get("split_by")))
        },
        get_data_frame = function(region_names = NULL, design_names = NULL) {
            if(!is.null(private$ci_meta_df)) {
                results = private$ci_meta_df
            } else if(!is.null(private$ci_df)) {
                results = private$ci_df
            } else {
                return(NULL)
            }
            
            region_subset = TRUE
            if (!is.null(region_names)) {
                region_subset = results$region %in% region_names
            }
            
            design_subset = TRUE
            if (!is.null(design_names)) {
                design_subset = results$design %in% design_names
            }
            return(results[region_subset & design_subset,,drop=FALSE])
        },
        get_plot = function() {
            private$graph
        },
        get_raw_coverages = function() {
            if(!private$ph$get('strand_specific')) {
                return(private$get_raw_coverages_internal()[['*']])
            } else {
                return(private$get_raw_coverages_internal())
            }        
        },
        get_normalized_coverages = function() {
            if(!private$ph$get('strand_specific')) {
                return(private$get_normalized_coverages_internal()[['*']])
            } else {
                return(private$get_normalized_coverages_internal())
            }
        },
        set_cores = function(cores) {
            validate_cores(cores)
            if(is.null(private$parallel_job)) {
                private$parallel_job <- Parallel_Job$new(cores)
            } else {
                private$parallel_job$set_core_count(cores)
            }
        },
        group_coverages = function(design=NA, normalization=NA, design_filter=NA, simplify=TRUE) {
            # Clean up the design so it'll have the expected format.
            design = private$clean_design(design, private$ph$get("bam_files"))

            if(private$ph$have_params_changed(design)) {
                # design has changed.
                # Generate a new design metadata.
                private$ph$set("design_metadata", data.frame(design=design[,1]))
            }
            
            private$update_params_and_invalidate_caches(design, normalization, design_filter)
            
            if(is.null(private$grouped_coverages)) {
                bm <- private$start_bm("Grouping and normalizing coverages")
                
                # Identify design subset.
                design_col_to_keep = rep_len(private$ph$get("design_filter"), ncol(private$ph$get("design")) - 1)
                design_col_to_keep = 1 + which(design_col_to_keep)
                design_subset = private$ph$get("design")[, c(1, design_col_to_keep)]
                
                # Determine how data is to be merged
                if(is.null(private$ph$get("normalization"))) {
                    merge_operation = "+"
                } else if(private$ph$get("normalization")=="RPM") {
                    merge_operation = "mean"
                } else if(private$ph$get("normalization")=="log2_ratio") {
                    merge_operation = "log2_ratio"
                } else {
                    stop("Unsupported normalization value.")
                }
                
                private$grouped_coverages = lapply(private$get_coverages_internal(),
                                                   group_coverages_s,
                                                   design=design_subset,
                                                   bam_handler=private$bam_handler,
                                                   merge_operation=merge_operation)
                private$stop_bm(bm)                                                   
            }

            if(!private$ph$get("strand_specific") && simplify) {
                invisible(private$grouped_coverages[["*"]])
            } else {
                invisible(private$grouped_coverages)
            }
        },
        bin_coverages = function(bin_count=NA, region_filter=NA) {
            # Make sure the previous step has been performed.
            self$group_coverages()
            private$update_params_and_invalidate_caches(bin_count, region_filter)
            
            if(is.null(private$binned_coverages)) {
                bm = private$start_bm("Binning coverages")
                private$binned_coverages = bin_coverages_s(private$grouped_coverages,
                                                           regions=private$select_regions(private$ph$get("region_filter")),
                                                           bin_count=private$ph$get("bin_count"))
                private$stop_bm(bm)
            }
           
            invisible(private$binned_coverages)
        },
        split_coverages_by_regions = function(split_by=NA) {
            # Make sure the previous step has been performed.
            self$bin_coverages()
            private$update_params_and_invalidate_caches(split_by)
            
            if(is.null(private$split_coverages)) {
                bm = private$start_bm("Splitting coverages by region type")
                split_res = split_matrices(private$binned_coverages,
                                           private$select_region_metadata(private$ph$get("region_filter")),
                                           private$ph$get('split_by'))
                private$split_coverages = split_res$Matrices
                private$split_metadata_cache = split_res$Metadata
                private$stop_bm(bm)                                         
            }
            
            invisible(private$split_coverages)
        },        
        calculate_ci = function(alpha=NA, sample_count=NA, resampling_strategy=NA) {
            # Make sure the previous steps have been completed.
            self$split_coverages_by_regions()
            private$update_params_and_invalidate_caches(alpha, sample_count, resampling_strategy)

            if(is.null(private$ci_df)) {
                bm = private$start_bm("Producing data-frame")
                private$ci_df = calculate_matrices_ci(private$split_coverages,
                                                      private$ph$get('sample_count'), 
                                                      private$ph$get('alpha'),
                                                      private$ph$get('resampling_strategy'),
                                                      private$parallel_job)
                private$stop_bm(bm)
            }
            
            invisible(private$ci_df)
        },
        add_metadata = function(design_metadata=NA) {
            # Make sure the previous steps have been completed.
            self$calculate_ci()
            private$update_params_and_invalidate_caches(design_metadata)
                                                    
            if(is.null(private$ci_meta_df)) {
                filtered_design = private$ph$get("design_metadata")[private$ph$get("design_filter"),, drop=FALSE]
                private$ci_meta_df = add_metadata_to_ci(private$ci_df,
                                                        private$split_metadata_cache,
                                                        filtered_design)
            }
            
            invisible(private$ci_meta_df)
        },
        plot = function(region_names = NULL, design_names = NULL, title = NA,
                        x_label = NA, facet_by=NA, group_by=NA) {
            # 1. Get the correctly formatted table
            self$add_metadata()

            private$update_params_and_invalidate_caches(title, x_label, facet_by, group_by)
            
            plot_df <- self$get_data_frame(region_names, design_names)
            
            # 3. Produce the graph
            if (is.null(title)) {
                title <- paste(unique(plot_df[["group"]]), collapse=" vs ")
            }
            private$graph <- private$plot_graphic(df = plot_df, 
                                        title = private$ph$get("title"), 
                                        x_label = private$ph$get("x_label"),
                                        facet_by=private$ph$get("facet_by"),
                                        group_by=private$ph$get("group_by"))
            private$graph
        },
        produce_metagene = function(...) {
            private$update_params_and_invalidate_caches(...)
            self$plot()
        },
        plot_single_region = function(region, facet_by=NA, group_by=NA,
                                      no_binning=FALSE) {
            # Clone the mg object
            single_mg = self$clone(deep=TRUE)
            
            # Select the one region to plot, and make sure it ends up as a single GRange object.
            single_region = private$select_regions(region)
            stopifnot(length(single_region)==1)
            if(is(self$get_regions(), "GRangesList")) {
                single_region = single_region[[1]]
            }
            
            # If we are skipping binning, make the number of bins equal
            # the length in nucleotide.
            if(no_binning) {
                bin_count = sum(width(single_region))
            } else {
                bin_count = private$ph$get("bin_count")
            }
            
            # Re-bin with the new bin_count and the new filter.
            single_mg$bin_coverages(bin_count=bin_count, region_filter=region)
            
            # With a single region, splitting cannot be applied, so
            # we'll pass in a single row-name that we know to be valid.
            single_mg$split_coverages_by_regions(split_by="region_name")
            
            # Produce the new base single plot.
            out_plot = single_mg$plot(facet_by=facet_by, group_by=group_by)
            
            # Adjust the plot if binning was skipped.
            if(no_binning) {
                # New x label.
                out_plot <- out_plot + labs(x="Distance in nucleotides") 
                
                # If dealing with a stitched region, display "exon" boundaries as.
                # vertical lines.
                if(length(single_region) > 1) {
                    cumulative_width = 0
                    for(i in width(single_region)) {
                        cumulative_width = cumulative_width + i
                        out_plot <- out_plot + geom_vline(xintercept=cumulative_width)
                    }
                }
            }

            # Return the new plot.
            out_plot
        },
        replace_region_metadata = function(region_metadata) {
            # Validate that the old and new metadata have the same number of rows.
            if(nrow(region_metadata)!=nrow(private$region_metadata)) {
                stop("region_metadata must have one row per region.")
            }
            
            # Make sure the region_name column is still present.
            if(is.null(region_metadata$region_name)) {
                message("region_name is missing from the new metadata. Recreating it.")
                region_metadata$region_name = private$region_metadata$region_name
            }
            
            # Make sure that the split_by columns are all present and did not change.
            # If they did, invalidate everything after split_by
            for(split_column in private$ph$get("split_by")) {
                if(is.null(region_metadata[[split_column]]) ||
                   !all(region_metadata[[split_column]]==private$region_metadata[[split_column]])) {
                    # The split_by columns were removed or changed. Restore the original parameter value.
                    message("Replace region_metadata with metadata which would result in a different ",
                            "region split. All caches at the 'split_regions' step will be invalidated, ",
                            "and split_by will be reset to its default value.")
                    private$update_params_and_invalidate_caches(split_by="region_name")
                }
            }
            
            # Everything checks out, replace the metadata.
            private$region_metadata = region_metadata
        }
    ),
    private = list(
        # Region information. Both are kept separate from the parameter
        # handler since both can be very large, and making comparisons
        # to see if they've changed would be onerous.
        regions = GRangesList(),
        region_metadata = NULL,
        
        # Caches of intermediary step results.
        coverages = list(),
        grouped_coverages = NULL,
        binned_coverages = NULL,
        split_coverages = NULL,
        ci_df=NULL,
        ci_meta_df=NULL,
        graph = NULL,

        # Internal caches.
        split_metadata_cache = NULL,

        # Objects for handling bams, parameters and parallel jobs.
        bam_handler = "",
        parallel_job = NULL,
        ph=NULL,
        
        print_verbose = function(to_print) {
            if (private$ph$get("verbose")) {
                cat(paste0(to_print, "\n"))
            }
        },
        prepare_regions = function(regions, region_mode, region_metadata) {
            if (is(regions, "character")) {
                # Validation specific to regions as a vector
                if (!all(vapply(regions, file.exists, TRUE))) {
                    stop("regions is a list of files, but some of those files do not exist.")
                }
                regions = private$import_regions_from_disk(regions)
            }

            if (is(regions, "GRanges")) {
                if(region_mode=="stitch") {
                    if(length(unique(seqnames(regions))) > 1) {
                        stop(paste0("In stitch region_mode, such as in rnaseq assays, regions should be a ",
                                    "GRangesList of transcripts, or a GRanges ",
                                    " object representing a single transcript. ",
                                    "Here regions spans several seqnames, indicating ",
                                    "it might include many transcripts."))
                    }
                }
                regions <- GRangesList("regions" = regions)                
            } else if (is(regions, "list")) {
                regions <- GRangesList(regions)
            } else if (!is(regions, "GRangesList")) {
                stop(paste0("regions must be either a vector of BED ",
                            "filenames, a GRanges object or a GrangesList object"))            
            }
            
            # If regions do not have names, generate generic names for them.
            if (is.null(names(regions))) {
                names(regions) <- paste0("region_", seq_along(regions))
            }
            
            # In stitch mode, make sure disjoint regions part of the same
            # group do not overlap each other.
            if (region_mode=="stitch"){
                # If some "exons" overlap, then the total size will be smaller than the reduced size.
                total_size = sum(width(regions))
                reduced_size = sum(width(GenomicRanges::reduce(regions)))
                if(!all(total_size==reduced_size)) {
                    stop("In stitch region_mode, no overlap should exist between the individual ",
                         "GRanges making up the elements of the GRangesList")
                }
            }
            # TODO: Check if there is a id column in the mcols of every ranges.
            #    If not, add one by merging seqnames, start and end.

            # Apply padding and sortSeqLevels
            pad_regions = function(x, padding_size) {
                start(x) <- pmax(start(x) - padding_size, 1)
                end(x) <- end(x) + padding_size
                # Clean seqlevels
                x <- sortSeqlevels(x)
                x
            }
            regions = GRangesList(lapply(regions, pad_regions, padding_size = private$ph$get("padding_size")))
            
            # Add a region column to all GRangesList elements.
            regions_with_extra_col = list()
            for(region_name in names(regions)) {
                regions_with_extra_col[[region_name]] = regions[[region_name]]
                mcols(regions_with_extra_col[[region_name]])$region_name = region_name
            }
            regions = GRangesList(regions_with_extra_col)
            
            # In separate mode, simplify regions into a single GRanges object.
            if(region_mode=="separate") {
                regions = unlist(regions, use.names=FALSE)
            }

            # Build metadata from the mcols of the given regions.
            if(region_mode=="separate") {
                mcol_metadata = mcols(regions)
            } else {
                first_or_null = function(x) {
                    if(length(x)>0) {
                        return(mcols(x)[1,, drop=FALSE])
                    } else {
                        return(NULL)
                    }
                }
                mcol_metadata = do.call(rbind, lapply(regions, first_or_null))
                rownames(mcol_metadata) = names(regions)
            }

            # Merge the passed metadata object with the mcol metadata.
            if(is.null(region_metadata)) {
                private$region_metadata = mcol_metadata
            } else {
                stopifnot(nrow(region_metadata)==length(regions))
                non_duplicate_columns = setdiff(colnames(mcol_metadata), colnames(region_metadata))
                if(length(non_duplicate_columns) > 0) {
                    private$region_metadata = cbind(region_metadata, mcol_metadata[,non_duplicate_columns, drop=FALSE])
                }
            }
            
            if(!is.null(names(regions)) && is.null(rownames(private$region_metadata))) {
                rownames(private$region_metadata) = names(regions)
            }            
            
            return(regions)
        },
        produce_coverages = function() {
            if(private$ph$get("region_mode")=="stitch") {
                regions = BiocGenerics::unlist(private$regions)
            } else {
                regions = private$regions
            }
            
            regions <- GenomicRanges::reduce(regions)
            bam_files = private$ph$get("bam_files")

            res <- private$parallel_job$launch_job(
                        data = bam_files,
                        FUN = private$bam_handler$get_coverage,
                        regions = regions,
                        force_seqlevels= private$ph$get("force_seqlevels"),
                        simplify=FALSE)

            names(res) <- bam_files
            
            # Turn res inside out so that strand is at the top level,
            # and bam files on the second.
            res = list('+'=purrr::map(res, '+'),
                       '-'=purrr::map(res, '-'),
                       '*'=purrr::map(res, '*'))
            replace_nulls = function(x) { 
                if(all(purrr::map_lgl(x, is.null))) {
                    return(NULL)
                } else {
                    return(x)
                }
            }
            res = lapply(res, replace_nulls)

                       
            sortseq_or_null <- function(x) {
                if(is.null(x)) {
                    return(x)
                } else {
                    return(lapply(x, GenomeInfoDb::sortSeqlevels))
                }
            }
            
            lapply(res, sortseq_or_null)
        },
        plot_graphic = function(df, title, x_label, facet_by, group_by) {
            # Prepare x label
            assay = private$ph$get("assay")
            if (is.null(x_label)) {
                x_label <- "Distance in bins"
            }

            # Prepare y label
            y_label <- "Mean coverage"
            if (is.null(private$ph$get("normalization"))) {
                y_label <- paste(y_label, "(raw)")
            } else if(private$ph$get("normalization") == "log2_ratio") {
                y_label <- "log2((Treatment RPM + 1) / (Control RPM + 1))"
            } else {
                y_label <- paste(y_label, "(RPM)")
            }
            
            # Produce plot
            p <- plot_metagene(df, facet_by=facet_by, group_by=group_by) +
                ylab(y_label) +
                xlab(x_label) +
                ggtitle(title)
            p
        },
        get_bam_names = function(filenames) {
            if (all(filenames %in% colnames(private$ph$get("design"))[-1])) {
                filenames
            } else {
                stopifnot(private$check_bam_files(filenames))
                vapply(filenames,
                    private$bam_handler$get_bam_name,
                    character(1))
            }
        },
        check_bam_files = function(bam_files) {
            all(vapply(bam_files,
            function(x) {
                !is.null((private$bam_handler$get_bam_name(x)))
            },
            logical(1)))
        },
        get_design_names_internal = function(design) {
            if(is.null(design)) {
                return(NULL)
            } else {
                return(colnames(design)[-1])
            }
        },
        get_design_number = function(design) {
            if(is.null(design)) {
                return(NULL)
            } else {
                return(ncol(design) - 1)
            }        
        },
        get_bam_in_design = function(design, design_name) {
            private$get_x_in_design(design, design_name, 1)
        },
        get_control_in_design = function(design, design_name) {
            private$get_x_in_design(design, design_name, 2)        
        },
        get_x_in_design = function(design, design_name, value) {
            if(is.null(design)) {
                return(NULL)
            } else {
                return(design$Samples[design[[design_name]] == value])
            }        
        },
        get_bam_by_design = function(design) {
            map(private$get_design_names_internal(design), ~private$get_bam_in_design(design, .x))
        },
        get_coverage_names = function(coverages) {
            stopifnot(length(setdiff(names(coverages), c("+", "-", "*")))==0)
            if(!is.null(coverages[['+']])) {
                return(names(coverages[['+']]))
            } else if(!is.null(coverages[['-']])) {
                return(names(coverages[['-']]))
            } else if(!is.null(coverages[['*']])) {
                return(names(coverages[['*']]))
            }
        },
        get_raw_coverages_internal = function() {
            private$coverages
        },
        get_normalized_coverages_internal = function() {
            # Define a function which will normalize coverage for a single
            # BAM file.
            normalize_coverage <- function(work_item) {
                weight <- 1 / (work_item$Count / 1000000)
                return(list(Strand=work_item$Strand, 
                            BamFile=work_item$BamFile, 
                            Coverage=work_item$Coverage * weight))
            }
            
            # Get the raw coverages.
            coverages <- private$get_raw_coverages_internal()
            
            # Serialize the workload
            work_items = list()
            for(strand in c("+", "-", "*")) {
                if(!is.null(coverages[[strand]])) {
                    for(bam_file in names(coverages[[strand]])) {
                        work_items[[length(work_items) + 1]] = list(Coverage=coverages[[strand]][[bam_file]],
                                                                  Strand=strand,
                                                                  BamFile=bam_file,
                                                                  Count=private$bam_handler$get_aligned_count(bam_file))
                    }
                }
            }
            
            serialized_coverages <- private$parallel_job$launch_job(data = work_items,
                                                                    FUN = normalize_coverage)
                                                                    
            # Deserialize the coverages.
            norm_coverages = list("+"=NULL, "-"=NULL, "*"=NULL)
            for(i in serialized_coverages) {
                if(is.null(norm_coverages[[i$Strand]])) {
                    norm_coverages[[i$Strand]] = list()
                }
                
                norm_coverages[[i$Strand]][[i$BamFile]] = i$Coverage
            }

            norm_coverages
        },
        get_coverages_internal = function() {
            if (!is.null(private$ph$get("normalization"))) {
                coverages <- private$get_normalized_coverages_internal()    
            } else {
                coverages <- private$get_raw_coverages_internal()
            }
            
            return(coverages)
        },
        validate_design = function(design) {
            validate_design_format(design)
            private$validate_design_values(design)
        },
        validate_design_values = function(design) {
            # At least one file must be used in the design
            if (sum(rowSums(design[ , -1, drop=FALSE]) > 0) == 0) {
                stop("At least one BAM file must be used in the design.")
            }
        
            # Check if used bam files exist.
            non_empty_rows = rowSums(design[, -1, drop=FALSE]) > 0
            if (!all(file.exists(as.character(design[non_empty_rows,1])))) {
                warning("At least one BAM file does not exist")
            }
        },        
        get_complete_design = function(bam_files) {
            bam_files = private$name_from_path(bam_files)
            
            # Concatenate the bam names and the identity matrix, then
            # rename all columns but the first.
            design <- cbind(data.frame(bam_files = bam_files, stringsAsFactors=FALSE), diag(length(bam_files)))
            colnames(design)[-1] = names(bam_files)
            
            design
        },
        name_from_path = function(file_paths) {
            alt_names = tools::file_path_sans_ext(basename(file_paths))
            if(is.null(names(file_paths))) {
                names(file_paths) = alt_names
            } else {
                names(file_paths) = ifelse(names(file_paths)=="", alt_names, names(file_paths))
            }
            return(file_paths)
        },
        import_regions_from_disk = function(file_names) {
            file_names = private$name_from_path(file_names)
            import_file <- function(region) {
                ext <- tolower(tools::file_ext(region))
                if (ext == "narrowpeak") {
                    extraCols <- c(signalValue = "numeric",
                                    pValue = "numeric", qValue = "numeric",
                                    peak = "integer")
                    rtracklayer::import(region, format = "BED",
                                        extraCols = extraCols)
                } else if (ext == "broadpeak") {
                    extraCols <- c(signalValue = "numeric",
                                    pValue = "numeric", qValue = "numeric")
                    rtracklayer::import(region, format = "BED",
                                        extraCols = extraCols)
                } else if (ext == "gtf" | ext == "gff") {
                    gxf_regions = rtracklayer::import(region)
                    if("gene_id" %in% colnames(mcols(gxf_regions))) {
                        return(split(gxf_regions, gxf_regions$gene_id))
                    } else {
                        return(gxf_regions)
                    }
                } else {
                    rtracklayer::import(region)
                }
            }
            regions <- private$parallel_job$launch_job(data = file_names,
                                                    FUN = import_file)
            names(regions) <- names(file_names)
            return(regions)
        },
        deep_clone = function(name, value) {
            # With x$clone(deep=TRUE) is called, the deep_clone gets invoked once for
            # each field, with the name and value.
            if (name == "ph") {
                 # `a` is an environment, so use this quick way of copying
                 value$clone()
            } else {
                # For all other fields, just return the value
                value
            }
        },
        clean_design = function(design, bam_files) {
            # If no design is provided, use the default one.
            if(is.null(design)) {
                design = private$get_complete_design(bam_files)
            }
            
            # NA will be overwritten with the previous design later on.
            if(!is.data.frame(design) && is.na(design)) {
                return(NA)
            }
            
            # Make sure the design is in the correct format (data-frame
            # with at least 2 columns) before we try improving it.
            validate_design_format(design)
            
            # Standardize names used in first column to match those
            # used in bam_files.
            if(!all(as.character(design[,1]) %in% bam_files)) {
                stop("Design contains samples absent from the list of bam files provided on initialization.")
            }

            return(design)
        },
        start_bm = function(msg) {
            private$print_verbose(paste0(msg, "..."))
            #return(list(Message=msg, Time=Sys.time(), Memory=pryr::mem_used()))
            return(list(Message=msg, Time=Sys.time()))
        },
        stop_bm = function(bm_obj) {
            bm_after_time = Sys.time()
            #bm_after_mem = pryr::mem_used()            
            bm_time = difftime(bm_after_time, bm_obj$Time, unit="secs")
            #bm_mem = structure(bm_after_mem - bm_obj$Memory, class="bytes")
            
            private$print_verbose(paste0("BENCHMARK-TIME-", bm_obj$Message, ":", bm_time))
            #private$print_verbose(paste0("BENCHMARK-MEMORY-", bm_obj$Message, ":", bm_mem))        
        },
        update_params_and_invalidate_caches = function(...) {
            # This prologue makes it possible to infer parameter names from the
            # name of the variable it is passed in. This allows us to avoid
            # design=design, bin_count=bin_count repetitive code.
            #
            # It cannot be factorized into a function, since in any further call,
            # the argument list will deparse as "list(...)".
            param_names_alt = unlist(lapply( substitute(list(...)), deparse)[-1])
            arg_list = list(...)
            if(is.null(names(arg_list))) {
                names(arg_list) = param_names_alt
            } else {
                names(arg_list) = ifelse(names(arg_list)=="", param_names_alt, names(arg_list))
            }
        
            # Associate each parameter witht he step it is used in.
            param_step_map = c(design="group_coverages",
                               normalization="group_coverages", 
                               noise_removal="group_coverages", 
                               design_filter="group_coverages",
                               bin_count="bin_coverages", 
                               region_filter="bin_coverages",
                               split_by="split_coverages",
                               region_filter="split_coverages",
                               alpha="calculate_ci", 
                               sample_count="calculate_ci", 
                               resampling_strategy="calculate_ci",
                               design_metadata="add_metadata")
                               
            # Associate each step with the cache it generates,
            # in reverse order, so we can easily determine which
            # caches to invalidate when a particular step needs to be re-run.
            step_cache_map = c(add_metadata="ci_meta_df",
                               calculate_ci="ci_df",
                               split_coverages="split_coverages",
                               bin_coverages="binned_coverages",
                               group_coverages="grouped_coverages")
                               
            cache_invalidated=FALSE
            
            # Loop over all passed-in parameters.
            if(length(arg_list) > 0) {
                for(arg_index in seq_along(arg_list)) {
                    arg_name=names(arg_list)[arg_index]
                    
                    # Determine if the parameter has changed from its last value.
                    if(do.call(private$ph$have_params_changed, arg_list[arg_index])) {
                        private$print_verbose(paste0(arg_name, " has changed.\n"))
                        
                        # Determine which step the parameter belongs to.
                        invalidated_step = param_step_map[names(arg_list)[arg_index]]
                        if(!is.na(invalidated_step)) {
                            # Invalidate all caches for the step the parameter belonged to,
                            # as well as all caches for downsteam steps.
                            invalidated_caches = step_cache_map[seq_len(which(names(step_cache_map)==invalidated_step))]
                            private$print_verbose(paste0(paste0(invalidated_caches, collapse=", "), " will be invalidated.\n"))
                            for(cache in invalidated_caches) {
                                private[[cache]] = NULL
                                cache_invalidated = TRUE
                            }
                        }
                    }
                }
            
                do.call(private$ph$update_params, arg_list)
            }
            
            return(cache_invalidated)
        },
        select_region_indices = function(selector) {
            if("quosure" %in% class(selector)) {
                # Using dplyr for this because I'm not comfortable enough with
                # quosures.
                selected_indices = as.data.frame(private$region_metadata) %>%
                                       dplyr::mutate(METAGENE_IDX=seq_len(dplyr::n())) %>%
                                       dplyr::filter(!! selector) %>%
                                       dplyr::pull(METAGENE_IDX)
            } else if(is(selector, "character")) {
                if(!is.null(names(self$get_regions()))) {
                    selected_indices = selector
                } else {
                    selected_indices = self$get_regions()$region_name %in% selector
                }
            } else if(is.numeric(selector) || is.logical(selector) || is.numeric(selector)) {
                selected_indices = selector
            }
            
            selected_indices
        },
        select_regions = function(selector) {
            self$get_regions()[private$select_region_indices(selector)]
        },
        select_region_metadata = function(selector) {
            private$region_metadata[private$select_region_indices(selector),, drop=FALSE]
        }
    )
)

Try the metagene2 package in your browser

Any scripts or data that you put into this service are public.

metagene2 documentation built on March 19, 2021, 6 p.m.