SOMNiBUS: Smooth modeling of bisulfite sequencing

Documented in runSOMNiBUS

#' @title Wrapper function running the smoothed-EM algorithm to estimate 
#' covariate effects and test regional association in Bisulfite 
#' Sequencing-derived methylation data
#' @description This function splits the methylation data into regions 
#' (according to different approaches) and, for each region, fits a 
#' (dispersion-adjusted) binomial regression model to regional methylation data,
#' and reports the estimated smooth covariate effects and regional p-values 
#' for the test of DMRs (differentially methylation regions). Over or under
#' dispersion across loci is accounted for in the model by the combination
#' of a multiplicative dispersion parameter (or scale parameter) and a
#' sample-specific random effect.
#' @description This method can deal with outcomes, i.e. the number of
#' methylated reads in a region, that are contaminated by known
#' false methylation calling rate (\code{p0}) and false non-methylation
#' calling rate (\code{1-p1}).
#' @description The covariate effects are assumed to smoothly vary across
#' genomic regions. In order to estimate them, the algorithm first
#' represents the functional parameters by a linear combination
#' of a set of restricted cubic splines (with dimension
#' \code{n.k}), and a smoothness penalization term
#' which depends on the smoothing parameters \code{lambdas} is also
#' added to control smoothness. The estimation is performed by an iterated 
#' EM algorithm. Each M step constitutes an outer Newton's iteration to estimate
#' smoothing parameters \code{lambdas} and an inner P-IRLS iteration to estimate
#' spline coefficients \code{alpha} for the covariate effects. 
#' Currently, the computation in the M step depends the implementation of
#' \code{gam()} in package \code{mgcv}.
#' @param dat a data frame with rows as individual CpGs appearing
#' in all the samples. The first 4 columns should contain the information of 
#' `Meth_Counts` (methylated counts), `Total_Counts` (read depths), 
#' `Position` (Genomic position for the CpG site) and `ID` (sample ID). 
#' The covariate information, such as disease status or cell type composition,
#' are listed in column 5 and onwards.
#' @param split this \code{list} must contain at least the element 
#' \code{approach} which corresponds to the partitioning approach used to split 
#' the data into independent regions. The partitioning methods 
#' available are:
#' \itemize{
#' \item \code{"region"} (partitioning based on the spacing of CpGs), 
#' \item \code{"density"} (partitioning based on CpG density),
#' \item \code{"chromatin"} (partitioning based on chromatin states),
#' \item \code{"gene"} (partitioning based on gene regions),
#' \item \code{"granges"} (partitioning based on user-specific 
#' annotations provided as a \code{GenomicRanges} object),
#' \item \code{"bed"} (partitioning based on user-specific 
#' annotations provided in a BED file).
#' }
#' This list should also contain additional parameters specific to each 
#' partitioning approach (see the documentation of each approach for details).
#' @param min.cpgs positive integer defining the minimum number of 
#' CpGs within a region for the algorithm to perform optimally. 
#' The default value is 50.
#' @param max.cpgs positive integer defining the maximum number of 
#' CpGs within a region for the algorithm to perform optimally. 
#' The default value is 2000.
#' @param n.k a vector of basis dimensions for the intercept and
#' individual covariates. \code{n.k} specifies an upper limit of the degrees of
#' each functional parameters. The length of n.k should equal to the number of 
#' covariates plus 1 (for the intercept)). We recommend basis dimensions n.k, 
#' approximately equal to the number of unique CpGs in the region divided by 20.
#' This parameter will be computed automatically, when several regions are 
#' generated by the partitioning function.
#' @param p0 the probability of observing a methylated read when the underlying
#' true status is unmethylated. \code{p0} is the rate of false methylation 
#' calls, i.e. false positive rate.
#' @param p1 the probability of observing a methylated read when the underlying 
#' true status is methylated. \code{1-p1} is the rate of false non-methylation 
#' calls, i.e. false negative rate.
#' @param covs a vector of covariate names. The covariates with names in 
#' \code{covs} will be included in the model and their covariate effects will be
#' estimated. The default is to fit all covariates in \code{dat}
#' @param Quasi whether a Quasi-likelihood estimation approach will be used; 
#' in other words, whether a multiplicative dispersion is added in the model 
#' or not.
#' @param epsilon numeric; stopping criterion for the closeness of estimates of
#' spline coefficients from two consecutive iterations.
#' @param epsilon.lambda numeric; stopping criterion for the closeness of 
#' estimates of smoothing parameter \code{lambda} from two consecutive 
#' iterations.
#' @param maxStep the algorithm will step if the iteration steps exceed 
#' \code{maxStep}.
#' @param binom.link the link function used in the binomial regression model; 
#' the default is the logit link.
#' @param method the method used to estimate the smoothing
#' parameters. The default is the 'REML' method which is generally better than
#' prediction based criterion \code{GCV.cp}.
#' @param RanEff whether sample-level random effects are added or not
#' @param reml.scale whether a REML-based scale (dispersion) estimator is used.
#' The default is Fletcher-based estimator.
#' @param scale negative values mean scale parameter should be estimated; if a 
#' positive value is provided, a fixed scale will be used.
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return This function returns a \code{list} of models (one by independent
#' region) including objects:
#' \itemize{
#' \item \code{est}: estimates of the spline basis coefficients \code{alpha}
#' \item \code{lambda}: estimates of the smoothing parameters for each 
#' functional parameters
#' \item \code{est.pi}: predicted methylation levels for each row in the input
#' \code{data}
#' \item \code{ite.points}: estimates of \code{est}, \code{lambda} at each EM
#' iteration
#' \item \code{cov1}: estimated variance-covariance matrix of the basis
#' coefficients \code{alphas}
#' \item \code{reg.out}: regional testing output obtained using Fletcher-based
#' dispersion estimate; an additional 'ID' row would appear if RanEff is TRUE
#' \item \code{reg.out.reml.scale}: regional testing output obtained using 
#' REML-based dispersion estimate;
#' \item \code{reg.out.gam}: regional testing output obtained using
#' (Fletcher-based) dispersion estimate from mgcv package;
#' \item \code{phi_fletcher}: Fletcher-based estimate of the (multiplicative)
#' dispersion parameter;
#' \item \code{phi_reml}: REML-based estimate of the (multiplicative)
#' dispersion parameter;
#' \item \code{phi_gam}: Estimated dispersion parameter reported by mgcv;
#' \item \code{SE.out}: a matrix of the estimated pointwise Standard Errors
#' (SE); number of rows are the number of unique CpG sites in the input data and
#' the number of columns equal to the total number of covariates fitted in the
#' model (the first one is the intercept);
#' \item \code{SE.out.REML.scale}: a matrix of the estimated pointwise Standard
#' Errors (SE); the SE calculated from the REML-based dispersion estimates
#' \item \code{uni.pos}: the genomic postions for each row of CpG sites in the
#' matrix \code{SE.out};
#' \item \code{Beta.out}: a matrix of the estimated covariate effects beta(t),
#' where t denotes the genomic positions;
#' \item \code{ncovs}: number of functional paramters in the model (including
#' the intercept);
#' \item \code{sigma00}: estimated variance for the random effect if RanEff is
#' TRUE; NA if RanEff is FALSE.
#' }
#' @author  Audrey Lemaçon
#' @examples
#' #------------------------------------------------------------#
#' data(RAdat)
#' RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ])
#' outs <- runSOMNiBUS(
#'   dat=RAdat.f, split = list(approach = "region", gap = 1e6), min.cpgs = 5,
#'   n.k = rep(5,3), p0 = 0.003, p1 = 0.9
#' )
#'
#' @export
runSOMNiBUS <- function(dat, split = list(approach = "region"),
                        min.cpgs = 50, max.cpgs = 2000,
                        n.k, p0 = 0.003, p1 = 0.9, Quasi = TRUE,
                        epsilon = 10^(-6), epsilon.lambda = 10^(-3), 
                        maxStep = 200, binom.link = "logit", method = "REML", 
                        covs = NULL, RanEff = TRUE,
                        reml.scale = FALSE, scale = -2, verbose = TRUE) {
  t0 <- Sys.time()
  msg <- paste("Processing methylation data.")
  if(verbose) Message(msg)
  
  dat <- data.frame(dat)
  if (is.factor(dat$Position)) {
    ## message('The Position in the data set should be numeric other than a
    ## factor')
    dat$Position <- as.numeric(as.character(dat$Position))
  }
  if (any(!c("Meth_Counts", "Total_Counts", 
             "Position", "ID") %in% colnames(dat))) {
    Error(paste("Please make sure object \"dat\" have",
                "columns named as \"Meth_Counts\",",
                "\"Total_Counts\", \"Position\"",
                "and \"ID\" "))
  }
  
  # filter out "bad data"
  bad_dat_idx <- which(dat$Total_Counts == 0)
  if(length(bad_dat_idx) > 0){
    dat <- dat[-bad_dat_idx,]
    msg <- paste("Remove",length(bad_dat_idx),
                 "rows with Total_Counts equal to 0.")
    if(verbose) Message(msg)
  }
  
  ## test if the split list contains the 'approach' exists
  if(!"approach" %in% names(split)){
    Warning(paste("Missing partitioning method. We will used",
                  "the default approach \"region\"."))
    split$approach <- "region"
  }
  
  regions <- switch(split$approach,
                    region = splitDataByRegion(dat = dat, 
                                               gap = split$gap,
                                               min.cpgs = min.cpgs,
                                               max.cpgs = max.cpgs,
                                               verbose = verbose),
                    density = splitDataByDensity(dat = dat, 
                                                 window.size = split$window.size, 
                                                 min.density = split$min.density, 
                                                 by = split$by,
                                                 gap = split$gap, 
                                                 min.cpgs = min.cpgs,
                                                 max.cpgs = max.cpgs,
                                                 verbose = verbose),
                    chromatin = splitDataByChromatin(dat = dat, 
                                                     chr = split$chr,
                                                     cell.line = split$cell.line,
                                                     states = split$states,
                                                     gap = split$gap,
                                                     min.cpgs = min.cpgs,
                                                     max.cpgs = max.cpgs,
                                                     verbose = verbose),
                    gene = splitDataByGene(dat = dat, 
                                           chr = split$chr,
                                           organism = split$organism,
                                           build = split$build,
                                           types = split$types,
                                           gap = split$gap,
                                           min.cpgs = min.cpgs,
                                           max.cpgs = max.cpgs,
                                           verbose = verbose),
                    granges = splitDataByGRanges(dat = dat, 
                                                 chr = split$chr,
                                                 annots = split$annots,
                                                 gap = split$gap,
                                                 min.cpgs = min.cpgs,
                                                 max.cpgs = max.cpgs,
                                                 verbose = verbose),
                    bed = splitDataByBed(dat = dat,
                                         chr = split$chr,
                                         bed = split$bed,
                                         gap = split$gap,
                                         min.cpgs = min.cpgs,
                                         max.cpgs = max.cpgs,
                                         verbose = verbose),
                    {
                      Warning(paste("Unknown partitioning",
                                    "method. We will used",
                                    "the default approach",
                                    "\"region\"."))
                      splitDataByRegion(dat = dat, min.cpgs = min.cpgs)
                    }
  )
  
  msg <- paste("Input partitioned into",length(regions), 
               "independent region(s)")
  if(verbose) Message(msg, step = NULL)
  
  results <- lapply(X = names(regions), function(n){
    
    region <- regions[[n]]
    
    msg <- paste("Starting analysis for",n, 
                 paste0("(",length(unique(region$Position))," unique CpGs)"))
    if(verbose) Message(msg, step = NULL)
    
    if(length(regions) > 1){
      n.k_dim1 <- max(1, round(length(unique(region$Position))/20))
      n.k_dim2 <- ncol(region) - 3
      n.k <- rep(n.k_dim1, n.k_dim2)
      msg <- paste0("n.k parameter set to rep(", n.k_dim1,",",n.k_dim2,")")
      if(verbose) Message(msg, step = NULL)
    }
    
    binomRegMethModel(data = region, n.k = n.k, p0 = p0, p1 = p1, 
                      Quasi = Quasi, epsilon = epsilon, 
                      epsilon.lambda = epsilon.lambda,
                      maxStep = maxStep, binom.link = binom.link, 
                      method = method, covs = covs, RanEff = RanEff, 
                      reml.scale = reml.scale, scale = scale, verbose = verbose)
  })
  
  names(results) <- names(regions)
  
  msg <- paste("Process completed in",
               format(Sys.time() - t0, digits = 2))
  if(verbose) Message(msg, step = "Finished")
  
  return(results)
}