R/binomRegMethModel.R

Defines functions testStat extractDesignMatrix phiFletcher fitGam binomRegMethModelInit binomRegMethModelChecks regResOneSmooth binomRegMethModelSummary estimateSE estimateVar estimateBeta estimateBZ fitEM binomRegMethModel

Documented in binomRegMethModel

#' @title A smoothed-EM algorithm to estimate covariate effects and test
#' regional association in Bisulfite Sequencing-derived methylation data
#' @description This function fits a (dispersion-adjusted) binomial
#' regression model to regional methylation data, and reports the
#' estimated smooth covariate effects and regional p-values for the
#' test of DMRs (differentially methylation regions). Over or under
#' dispersion across loci is accounted for in the model by the combination
#' of a multiplicative dispersion parameter (or scale parameter) and a
#' sample-specific random effect.
#' @description This method can deal with outcomes, i.e. the number of
#' methylated reads in a region, that are contaminated by known
#' false methylation calling rate (\code{p0}) and false non-methylation
#' calling rate (\code{1-p1}).
#' @description The covariate effects are assumed to smoothly vary across
#' genomic regions. In order to estimate them, the algorithm first
#' represents the functional parameters by a linear combination of a set of 
#' restricted cubic splines (with dimension \code{n.k}), and a smoothness 
#' penalization term which depends on the smoothing parameters \code{lambdas}
#' is also added to control smoothness. The estimation is performed by an
#' iterated EM algorithm. Each M step constitutes an outer Newton's iteration to
#' estimate smoothing parameters \code{lambdas} and an inner P-IRLS iteration to
#' estimate spline coefficients \code{alpha} for the covariate effects.
#' Currently, the computation in the M step depends the implementation of
#' \code{gam()} in package \code{mgcv}.
#' @param data a data frame with rows as individual CpGs appearing in all the
#' samples. The first 4 columns should contain the information of `Meth_Counts`
#' (methylated counts), `Total_Counts` (read depths), `Position` (Genomic
#' position for the CpG site) and `ID` (sample ID). The covariate information,
#' such as disease status or cell type composition, are listed in column 5 and
#' onwards.
#' @param n.k a vector of basis dimensions for the intercept and
#' individual covariates. \code{n.k} specifies an upper limit of the degrees of
#' each functional parameters. The length of n.k should equal to the number of 
#' covariates plus 1 (for the intercept)). We recommend basis dimensions n.k, 
#' approximately equal to the number of unique CpGs in the region divided by 20.
#' This parameter will be computed automatically, when several regions are 
#' generated by the partitioning function.
#' @param p0 the probability of observing a methylated read when the underlying
#' true status is unmethylated. \code{p0} is the rate of false methylation 
#' calls, i.e. false positive rate.
#' @param p1 the probability of observing a methylated read when the underlying 
#' true status is methylated. \code{1-p1} is the rate of false non-methylation 
#' calls, i.e. false negative rate.
#' @param covs a vector of covariate names. The covariates with names in 
#' \code{covs} will be included in the model and their covariate effects will be
#' estimated. The default is to fit all covariates in \code{dat}
#' @param Quasi whether a Quasi-likelihood estimation approach will be used; 
#' in other words, whether a multiplicative dispersion is added in the model 
#' or not.
#' @param epsilon numeric; stopping criterion for the closeness of estimates of
#' spline coefficients from two consecutive iterations.
#' @param epsilon.lambda numeric; stopping criterion for the closeness of 
#' estimates of smoothing parameter \code{lambda} from two consecutive 
#' iterations.
#' @param maxStep the algorithm will step if the iteration steps exceed 
#' \code{maxStep}.
#' @param binom.link the link function used in the binomial regression model; 
#' the default is the logit link.
#' @param method the method used to estimate the smoothing
#' parameters. The default is the 'REML' method which is generally better than
#' prediction based criterion \code{GCV.cp}.
#' @param RanEff whether sample-level random effects are added or not
#' @param reml.scale whether a REML-based scale (dispersion) estimator is used.
#' The default is Fletcher-based estimator.
#' @param scale negative values mean scale parameter should be estimated; if a 
#' positive value is provided, a fixed scale will be used.
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return This function return a \code{list} including objects:
#' \itemize{
#' \item \code{est}: estimates of the spline basis coefficients \code{alpha}
#' \item \code{lambda}: estimates of the smoothing parameters for each 
#' functional parameters
#' \item \code{est.pi}: predicted methylation levels for each row in the input
#' \code{data}
#' \item \code{ite.points}: estimates of \code{est}, \code{lambda} at each EM
#' iteration
#' \item \code{cov1}: estimated variance-covariance matrix of the basis
#' coefficients \code{alphas}
#' \item \code{reg.out}: regional testing output obtained using Fletcher-based
#' dispersion estimate; an additional 'ID' row would appear if RanEff is TRUE
#' \item \code{reg.out.reml.scale}:regional testing output obtained using 
#' REML-based dispersion estimate;
#' \item \code{reg.out.gam}:regional testing output obtained using
#' (Fletcher-based) dispersion estimate from mgcv package;
#' \item \code{phi_fletcher}: Fletcher-based estimate of the (multiplicative)
#' dispersion parameter;
#' \item \code{phi_reml}: REML-based estimate of the (multiplicative)
#' dispersion parameter;
#' \item \code{phi_gam}: Estimated dispersion parameter reported by mgcv;
#' \item \code{SE.out}: a matrix of the estimated pointwise Standard Errors
#' (SE); number of rows are the number of unique CpG sites in the input data and
#' the number of columns equal to the total number of covariates fitted in the
#' model (the first one is the intercept);
#' \item \code{SE.out.REML.scale}: a matrix of the estimated pointwise Standard
#' Errors (SE); the SE calculated from the REML-based dispersion estimates
#' \item \code{uni.pos}: the genomic postions for each row of CpG sites in the
#' matrix \code{SE.out};
#' \item \code{Beta.out}: a matrix of the estimated covariate effects beta(t),
#' where t denotes the genomic positions;
#' \item \code{ncovs}: number of functional paramters in the model (including
#' the intercept);
#' \item \code{sigma00}: estimated variance for the random effect if RanEff is
#' TRUE; NA if RanEff is FALSE.
#' }
#' @author  Kaiqiong Zhao
#' @seealso  \link[mgcv]{gam}
#' @examples
#' #------------------------------------------------------------#
#' data(RAdat)
#' RAdat.f <- na.omit(RAdat[RAdat$Total_Counts != 0, ])
#' out <- binomRegMethModel(
#'   data=RAdat.f, n.k=rep(5,3), p0=0.003307034, p1=0.9,
#'   epsilon=10^(-6), epsilon.lambda=10^(-3), maxStep=200
#' )
#' @importFrom mgcv gam
#' @importFrom mgcv predict.gam
#' @importFrom mgcv model.matrix.gam
#' @importFrom mgcv s
#' @importFrom Matrix bdiag
#' @importFrom stats as.formula binomial pchisq rbinom rnorm quasibinomial 
#' @importFrom stats residuals predict model.matrix runif
#'
#' @export
binomRegMethModel <- function(data, n.k, p0 = 0.003, p1 = 0.9, Quasi = TRUE,
                              epsilon = 10^(-6), epsilon.lambda = 10^(-3), 
                              maxStep = 200, binom.link = "logit", 
                              method = "REML", covs = NULL, RanEff = TRUE, 
                              reml.scale = FALSE, scale = -2, verbose = TRUE) {

  t0 <- Sys.time()
  initOut <- binomRegMethModelInit(data=data, covs=covs, n.k=n.k, 
                                   verbose = verbose)
  Z <- initOut$Z
  fitGamOut <- fitGam(data=initOut$data, Quasi=Quasi, binom.link=binom.link,
                      method=method, RanEff=RanEff, scale=scale, Z=Z, n.k=n.k,
                      verbose = verbose)
  phi_fletcher <- phiFletcher(data=initOut$data, Quasi=Quasi, 
                              reml.scale=reml.scale, scale=scale, 
                              gam.int=fitGamOut$gam.int, verbose=verbose)
  fitEMOut <- fitEM(p0=p0, p1=p1, fitGamOut=fitGamOut, n.k=n.k, 
                    binom.link=binom.link, method=method, Z=Z, Quasi=Quasi, 
                    scale=scale, reml.scale=reml.scale, 
                    phi_fletcher=phi_fletcher, epsilon=epsilon, 
                    maxStep=maxStep, data=initOut$data,
                    verbose = verbose)
  out <- fitEMOut$out
  Est.points <- fitEMOut$Est.points
  phi_fletcher <- out$phi_fletcher
  my.design.matrix <- mgcv::model.matrix.gam(out$GamObj)
  lengthUniqueDataID <- length(unique(initOut$data$ID))
  phi_reml <- out$GamObj$reml.scale
  estimateBZOut <- estimateBZ(Posit=initOut$data$Posit, 
                              my.design.matrix=my.design.matrix, 
                              ncolsZ=ncol(Z), n.k=n.k, verbose = verbose)
  Beta.out <- estimateBeta(BZ=estimateBZOut$BZ, BZ.beta=estimateBZOut$BZ.beta,
                           n.k=n.k, Z=Z, out=out, verbose = verbose)
  estimateVarOut <- estimateVar(out=out, phi_fletcher=phi_fletcher, 
                                data=initOut$data, 
                                my.design.matrix=my.design.matrix, Z=Z, p0=p0,
                                p1=p1, RanEff=RanEff, 
                                lengthUniqueDataID=lengthUniqueDataID,n.k=n.k,
                                verbose = verbose)
  estimateSEOut <- estimateSE(estimateBZOut=estimateBZOut, Z=Z, 
                              estimateVarOut=estimateVarOut, 
                              phi_fletcher=phi_fletcher, phi_reml=phi_reml,
                              verbose = verbose)
  X_d <- extractDesignMatrix(GamObj=out$GamObj, verbose = verbose)
  ## and p value calculation tr(2A - A^2)
  edf1.out <- out$edf1
  ## Effective degree of freedom: edf --trace of the hat matrix
  edf.out <- out$edf
  resi_df <- nrow(initOut$data) - sum(edf.out)
  s.table <- binomRegMethModelSummary(GamObj=out$GamObj, 
                                      var.cov.alpha=estimateVarOut$var.cov.alpha,
                                      new.par=out$par, edf.out=edf.out, 
                                      edf1.out=edf1.out, X_d=X_d, 
                                      resi_df=resi_df, Quasi=Quasi, 
                                      scale=scale, RanEff=RanEff, Z=Z,
                                      verbose = verbose)
  s.table.REML.scale <- binomRegMethModelSummary(GamObj=out$GamObj, 
                                                 var.cov.alpha=estimateVarOut$var.cov.alpha/phi_fletcher*phi_reml, 
                                                 new.par=out$par, 
                                                 edf.out=edf.out, 
                                                 edf1.out=edf1.out, 
                                                 X_d=X_d, resi_df=resi_df, 
                                                 Quasi=Quasi, scale=scale,
                                                 RanEff=RanEff, Z=Z,
                                                 verbose = verbose)
  if (RanEff) {
    sigma00 <- out$GamObj$reml.scale/out$GamObj$sp["s(ID)"]
  } else {
    sigma00 <- NA
  }
  
  msg <- paste("Process completed in",
               format(Sys.time() - t0, digits = 2))
  if(verbose) Message(msg, step = "Finished")
  
  return(out <- list(est = out$par, lambda = out$lambda, est.pi = out$pi.ij,
                     Beta.out = Beta.out, phi_fletcher = phi_fletcher, 
                     phi_reml = phi_reml, phi_gam = out$GamObj$scale, 
                     reg.out = s.table,reg.out.reml.scale =s.table.REML.scale,
                     cov1 = estimateVarOut$var.cov.alpha, 
                     reg.out.gam = summary(out$GamObj)$s.table,
                     SE.out = estimateSEOut$SE.out, 
                     SE.out.REML.scale = estimateSEOut$SE.out.REML.scale,
                     uni.pos = estimateBZOut$uni.pos, ncovs = ncol(Z) + 1, 
                     ite.points = Est.points, sigma00 = sigma00))
}
#' @title Run EM algorithm to obtain the estimate of alpha
#'
#' @description Run EM algorithm to obtain the estimate of alpha
#' @param p0 the probability of observing a methylated read when
#' the underlying true
#' status is unmethylated. \code{p0} is the rate of false
#' methylation calls, i.e.
#' false positive rate.
#' @param p1 the probability of observing a methylated read when
#' the underlying true
#' status is methylated. \code{1-p1} is the rate of false
#' non-methylation calls, i.e.
#' false negative rate.
#' @param fitGamOut an output from fitGam
#' @param n.k number of knots for all covariates (including intercept)
#' @param binom.link link for binomial GLM
#' @param method method to estimate lambda
#' @param Z covariate matrix
#' @param Quasi whether quasibinomial is used
#' @param scale negative means unknown scale; positive means fixed scale
#' @param reml.scale whether a REML-based scale estimate is used
#' @param phi_fletcher Fletcher-based scale estimate
#' @param epsilon stopping criteria
#' @param maxStep maximum iteration steps
#' @param data a data frame with rows as individual CpGs
#' appearing in all the samples.
#' The first 4 columns should be named
#' `Y` (methylated counts),
#' `X` (read depths), `Posit` (Genomic
#' position for the CpG site) and `ID`
#' (sample ID). The covariate information, such as
#' disease status or cell type composition
#' are listed in column 5 and onwards.
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return This function return a \code{list} including objects:
#' \itemize{
#' \item \code{out} a list
#' \item \code{Est.points} a matrix storing the estimate in each step
#' }
#' 
#' @author Kaiqiong Zhao
#' @noRd
fitEM <- function(p0, p1, fitGamOut, n.k, binom.link, method, Z, Quasi,
                  scale, reml.scale, phi_fletcher, epsilon, 
                  maxStep, data, verbose = TRUE) {
  
  t0 <- Sys.time()
  msg <- paste("Running EM algorithm to obtain the",
              "estimate of alpha")
  if(verbose) Message(msg)
  
  if (p0 > 0 | p1 < 1) {
    old.pi.ij <-fitGamOut$gam.int$fitted.values
    mean_part <- mean(((old.pi.ij-p0)*(p1-old.pi.ij))/(old.pi.ij*(1-old.pi.ij)),
                      na.rm = TRUE)
    phi_fletcher <- (phi_fletcher-1)/mean_part+1
    out <- binomRegMethModelUpdate(data=data, 
                                   pi.ij=fitGamOut$gam.int$fitted.values,
                                   p0=p0, p1=p1, n.k=n.k, 
                                   binom.link=binom.link, method=method,
                                   Z=Z, my.covar.fm=fitGamOut$my.covar.fm, 
                                   Quasi=Quasi, scale=phi_fletcher, 
                                   reml.scale=reml.scale, verbose = verbose)
    Est.points <- rbind(c(fitGamOut$gam.int$coefficients, 
                          fitGamOut$gam.int$sp, phi_fletcher), 
                        c(out$par, out$lambda, out$phi_fletcher))
    iter <- 1
    while (sqrt(sum((out$pi.ij - old.pi.ij)^2)) > epsilon & iter <
           maxStep) {
      if (verbose) {
        detail_message <- paste0("iteration ", iter)
        Message(detail_message, step = NULL)
      }
      iter <- iter + 1
      old.pi.ij <- out$pi.ij
      out <- binomRegMethModelUpdate(data=data, pi.ij=old.pi.ij, p0=p0, 
                                     p1=p1, n.k=n.k, 
                                     binom.link=binom.link, method=method,
                                     Z=Z, 
                                     my.covar.fm=fitGamOut$my.covar.fm, 
                                     Quasi=Quasi, scale=phi_fletcher, 
                                     reml.scale=reml.scale, verbose = FALSE)
      Est.points <- rbind(Est.points, c(out$par, out$lambda, 
                                        out$phi_fletcher))
    }
  } else {
    out <- list(pi.ij = fitGamOut$gam.int$fitted.values, 
                par = fitGamOut$gam.int$coefficients,
                lambda = fitGamOut$gam.int$sp, 
                edf1 = fitGamOut$gam.int$edf1,
                pearson_res = residuals(fitGamOut$gam.int,
                                        type = "pearson"),
                deviance_res = residuals(fitGamOut$gam.int,
                                         type = "deviance"),
                edf = fitGamOut$gam.int$edf, phi_fletcher = phi_fletcher, 
                GamObj = fitGamOut$gam.int)
    Est.points <- c(fitGamOut$gam.int$coefficients, fitGamOut$gam.int$sp,
                    phi_fletcher)
  }
  
  msg <- paste("Process completed in",
               format(Sys.time() - t0, digits = 2))
  if(verbose) Message(msg, step = "Finished")
  
  return(list(out = out, Est.points = Est.points))
}

#' @title Get the basis matrix for beta0(t) and beta(t)
#'
#' @description This function aims to extract the basis matrix BZ,
#' and BZ.beta from the expansion beta0(t) = BZ * alpha0
#' and beta(t) = BZ.beta * alpha.
#' @param Posit  the column 'Posit' from the input data frame
#' \code{data}
#' @param my.design.matrix the design matrix for a \code{fitGamOut}
#' with data as input
#' and Z as covariates.
#' @param ncolsZ number of columns of covariate matrix Z
#' @param n.k number of knots for all covariates (including intercept)
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return This function return a \code{list} including objects:
#' \itemize{
#' \item \code{uni.pos} the unique genomic positions present in
#' the data
#' \item \code{BZ} basis matrix for the intercept beta0(t); a
#' matrix with
#' \code{length(uni.pos)} rows and \code{length(alpha0)} columns
#' \item \code{BZ.beta} a list of length \code{ncolsZ}; each
#' corresponds to the basis
#' matrix for beta_p(t), p = 1, 2, .. \code{ncolsZ}
#' }
#' @author  Kaiqiong Zhao, Simon Laurin-Lemay
#' @import mgcv
#' @noRd
estimateBZ <- function(Posit, my.design.matrix, ncolsZ, n.k, verbose = TRUE) {
  
  t0 <- Sys.time()
  if(verbose) Message("Get the basis matrix for beta0(t) and beta(t)")
  
  uni.pos <- unique(Posit)
  uni.id <- match(uni.pos, Posit)
  BZ <- my.design.matrix[uni.id, seq_len(n.k[1])]
  BZ.beta <- lapply(seq_len(ncolsZ), function(i) {
    mgcv::smooth.construct(mgcv::s(Posit, k = n.k[i + 1], fx = FALSE,
                                   bs = "cr"), 
                           data = data.frame(Posit = Posit[uni.id]), 
                           knots = NULL)$X
  })
  
  msg <- paste("Process completed in",
               format(Sys.time() - t0, digits = 2))
  if(verbose) Message(msg, step = "Finished")
  
  return(out <- list(uni.pos = uni.pos, BZ = BZ, BZ.beta = BZ.beta))
}

#' @title estimate of beta(t)
#'
#' @description Given a final GAM output, extract the estimates of
#' beta(t) = BZ * alpha, where t is the unique genomic positions
#' present in the input data; only extract the fixed effect estimates
#' @param BZ basis matrix for the intercept beta0(t); a matrix with
#' \code{length(uni.pos)} rows and \code{length(alpha0)} columns
#' @param BZ.beta a list of length \code{ncolsZ}; each corresponds
#' to the basis matrix
#' for beta_p(t)
#' @param n.k a vector of basis dimensions for the intercept and
#' individual covariates. \code{n.k} specifies an upper limit of the degrees
#' of each functional parameters.
#' @param Z a covariate matrix
#' @param out an object from \code{binomRegMethModelUpdate}
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return \code{Beta.out}: a matrix of the estimated covariate
#' effects beta(t),
#' here t denots the unique genomic positions for intercept
#' and all covariates. Beta.out = cbind(beta.0(t), beta.1(t), ...
#' beta.P(t));
#' \code{length(uni.pos)} rows and \code{ncol(Z)+1} columns
#' 
#' @author Kaiqiong Zhao, Simon Laurin-Lemay
#' @noRd
estimateBeta <- function(BZ, BZ.beta, n.k, Z, out, verbose = TRUE) {
  
  t0 <- Sys.time()
  msg <- paste("Given a final GAM output, extract",
               "the estimates of beta(t) = BZ * alpha")
  if(verbose) Message(msg)
  
  ## ------ 3: estimate of beta(t) --- #
  cum_s <- cumsum(n.k)
  alpha.sep <- lapply(seq_len(ncol(Z)), function(i) {
    out$par[(cum_s[i] + 1):cum_s[i + 1]]
  })
  alpha.0 <- out$par[seq_len(n.k[1])]
  
  Beta.out <- cbind(BZ %*% alpha.0, vapply(seq_len(ncol(Z)), function(i) {
    BZ.beta[[i]] %*% alpha.sep[[i]]
  }, FUN.VALUE = rep(1, nrow(BZ))))
  
  colnames(Beta.out) <- c("Intercept", colnames(Z))
  
  msg <- paste("Process completed in",
               format(Sys.time() - t0, digits = 2))
  if(verbose) Message(msg, step = "Finished")
  
  return(Beta.out)
}
#' @title Estimate the variance covariance matrix
#'
#' @description Estimate the variance covariance matrix
#' @param out an output from the update function
#' @param phi_fletcher Fletcher-based scale estimate
#' @param data a data frame with rows as individual CpGs
#' appearing in all the samples.
#' The first 4 columns should be named
#' `Y` (methylated counts),
#' `X` (read depths), `Posit` (Genomic
#' position for the CpG site) and `ID`
#' (sample ID). The covariate information, such as
#' disease status or cell type composition
#' are listed in column 5 and onwards.
#' @param RanEff whether a RE is considered
#' @param n.k number of knots for all covariates (including intercept)
#' @param lengthUniqueDataID number of samples in the data
#' @param p0 the probability of observing a methylated read when
#' the underlying true
#' status is unmethylated. \code{p0} is the rate of false
#' methylation calls, i.e.
#' false positive rate.
#' @param p1 the probability of observing a methylated read when
#' the underlying true
#' status is methylated. \code{1-p1} is the rate of false
#' non-methylation calls, i.e.
#' false negative rate.
#' @param Z covariate matrix
#' @param my.design.matrix design matrix for the all data
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return This function return a \code{list} including objects:
#' \itemize{
#' \item \code{var.cov.alpha} var of alpha
#' \item \code{var.alpha.0} var of alpha0
#' \item \code{var.alpha.sep} var of alpha_p, p = 1, 2, P
#' }
#' 
#' @author Kaiqiong Zhao
#' @noRd
estimateVar <- function(out, phi_fletcher, data, my.design.matrix,
                        Z, p0, p1, RanEff, lengthUniqueDataID, n.k, 
                        verbose = TRUE) {
  
  t0 <- Sys.time()
  if(verbose) Message("Estimate the variance covariance matrix")
  
  cum_s <- cumsum(n.k)
  w_ij <- out$pi.ij * (1 - out$pi.ij)/phi_fletcher
  H <- hessianComp(w_ij=w_ij, new.par=out$par, new.lambda=out$lambda,
                   X=data$X, Y=data$Y, my.design.matrix=my.design.matrix,
                   gam_smoothMat=out$GamObj$smooth, Z=Z, pred.pi=out$pi.ij,
                   p0=p0, p1=p1, disp_est=phi_fletcher, RanEff=RanEff, 
                   N=lengthUniqueDataID, verbose = verbose)
  
  var.cov.alpha <- solve(-H)
  var.alpha.0 <- var.cov.alpha[seq_len(n.k[1]), seq_len(n.k[1])]
  var.alpha.sep <- lapply(seq_len(ncol(Z)), function(i) {
    var.cov.alpha[(cum_s[i] + 1):cum_s[i + 1], (cum_s[i] + 1):cum_s[i+1]]
  })
  
  msg <- paste("Process completed in",
               format(Sys.time() - t0, digits = 2))
  if(verbose) Message(msg, step = "Finished")
  
  return(list(var.alpha.0 = var.alpha.0, var.alpha.sep = var.alpha.sep,
              var.cov.alpha = var.cov.alpha))
}
#' @title estimate SE of beta(t) for each covariates
#'
#' @description estimate SE of beta(t) for each covariates
#' @param estimateBZOut a final Gam Object
#' @param Z the ith smooth/covariate to be evaluated
#' @param estimateVarOut output from estimateVar
#' @param phi_fletcher Fletcher-based dispersion estimate
#' @param phi_reml REML-based dispersion estimate
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return This function return a \code{list} including objects:
#' \itemize{
#' \item \code{SE.out} SE based on phi_fletcher
#' \item \code{SE.out.REML.scale} SE based on phi_reml
#' }
#' 
#' @author Kaiqiong Zhao
#' @noRd
estimateSE <- function(estimateBZOut, Z, estimateVarOut, phi_fletcher,
                       phi_reml, verbose = TRUE) {
  
  t0 <- Sys.time()
  if(verbose) Message("Estimate SE of beta(t) for each covariates")
  
  var.alpha.0 <- estimateVarOut$var.alpha.0
  var.alpha.sep <- estimateVarOut$var.alpha.sep
  SE.out <- cbind(sqrt(pmax(0, 
                            rowSums((estimateBZOut$BZ %*% var.alpha.0) *
                                      estimateBZOut$BZ))), 
                  vapply(seq_len(ncol(Z)), 
                         function(i) {
                           sqrt(pmax(0, 
                                     rowSums((estimateBZOut$BZ.beta[[i]] %*% var.alpha.sep[[i]]) * estimateBZOut$BZ.beta[[i]])))
                         }, 
                         FUN.VALUE = rep(1, length(estimateBZOut$uni.pos))))
  
  rownames(SE.out) <- estimateBZOut$uni.pos
  colnames(SE.out) <- c("Intercept", colnames(Z))
  SE.out.REML.scale <- SE.out/sqrt(phi_fletcher) * sqrt(phi_reml)
  
  msg <- paste("Process completed in",
               format(Sys.time() - t0, digits = 2))
  if(verbose) Message(msg, step = "Finished")
  
  return(list(SE.out = SE.out, SE.out.REML.scale = SE.out.REML.scale))
}
#' @title binomRegMethModelSummary Extract the regional testing
#' results from a GamObj
#'
#' @description Extract the regional testing results from a
#' GamObj
#' @param GamObj an fit from mgcv::gam
#' @param var.cov.alpha Estimated variance-covariance matrix
#' for coefficients alpha
#' @param new.par Estimated coefficients alpha
#' @param edf.out Effective degrees of freedom for each alpha,
#' calculated from the
#' trace of the hat matrix
#' @param edf1.out Effective degrees of freedom 2, calculated
#' from tr(2A - A^2); good
#' for chisquare test
#' @param X_d R from the QR decomposition for the design matrix
#' @param resi_df Residual degrees of freedom; nrow(data) -
#' sum(edf.out)
#' @param Quasi Whether a multiplicative dispersion parameter
#' is added or not;
#' quasibinomial or binomial
#' @param scale nagative values mean scale paramter should be
#' estimated;
#' if a positive value is provided, a fixed scale will be used.
#' @param RanEff Whether a subject random effect is added or not
#' @param Z covariate matrix; \code{Z = data[,-c(1:4)]}
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return s.table a table for each covariate
#' @seealso  \link[mgcv]{summary.gam}
#' @import mgcv
#' @noRd
binomRegMethModelSummary <- function(GamObj, var.cov.alpha, new.par, edf.out,
                                     edf1.out, X_d, resi_df, Quasi, scale, 
                                     RanEff, Z, verbose = TRUE) {
  t0 <- Sys.time()
  msg <- paste("Extract the regional testing results",
               "from a GamObj")
  if(verbose) Message(msg)
  
  ii <- 0
  m <- length(GamObj$smooth)
  df <- edf1 <- edf <- s.pv <- chi.sq <- array(0, m)
  
  msg <- paste("Get the regional summary results for an",
               "individual covariate effect")
  if(verbose) Message(msg)
  
  for (i in seq_len(m)) {
    outOneSmooth <- regResOneSmooth(GamObj=GamObj, i=i, RanEff=RanEff, 
                                    Quasi=Quasi, resi_df=resi_df,
                                    var.cov.alpha=var.cov.alpha, 
                                    new.par=new.par, edf.out=edf.out, 
                                    edf1.out=edf1.out, X_d=X_d)
    res <- outOneSmooth$res
    edf1i <- outOneSmooth$edf1i
    edfi <- outOneSmooth$edfi
    if (!is.null(res)) {
      ii <- ii + 1
      df[ii] <- res$rank
      chi.sq[ii] <- res$stat
      s.pv[ii] <- res$pval
      edf1[ii] <- edf1i
      edf[ii] <- edfi
      names(chi.sq)[ii] <- GamObj$smooth[[i]]$label
    }
    if (ii == 0) {
      df <- edf1 <- edf <- s.pv <- chi.sq <- array(0, 0)
    } else {
      df <- df[seq_len(ii)]
      chi.sq <- chi.sq[seq_len(ii)]
      edf1 <- edf1[seq_len(ii)]
      edf <- edf[seq_len(ii)]
      s.pv <- s.pv[seq_len(ii)]
    }
    if (!Quasi || scale >= 0) {
      s.table <- cbind(edf, df, chi.sq, s.pv)
      dimnames(s.table) <- list(names(chi.sq), c("edf", "Ref.df",
                                                 "Chi.sq", "p-value"))
    } else {
      s.table <- cbind(edf, df, chi.sq/df, s.pv)
      dimnames(s.table) <- list(names(chi.sq), c("edf", "Ref.df",
                                                 "F", "p-value"))
    }
  }
  s.table <- s.table[, -1]
  if (RanEff) {
    rownames(s.table) <- c("Intercept", colnames(Z), "ID")
  } else {
    rownames(s.table) <- c("Intercept", colnames(Z))
  }
  colnames(s.table)[1] <- "EDF"
  
  msg <- paste("Process completed in",
               format(Sys.time() - t0, digits = 2))
  if(verbose) Message(msg, step = "Finished")
  
  return(s.table)
}

#' @title Obtain res for one smooth
#'
#' @description Get the regional summary results for an individual
#' covariate effect
#' @param GamObj a final Gam Object
#' @param i the ith smooth/covariate to be evaluated
#' @param RanEff whether a subject-level RE is added or not
#' @param Quasi whether a quasibinomial dist. is used
#' @param resi_df residual degrees of freedom
#' @param X_d the design matrix (R) for all the smooth
#' @param var.cov.alpha the variance covariance matrix for all the smooth
#' @param new.par the coefficients all the smooth
#' @param edf.out edf1 for all the smooth
#' @param edf1.out edf1 for all the smooth
#' @return This function return a \code{list} including objects:
#' \itemize{
#' \item \code{res} output from mgcv:::testStat or
#' mgcv:::reTest
#' \item \code{edfi} edf for the ith smooth
#' \item \code{edf1i} edf1 for the ith smooth
#' }
#' @author Kaiqiong Zhao
#' @importFrom utils getFromNamespace
#' @noRd
regResOneSmooth <- function(GamObj, i, RanEff, Quasi, resi_df, var.cov.alpha,
                            new.par, edf.out, edf1.out, X_d) {
  
  # import unexposed mgcv function
  # using utils::getFromNamespace to avoid being flagged during R CMD check
  mgcv_reTest <- utils::getFromNamespace("reTest", "mgcv")
  #mgcv_testStat <- utils::getFromNamespace("testStat", "mgcv")
    
  start <- GamObj$smooth[[i]]$first.para
  stop <- GamObj$smooth[[i]]$last.para
  V <- var.cov.alpha[start:stop, start:stop, drop = FALSE]  ## Bayesian
  p <- new.par[start:stop]  ## params for smooth
  edfi <- sum(edf.out[start:stop])  ## edf for this smooth
  edf1i <- sum(edf1.out[start:stop])
  Xt <- X_d[, start:stop, drop = FALSE]
  fx <- if (inherits(GamObj$smooth[[i]], 
                     "tensor.smooth") && !is.null(GamObj$smooth[[i]]$fx)) {
    all(GamObj$smooth[[i]]$fx)
  } else {
    GamObj$smooth[[i]]$fixed
  }
  if (!fx && GamObj$smooth[[i]]$null.space.dim == 0 && !is.null(GamObj$R)) {
    ## random effect or fully penalized term
    res <- if (RanEff) {
      mgcv_reTest(GamObj, i)
    } else {
      NULL
    }
    ## Test the mth smooth for equality to zero (m is not the RE term) and
    ## accounting for all random effects in model Inverted Nychka interval
    ## statistics
  } else {
    if (Quasi) {
      rdf <- resi_df
    } else {
      rdf <- -1
    }
    res <- testStat(p, Xt, V, min(ncol(Xt), edf1i), type = 0,
                           res.df = rdf)
  }
  return(list(res = res, edfi = edfi, edf1i = edf1i))
}

#' @title Some checks for binomRegMethModelChecks
#'
#' @description Check if inputs fit one anoter according to
#' there shapes
#' @param data a data frame with rows as individual CpGs
#' appearing in all the samples.
#' The first 4 columns should be named
#' `Y` (methylated counts),
#' `X` (read depths), `Posit` (Genomic
#' position for the CpG site) and `ID`
#' (sample ID). The covariate information, such as
#' disease status or cell type composition
#' are listed in column 5 and onwards.
#' @param Z matrix for the covariate information
#' @param n.k a vector of basis dimensions for the
#' intercept and individual covariates.
#' \code{n.k} specifies an upper limit of the degrees
#' of each functional parameters.
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @author Simon Laurin-Lemay, Kaiqiong Zhao
#' @noRd
binomRegMethModelChecks <- function(posit, X, Z, n.k, verbose = TRUE) {
  
  t0 <- Sys.time()
  if(verbose) Message("Check inputs consistency")
  
  if(ncol(Z) == 0) {
    e_msg <- paste("The input data should contain at least one covariate")
    Error(e_msg)
  }
  
  if (length(n.k) != (ncol(Z) + 1)) {
    e_msg <- paste("The length of n.k should equal to the number of",
                   "covariates plus 1 (for the intercept)")
    Error(e_msg)
  }
  
  if( max(n.k[-1])> round(length(unique(posit))/20)){
    msg <- paste("We recommend basis dimentions n.k",
                 "approximately equal to the number of",
                 "unique CpGs in the region",
                 "divided by 20")
    if(verbose) Message(msg)
  }
  if (any(X == 0)) {
    e_msg <- paste("The rows with Total_Counts equal to 0",
                   "should be deleted beforehand")
    Error(e_msg)
  }
  if (any(is.na(Z))) {
    Error("The covariate information should not have missing values")
  }
  if (any(!is.numeric(Z))) {
    e_msg <- paste("Please transform the covariate information into",
                   "numeric values, eg. use dummy variables for the",
                   "categorical covariates")
    
    msg <- paste("Process completed in",
                 format(Sys.time() - t0, digits = 2))
    if(verbose) Message(msg, step = "Finished")
    
    Error(e_msg)
  }
}

#' @title Init for binomRegMethModel
#'
#' @description Initialize data
#' @param data a data frame with rows as individual CpGs
#' appearing in all the samples.
#' The first 4 columns should contain the information of
#' `Meth_Counts` (methylated counts),
#' `Total_Counts` (read depths), `Position` (Genomic position
#' for the CpG site) and `ID`
#' (sample ID). The covariate information, such as disease
#' status or cell type composition
#' are listed in column 5 and onwards.
#' @param covs  vector of covariate names. The covariates
#' with names in \code{covs} will
#' be included in the model and their covariate
#' effects will be estimated. The default is to fit all
#' covariates in \code{data}
#' @param n.k a vector of basis dimensions for the intercept
#' and individual covariates.
#' \code{n.k} specifies an upper limit of the degrees of each
#' functional parameters.
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return This function return a \code{list} including objects:
#' \itemize{
#' \item \code{data}: the first four columns are 'Y', 'X',
#' 'Posit' and 'ID'; and the rest
#' are covariate values
#' \item \code{Z}: the matrix obtained by removing the first
#' 4 columns of data
#' }
#' @author Kaiqiong Zhao, Simon Laurin-Lemay
#' @noRd
binomRegMethModelInit <- function(data, covs, n.k, verbose = TRUE) {
  
  t0 <- Sys.time()
  if(verbose) Message("Initialize data for binomRegMethModel")
  
  data <- data.frame(data)
  if (is.factor(data$Position)) {
    ## message('The Position in the data set should be numeric other than a
    ## factor')
    data$Position <- as.numeric(as.character(data$Position))
  }
  if (any(!c("Meth_Counts","Total_Counts",
             "Position","ID") %in% colnames(data))) {
    e_msg <- paste0("Please make sure object \"dat\" has columns named ",
                    "as \"Meth_Counts\", \"Total_Counts\", \"Position\" ",
                    "and \"ID\" ") 
    Error(e_msg)
  }
  data$ID <- as.factor(data$ID)
  colnames(data)[match(c("Meth_Counts", "Total_Counts", "Position"),
                       colnames(data))] <- c("Y", "X", "Posit")
  if (is.null(covs)) {
    # Z: covariate matrix
    Z <- as.matrix(data[, -seq_len(4)], ncol = ncol(data) - 4)
    colnames(Z) <- colnames(data)[-seq_len(4)]
    binomRegMethModelChecks(posit = data$Posit,X = data$X, Z = Z, n.k = n.k,
                            verbose = verbose)
    msg <- paste("Process completed in",
                 format(Sys.time() - t0, digits = 2))
    if(verbose) Message(msg, step = "Finished")
    return(out <- list(data = data, Z = Z))
  } else {
    id <- match(covs, colnames(data))
    if (any(is.na(id))) {
      stop_message <- paste(covs[is.na(id)], 
                            "is not found in the input data frame")
      Error(stop_message)
    } else {
      Z <- as.matrix(data[, id], ncol = length(id))
      colnames(Z) <- covs
      binomRegMethModelChecks(posit = data$Posit, X = data$X, Z = Z, 
                              n.k = n.k, verbose = verbose)
      msg <- paste("Process completed in",
                   format(Sys.time() - t0, digits = 2))
      if(verbose) Message(msg, step = "Finished")
      return(out <- list(data = data, Z = Z))
    }
  }
}

#' @title Fit gam for the initial value
#'
#' @description Fit gam for the initial value
#' @param data a data frame with rows as individual CpGs
#' appearing in all the samples.
#' The first 4 columns should be named
#' `Y` (methylated counts),
#' `X` (read depths), `Posit` (Genomic
#' position for the CpG site) and `ID`
#' (sample ID). The covariate information, such as
#' disease status or cell type composition
#' are listed in column 5 and onwards.
#' @param Quasi whether a Quasi-likelihood estimation
#' approach will be used
#' @param binom.link the link function used in the
#' binomial regression model; the
#' default is the logit link
#' @param method the method used to estimate the
#' smoothing parameters. The default
#' is the 'REML' method which is generally better than
#' prediction based criterion
#' \code{GCV.cp}
#' @param RanEff whether sample-level random effects
#' are added or not
#' @param scale nagative values mean scale paramter
#' should be estimated; if a positive
#' value is provided, a fixed scale will be used.
#' @param Z covariate matrix
#' @param n.k  a vector of basis dimensions for the
#' intercept and individual covariates.
#' \code{n.k} specifies an upper limit of the
#' degrees of each functional parameters
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return This function return a \code{list}
#' including objects:
#' \itemize{
#' \item \code{gam.int}: a gam object from mgcv::gam
#' \item \code{my.covar.fm}: the formula used to fit the gam
#' }
#' @author Kaiqiong Zhao, Simon Laurin-Lemay
#' @noRd
fitGam <- function(data, Quasi, binom.link, method, RanEff, scale, Z, n.k,
                   verbose  = TRUE) {
  
  t0 <- Sys.time()
  if(verbose) Message("Fit gam for the initial value")
  
  ## The smoothing formula corresponding to the Z
  formula.z.part <- vapply(seq_len(ncol(Z)), function(i) {
    paste0("s(Posit, k=n.k[", i + 1, "], fx=FALSE, bs=\"cr\", by=Z[,",
           i, "])")
  }, "")
  my.covar.fm <- paste(c("s(Posit, k=n.k[1], fx=FALSE, bs=\"cr\")", 
                         formula.z.part), collapse = "+")
  
  if (RanEff) {
    my.covar.fm <- paste0(my.covar.fm, "+ s(ID, bs=\"re\")")
  }
  ## Fit gam for the initial value
  if (Quasi) {
    gam.int <- mgcv::gam(as.formula(paste0("Y/X ~", my.covar.fm)),
                         family = quasibinomial(link = binom.link), 
                         weights = data$X, data = data, method = method, 
                         scale = scale)
    msg <- paste("Process completed in",
                 format(Sys.time() - t0, digits = 2))
    if(verbose) Message(msg, step = "Finished")
    return(out <- list(data = data, gam.int = gam.int, 
                       my.covar.fm = my.covar.fm))
  } else {
    gam.int <- mgcv::gam(as.formula(paste0("Y/X ~", my.covar.fm)),
                         family = binomial(link = binom.link), 
                         weights = data$X, data = data,
                         method = method, scale = scale)
    msg <- paste("Process completed in",
                 format(Sys.time() - t0, digits = 2))
    if(verbose) Message(msg, step = "Finished")
    return(out <- list(gam.int = gam.int, my.covar.fm = my.covar.fm))
  }
}

#' @title phiFletcher calculate the Fletcher-based dispersion
#' estimate from the final gam output
#'
#' @description phiFletcher calculate the Fletcher-based dispersion
#' estimate from the
#' final gam output
#' @param data a data frame with rows as individual CpGs
#' appearing in all the samples.
#' The first 4 columns should be named
#' `Y` (methylated counts),
#' `X` (read depths), `Posit` (Genomic
#' position for the CpG site) and `ID`
#' (sample ID). The covariate information, such as
#' disease status or cell type composition
#' are listed in column 5 and onwards.
#' @param Quasi whether a Quasi-likelihood estimation approach
#' will be used
#' @param reml.scale whether a REML-based scale (dispersion)
#' estimator is used. The default
#' is Fletcher-based estimator
#' @param scale nagative values mean scale paramter should be estimated;
#' if a positive value
#' is provided, a fixed scale will be used.
#' @param gam.int a gam object from mgcv::gam
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return This function return a \code{phi_fletcher} object:
#' @author Kaiqiong Zhao, Simon Laurin-Lemay
#' @noRd
phiFletcher <- function(data, Quasi, reml.scale, scale, gam.int, 
                        verbose = TRUE) {
  t0 <- Sys.time()
  msg <- paste("Calculating the Fletcher-based",
               "dispersion estimate from the",
               "final gam output")
  if(verbose) Message(msg)
  
  if (!Quasi) {
    return(phi_fletcher <- 1)
  }
  if (scale > 0) {
    return(phi_fletcher <- scale)
  }
  if (Quasi & scale <= 0) {
    old.pi.ij <- gam.int$fitted.values
    edf.out <- gam.int$edf
    pearsonResiduals <- residuals(gam.int, type = "pearson")
    if (reml.scale) {
      msg <- paste("Process completed in",
                   format(Sys.time() - t0, digits = 2))
      if(verbose) Message(msg, step = "Finished")
      return(phi_fletcher <- gam.int$reml.scale)
    }
    ## * sqrt(data$X)
    my_s <- (1 - 2 * old.pi.ij)/(data$X * old.pi.ij * (1 - old.pi.ij)) *
      (data$Y - data$X * old.pi.ij)
    ## Note: the estimator implemented in the mgcv calculated my_s with an
    ## additional multiplier sqrt(data$X) But from the paper there shouldn't
    ## be this one phi_p=sum( pearsonResiduals^2)/(length(data$Y) -
    ## sum(edf.out))
    
    ## Feb 21, 2020 use edf1 to calculate the pearson's dispersion estimate
    ## not the edf; because I will use the asympototic chi-square dist. of
    ## phi.est March 3, 2020, the dispersion parameter estimated in gam use
    ## the edf.out instead of edf1.out
    phi_p <- sum(pearsonResiduals^2)/(length(data$Y) - sum(edf.out))
    msg <- paste("Process completed in",
                 format(Sys.time() - t0, digits = 2))
    if(verbose) Message(msg, step = "Finished")
    return(phi_fletcher <- phi_p/(1 + mean(my_s)))
  }
}


#' @title extract design matrix
#'
#' @description extract design matrix
#' @param GamObj a mgcv object
#' @param verbose logical indicates if the algorithm should provide progress
#' report information.
#' The default value is TRUE.
#' @return This function return a design matrix \code{X_d}: R matrix
#' from the QR decomposition
#' @importFrom mgcv predict.gam
#' @importFrom mgcv model.matrix.gam
#' @author Kaiqiong Zhao Simon Laurin-Lemay
#' @noRd
extractDesignMatrix <- function(GamObj, verbose = TRUE) {
  
  t0 <- Sys.time()
  if(verbose) Message("Extract design matrix")
  
  ## A more efficient way to extract design matrix. use a random sample of
  ## rows of the data to reduce the computational cost
  if (!is.null(GamObj$R)) {
    msg <- paste("Process completed in",
                 format(Sys.time() - t0, digits = 2))
    if(verbose) Message(msg, step = "Finished")
    return(X_d <- GamObj$R)
  } else {
    X_d <- mgcv::model.matrix.gam(GamObj)
    ## exclude NA's (possible under na.exclude)
    msg <- paste("Process completed in",
                 format(Sys.time() - t0, digits = 2))
    if(verbose) Message(msg, step = "Finished")
    return(X_d <- X_d[!is.na(rowSums(X_d)), ])
  }
}


#' @title testStat from the mgcv package
#'
#' @description testStat function is directly copied from the
#' package mgcv 1.8-26; Implements Wood (2013)
#'  Biometrika 100(1), 221-228
#' @param p the vector of the parameter
#' @param X R from the QR decomposition for the design matrix
#' @param V variance-covariance matrix of the vector of the
#' parameter
#' @param rank the rank of the V;use min(ncol(X),edf1i)
#' @param type specifies the type of truncation to use
#' 0. Default using the fractionally truncated pinv
#' 1. Round down to k if k<= rank < k+0.05, otherwise up.
#' The default value is 0
#' @param res.df  residual dof used to estimate scale.
#'  <=0 implies fixed scale
#' @return This function return the test statistic,
#' p-value and its rank for the given smooth term.
#' @importFrom mgcv predict.gam
#' @importFrom mgcv model.matrix.gam
#' @noRd
testStat <- function(p,X,V,rank=NULL,type=0,res.df= -1) {
  ## Implements Wood (2013) Biometrika 100(1), 221-228
  ## The type argument specifies the type of truncation to use.
  ## on entry `rank' should be an edf estimate
  ## 0. Default using the fractionally truncated pinv.
  ## 1. Round down to k if k<= rank < k+0.05, otherwise up.
  ## res.df is residual dof used to estimate scale. <=0 implies
  ## fixed scale.
  
  mgcv_liu2 <- utils::getFromNamespace("liu2", "mgcv")
  mgcv_simf <- utils::getFromNamespace("simf", "mgcv")
  qrx <- qr(X,tol=0)
  R <- qr.R(qrx)
  V <- R%*%V[qrx$pivot,qrx$pivot,drop=FALSE]%*%t(R)
  V <- (V + t(V))/2
  ed <- eigen(V,symmetric=TRUE)
  ## remove possible ambiguity from statistic...
  siv <- sign(ed$vectors[1,]);siv[siv==0] <- 1
  ed$vectors <- sweep(ed$vectors,2,siv,"*")
  
  k <- max(0,floor(rank))
  nu <- abs(rank - k)     ## fractional part of supplied edf
  if (type==1) { ## round up is more than .05 above lower
    if (rank > k + .05||k==0) k <- k + 1
    nu <- 0;rank <- k
  }
  
  if (nu>0) k1 <- k+1 else k1 <- k
  
  ## check that actual rank is not below supplied rank+1
  r.est <- sum(ed$values > max(ed$values)*.Machine$double.eps^.9)
  if (r.est<k1) {k1 <- k <- r.est;nu <- 0;rank <- r.est}
  
  ## Get the eigenvectors...
  # vec <- qr.qy(qrx,rbind(ed$vectors,matrix(0,nrow(X)-ncol(X),ncol(X))))
  vec <- ed$vectors
  if (k1<ncol(vec)) vec <- vec[,1:k1,drop=FALSE]
  
  ## deal with the fractional part of the pinv...
  if (nu>0&&k>0) {
    if (k>1) vec[,1:(k-1)] <- t(t(vec[,1:(k-1)])/sqrt(ed$val[1:(k-1)]))
    b12 <- .5*nu*(1-nu)
    if (b12<0) b12 <- 0
    b12 <- sqrt(b12)
    B <- matrix(c(1,b12,b12,nu),2,2)
    ev <- diag(ed$values[k:k1]^-.5,nrow=k1-k+1)
    B <- ev%*%B%*%ev
    eb <- eigen(B,symmetric=TRUE)
    rB <- eb$vectors%*%diag(sqrt(eb$values))%*%t(eb$vectors)
    vec1 <- vec
    vec1[,k:k1] <- t(rB%*%diag(c(-1,1))%*%t(vec[,k:k1]))
    vec[,k:k1] <- t(rB%*%t(vec[,k:k1]))
  } else {
    vec1 <- vec <- if (k==0) t(t(vec)*sqrt(1/ed$val[1])) else
      t(t(vec)/sqrt(ed$val[1:k]))
    if (k==1) rank <- 1
  }
  ## there is an ambiguity in the choise of test statistic, leading to slight
  ## differences in the p-value computation depending on which of 2 alternatives
  ## is arbitrarily selected. Following allows both to be computed and p-values
  ## averaged (can't average test stat as dist then unknown)
  d <- t(vec)%*%(R%*%p)
  d <- sum(d^2)
  d1 <- t(vec1)%*%(R%*%p)
  d1 <- sum(d1^2)
  ##d <- d1 ## uncomment to avoid averaging
  
  rank1 <- rank ## rank for lower tail pval computation below
  
  ## note that for <1 edf then d is not weighted by EDF, and instead is
  ## simply refered to a chi-squared 1
  
  if (nu>0) { ## mixture of chi^2 ref dist
    if (k1==1) rank1 <- val <- 1 else {
      val <- rep(1,k1) ##ed$val[1:k1]
      rp <- nu+1
      val[k] <- (rp + sqrt(rp*(2-rp)))/2
      val[k1] <- (rp - val[k])
    }
    
    if (res.df <= 0) pval <- (mgcv_liu2(d,val) + mgcv_liu2(d1,val))/2 else ##  pval <- davies(d,val)$Qq else
      pval <- (mgcv_simf(d,val,res.df) + mgcv_simf(d1,val,res.df))/2
  } else { pval <- 2 }
  ## integer case still needs computing, also liu/pearson approx only good in
  ## upper tail. In lower tail, 2 moment approximation is better (Can check this
  ## by simply plotting the whole interesting range as a contour plot!)
  if (pval > .5) {
    if (res.df <= 0) pval <- (pchisq(d,df=rank1,lower.tail=FALSE)+pchisq(d1,df=rank1,lower.tail=FALSE))/2 else
      pval <- (pf(d/rank1,rank1,res.df,lower.tail=FALSE)+pf(d1/rank1,rank1,res.df,lower.tail=FALSE))/2
  }
  list(stat=d,pval=min(1,pval),rank=rank)
} ## end of testStat
kaiqiong/SOMNiBUS documentation built on Feb. 24, 2023, 5:38 a.m.