myTAI: Evolutionary Transcriptomics

Documented in PairwiseTest

#' @title Perform Pairwise Difference Test
#' @description The \emph{Pairwise Difference Test} aims to statistically evaluate the pairwise
#' difference in the phylotranscriptomic pattern between two contrasts based on \code{\link{TAI}} or \code{\link{TDI}} computations.
#' The corresponding p-value quantifies the probability that a given TAI or TDI pattern (or any phylotranscriptomics pattern) 
#' does not differ from the alternative hypothesis (specifying the direction of difference).
#' A p-value < 0.05 indicates that the corresponding phylotranscriptomics pattern does
#' indeed differ.
#' @param ExpressionSet a standard PhyloExpressionSet or DivergenceExpressionSet object.
#' @param modules a list storing two elements: contrast1 and contrast2. Each element expects a numeric
#' vector specifying the developmental stages or experiments that correspond to each module. 
#' For example, \code{module} = list(contrast1 = 1:2, contrast2 = 3:7) divides a dataset 
#' storing seven developmental stages into 2 modules (contrasts).
#' @param altHypothesis a character string defining the alternative hypothesis (i.e. direction of difference)
#' used to quantify the statistical significance in the present phylotranscriptomics pattern.
#' Possible values can be:
#' \itemize{
#' \item \code{altHypothesis} = \code{"greater"} : contrast1 > contrast2
#' \item \code{altHypothesis} = \code{"less"} : contrast1 < contrast2
#' }
#' @param permutations a numeric value specifying the number of permutations to be performed for the \code{ReductiveHourglassTest}.
#' @param lillie.test a boolean value specifying whether the Lilliefors Kolmogorov-Smirnov Test shall be performed to quantify the goodness of fit.
#' @param plotHistogram a boolean value specifying whether a \emph{Lillifor's Kolmogorov-Smirnov-Test} 
#' shall be performed to test the goodness of fit of the approximated distribution, as well as additional plots quantifying the significance
#' of the observed phylotranscriptomic pattern.
#' @param runs specify the number of runs to be performed for goodness of fit computations, in case \code{plotHistogram} = \code{TRUE}.
#' In most cases \code{runs} = 100 is a reasonable choice. Default is \code{runs} = 10 (because it takes less computation time for demonstration purposes).
#' @param parallel performing \code{runs} in parallel (takes all cores of your multicore machine).
#' @param gof.warning a logical value indicating whether non significant goodness of fit results should be printed as warning. Default is \code{gof.warning = FALSE}.
#' @param custom.perm.matrix a custom \code{\link{bootMatrix}} (permutation matrix) to perform the underlying test statistic. Default is \code{custom.perm.matrix = NULL}.
#' @details The \emph{reductive pairwise difference test} is a permutation test based on the following test statistic. 
#'
#' (1) A PhyloExpressionSet is partitioned into contrast pairs - contrast1 and contrast2 - based on prior biological knowledge. 
#' This prior knowledge could include sexual, ecological and genetic backgrounds.
#'
#' (2) The mean \code{\link{TAI}} or \code{\link{TDI}} value for each of the two contrasts contrast1 and contrast2 are computed. 
#'
#' (3) The pairwise differences D_contrast = contrast1 - contrast2 is calculated as final test statistic of the pairwise test,
#' when \code{altHypothesis} is specified as "greater". When \code{altHypothesis} is specified as "less", sign of D_contrast is reversed.
#' 
#' 
#' In order to determine the statistical significance of an observed pairwise difference D_contrast 
#' the following permutation test was performed. Based on the \code{\link{bootMatrix}} D_contrast 
#' is calculated from each of the permuted \code{\link{TAI}} or \code{\link{TDI}} profiles, 
#' approximated by a Gaussian distribution with method of moments estimated parameters returned by \code{\link[fitdistrplus]{fitdist}}, 
#' and the corresponding p-value is computed by \code{\link{pnorm}} given the estimated parameters of the Gaussian distribution. 
#' The \emph{goodness of fit} for the random vector \emph{D_contrast} is statistically quantified by an Lilliefors (Kolmogorov-Smirnov) test 
#' for normality.
#'
#'
#' In case the parameter \emph{plotHistogram = TRUE}, a multi-plot is generated showing:
#'        
#' (1) A Cullen and Frey skewness-kurtosis plot generated by \code{\link[fitdistrplus]{descdist}}. 
#' This plot illustrates which distributions seem plausible to fit the resulting permutation vector D_contrast 
#' In the case of the \emph{pairwise difference test} a normal distribution seemed plausible.
#'
#' (2) A histogram of D_contrast combined with the density plot is plotted. D_contrast is then fitted by a normal distribution. 
#' The corresponding parameters are estimated by \emph{moment matching estimation} using the \code{\link[fitdistrplus]{fitdist}} function.
#'
#' (3) A plot showing the p-values for N independent runs to verify that a specific p-value is biased by a specific permutation order.
#'
#' (4) A barplot showing the number of cases in which the underlying goodness of fit (returned by Lilliefors (Kolmogorov-Smirnov) test 
#' for normality) has shown to be significant (\code{TRUE}) or not significant (\code{FALSE}). 
#' This allows to quantify the permutation bias and their implications on the goodness of fit.
#' @return a list object containing the list elements:
#' 
#' \code{p.value} : the p-value quantifying the statistical significance of the given phylotranscriptomics pattern.
#'
#' \code{std.dev} : the standard deviation of the N sampled phylotranscriptomics patterns for each developmental stage S.
#' 
#' \code{lillie.test} : a boolean value specifying whether the \emph{Lillifors KS-Test} returned a p-value > 0.05, 
#' which indicates that fitting the permuted scores with a normal distribution seems plausible.
#' 
#' @references 
#' 
#' Drost HG et al. (2015) Mol Biol Evol. 32 (5): 1221-1231 doi:10.1093/molbev/msv012
#'
#' Lotharukpong JS et al. (2024) (unpublished)
#' 
#' @author Hajk-Georg Drost and Jaruwatana Sodai Lotharukpong
#' @seealso \code{\link{lcScore}}, \code{\link{bootMatrix}}, \code{\link{FlatLineTest}},\code{\link{ReductiveHourglassTest}}, \code{\link{ReverseHourglassTest}}, \code{\link{PlotSignature}}
#' @examples
#' 
#' data(PhyloExpressionSetExample)
#'
#' # perform the pairwise difference test for a PhyloExpressionSet
#' # here the prior biological knowledge is that stages 1-2 correspond to contrast1,
#' # stages 3-7 correspond to contrast 2.
#' # We test whether TAI in contrast1 is greater than contrast 2.
#' PairwiseTest(PhyloExpressionSetExample,
#'          modules = list(contrast1 = 1:2, contrast2 = 3:7), 
#'          altHypothesis = "greater",
#'          permutations = 1000)
#'
#' # We can also test whether TAI in contrast1 is less than contrast 2.
#' PairwiseTest(PhyloExpressionSetExample,
#'          modules = list(contrast1 = 1:2, contrast2 = 3:7), 
#'          altHypothesis = "less",
#'          permutations = 1000)
#' 
#' # if we only want to test whether TAI in stage 1 (contrast 1) is greater than stage 3 (contrast 2).
#' PairwiseTest(PhyloExpressionSetExample,
#'          modules = list(contrast1 = 1, contrast2 = 2), 
#'          altHypothesis = "greater",
#'          permutations = 1000)
#' 
#' # use your own permutation matrix based on which p-values (PairwiseTest)
#' # shall be computed
#' custom_perm_matrix <- bootMatrix(PhyloExpressionSetExample,100)
#' # We test whether TAI in contrast1 is greater than contrast 2.
#' PairwiseTest(PhyloExpressionSetExample,
#'          modules = list(contrast1 = 1:2, contrast2 = 3:7), 
#'          altHypothesis = "greater",
#'          custom.perm.matrix = custom_perm_matrix)
#'                        
#' @import foreach
#' @export

PairwiseTest <- function(ExpressionSet,
                     modules            = NULL,
                     altHypothesis      = NULL,
                     permutations       = 1000, 
                     lillie.test        = FALSE, 
                     plotHistogram      = FALSE, 
                     runs               = 10, 
                     parallel           = FALSE,
                     gof.warning        = FALSE,
                     custom.perm.matrix = NULL){
  
  is.ExpressionSet(ExpressionSet)
  
  if(is.null(modules))
    stop("Please specify the two modules: contrast1 and contrast2 using the argument 'modules = list(contrast1 = ..., contrast2 = ...)'.", call. = FALSE)
  
  if(any(table(unlist(modules)) > 1))
    stop("Intersecting modules are not defined for the PairwiseTest.", call. = FALSE)
  
  if(length(modules) != 2)
    stop("Please specify two modules: contrast1 and contrast2 to perform the PairwiseTest.", call. = FALSE)
  
  if(!names(modules)[1] == "contrast1" & !names(modules)[2] == "contrast2")
    stop("Please specify two modules: contrast1 and contrast2 to perform the PairwiseTest.", call. = FALSE)
  
  # Perhaps this requirement in other test can be relaxed for the PairwiseTest
  # if(length(unlist(modules)) != (dim(ExpressionSet)[2] - 2))
  #   stop("The number of stages classified into the two modules does not match the total number of stages stored in the given ExpressionSet.", call. = FALSE)
  if(length(unlist(modules)) > (dim(ExpressionSet)[2] - 2))
    stop("The module selection is outside the range of the given ExpressionSet.", call. = FALSE)
  
  
  nCols <- dim(ExpressionSet)[2]
  score_vector <- vector(mode = "numeric",length = permutations)
  resMatrix <- matrix(NA_real_, permutations,(nCols-2))
  real_age <- vector(mode = "numeric",length = nCols-2)
  real_age <- cpp_TAI(as.matrix(dplyr::select(ExpressionSet, 3:ncol(ExpressionSet))),
                      as.vector(unlist(dplyr::select(ExpressionSet, 1))))
  
  # compute the real pairwise score of the observed phylotranscriptomics pattern
  # pairScore = pairwise difference score
  real_pscore <- pairScore(real_age,contrast1 = modules[[1]],contrast2 = modules[[2]],altHypothesis=altHypothesis)
  options(warn=1)
  
  ### compute the bootstrap matrix 
  if (is.null(custom.perm.matrix)){
    resMatrix <- cpp_bootMatrix(as.matrix(dplyr::select(ExpressionSet, 3:ncol(ExpressionSet))),as.vector(unlist(dplyr::select(ExpressionSet, 1))),as.numeric(permutations))
  }
  
  else if (!is.null(custom.perm.matrix)){
    resMatrix <- custom.perm.matrix
  }
  
  ### compute the global phylotranscriptomics destruction scores for each sampled age vector
  score_vector <- apply(resMatrix, 1 ,pairScore,contrast1 = modules[[1]],
                        contrast2 = modules[[2]],altHypothesis=altHypothesis)
  
  
  # parameter estimators using MASS::fitdistr
  param <- fitdistrplus::fitdist(score_vector,"norm", method = "mme")
  mu <- param$estimate[1]
  sigma <- param$estimate[2]
  
  if(plotHistogram){
    
    if(runs < 1)
      stop("You need at least one run...")
    
    if(lillie.test)
      graphics::par(mfrow = c(2,2))
    if(!lillie.test)
      graphics::par(mfrow = c(1,3))
    
    fitdistrplus::descdist(score_vector, boot = permutations)
    
    # plot histogram of scores
    normDensity <- function(x){
      
      return(stats::dnorm(x,mu,sigma))
      
    }
    
    graphics::curve( expr = normDensity,
                     xlim = c(min(score_vector,real_pscore),max(score_vector,real_pscore)),
                     col  = "steelblue",
                     lwd  = 5,
                     xlab = "Scores",
                     ylab = "Frequency" )
    
    graphics::hist( x      = score_vector,
                    prob   = TRUE,
                    add    = TRUE, 
                    breaks = permutations / (0.01 * permutations) )
    
    graphics::rug(score_vector)
    
    # plot a red line at the position where we can find the real lt value
    graphics::abline(v = real_pscore, lwd = 5, col = "darkred")
    
    #legend("topleft", legend = "A", bty = "n")
    
    p.vals_vec <- vector(mode = "numeric", length = runs)
    lillie_vec <- vector(mode = "logical", length = runs)
    pct <- vector(mode = "list", length = 3)
    
    #cat("\n")
    
    if(parallel){
      
      ### Parallellizing the sampling process using the 'doParallel' and 'parallel' package
      ### register all given cores for parallelization
      par_cores <- parallel::makeForkCluster(parallel::detectCores())
      doParallel::registerDoParallel(par_cores)
      
      # perform the sampling process in parallel
      parallel_results <- foreach::foreach(i              = 1:runs,
                                           .combine       = "rbind",
                                           .errorhandling = "stop") %dopar% {
                                             
                                             
                                             data.frame(PairwiseTest( ExpressionSet = ExpressionSet,
                                                                  permutations  = permutations,
                                                                  altHypothesis = altHypothesis,
                                                                  lillie.test   = TRUE,
                                                                  plotHistogram = FALSE, 
                                                                  modules       = modules )[c(1,3)])
                                             
                                           }
      
      parallel::stopCluster(par_cores)
      
      colnames(parallel_results) <- c("p.value","lillie.test")
      
      p.vals_vec <- parallel_results[ ,"p.value"]
      lillie_vec <- parallel_results[ ,"lillie.test"]
      
    }
    
    if(!parallel){
      
      # sequential computations of p-values 
      #                         if(runs >= 10){
      #                                 # initializing the progress bar
      #                                 #progressBar <- txtProgressBar(min = 1,max = runs,style = 3)
      #                                 
      #                         }
      
      for(i in 1:runs){
        
        if(lillie.test){
          
          pct <- PairwiseTest( ExpressionSet = ExpressionSet,
                           permutations  = permutations,
                           altHypothesis = altHypothesis,
                           lillie.test   = TRUE, 
                           plotHistogram = FALSE,
                           modules       = list(contrast1 = modules[[1]],contrast1 = modules[[2]]),
                           runs          = NULL ) 
        }
        
        
        if(!lillie.test){
          
          pct <- PairwiseTest( ExpressionSet = ExpressionSet,
                           permutations  = permutations,
                           altHypothesis = altHypothesis,
                           lillie.test   = FALSE, 
                           plotHistogram = FALSE,
                           modules       = list(contrast1 = modules[[1]],contrast1 = modules[[2]]),
                           runs          = NULL )
        }
        
        p.vals_vec[i] <- pct$p.value
        
        if(lillie.test)
          lillie_vec[i] <- pct$lillie.test
        
        #                                 if(runs >= 10){
        #                                         # printing out the progress
        #                                         setTxtProgressBar(progressBar,i)
        #                                 }
      }
    }
    
    graphics::plot( x    = p.vals_vec,
                    type = "l" , 
                    lwd  = 6, 
                    ylim = c(0,1), 
                    col  = "darkblue", 
                    xlab = "Runs", 
                    ylab = "p-value" )
    
    graphics::abline(h = 0.05, lty = 2, lwd = 3)
    
    if(lillie.test){
      tbl <- table(factor(lillie_vec, levels = c("FALSE","TRUE")))
      
      graphics::barplot( height    = tbl/sum(tbl) , 
                         beside    = TRUE, 
                         names.arg = c("FALSE", "TRUE"), 
                         ylab      = "relative frequency", 
                         main      = paste0("runs = ",runs) )
    }
  }
  
  
  #if(real_pscore >= 0)
  pval <- stats::pnorm(real_pscore,mean = mu,sd = sigma,lower.tail = FALSE)
  
  #if(real_pscore < 0)
  #pval <- pnorm(real_pscore,mean=mu,sd=sigma,lower.tail=TRUE)
  ### computing the standard deviation of the sampled TAI values for each stage separately
  sd_vals <- vector(mode = "numeric",length = nCols-2)
  sd_vals <- apply(resMatrix,2,stats::sd)
  
  if(lillie.test){
    # perform Lilliefors K-S-Test
    lillie_p.val <- nortest::lillie.test(score_vector)$p.value
    # does the Lilliefors test pass the criterion
    lillie_bool <- (lillie_p.val > 0.05)
    
    if(gof.warning & (lillie_p.val < 0.05) & (!plotHistogram)){
      warning("Lilliefors (Kolmogorov-Smirnov) test for normality did not pass the p > 0.05 criterion!")
    }
  }
  
  if(lillie.test)
    return(list(p.value = pval,std.dev = sd_vals,lillie.test = lillie_bool))
  if(!lillie.test)
    return(list(p.value = pval,std.dev = sd_vals,lillie.test = NA))
  
  
}