myTAI: Evolutionary Transcriptomics

Documented in FlatLineTest

#' @title Perform Flat Line Test
#' @description This function quantifies the statistical significance of an observed phylotranscriptomic pattern. In detail, the \emph{Flat Line Test} quantifies any significant deviation of an observed phylotranscriptomic pattern from a flat line.
#' @param ExpressionSet a standard PhyloExpressionSet or DivergenceExpressionSet object.
#' @param permutations a numeric value specifying the number of permutations that shall be performed for the \emph{FlatLineTest}.
#' @param plotHistogram a logical value indicating whether a detailed statistical analysis concerning the goodness of fit should be performed.
#' @param runs specify the number of runs to be performed for goodness of fit computations. In most cases runs = 100 is a reasonable choice.
#' @param parallel performing \code{runs} in parallel (takes all cores of your multicore machine).
#' @param custom.perm.matrix a custom \code{\link{bootMatrix}} (permutation matrix) to perform the underlying test statistic. Default is \code{custom.perm.matrix = NULL}.
#' @details Internally the function performs N phylotranscriptomics pattern computations (\code{\link{TAI}} or \code{\link{TDI}}) based on sampled PhyloExpressionSets or DivergenceExpressionSets (see \code{\link{bootMatrix}}).
#' The test statistics is being developed as follows:
#'
#' The variance \emph{V_pattern} of the S phylotranscriptomics values defines the test statistic for the \code{\link{FlatLineTest}}.
#' The basic assumption is, that the variance of a flat line should be equivalent to zero for a perfect flat line.
#' Any deviation from a flat line can be measured with a variance value > 0.
#'
#' To determine the null distribution of \emph{V_p}, all PS or DS values within each developmental stage s are randomly permuted,
#'  S surrogate phylotranscriptomics values are computed from this permuted dataset, and a surrogate value of \emph{V_p} is
#'   computed from these S phylotranscriptomics values. This permutation process is repeated N times, yielding a histogram of \emph{V_p}.
#'
#' After applying a \emph{Lilliefors Kolmogorov-Smirnov Test for gamma distribution}, \emph{V_p} is approximated by a \emph{gamma distribution}.
#' The two parameters of the \emph{gamma distribution} are estimated by the function \code{\link[fitdistrplus]{fitdist}} from the \pkg{fitdistrplus} package by \emph{moment matching estimation}.
#' The fitted \emph{gamma distribution} is considered the null distribution of \emph{V_pattern}, and the p-value of the observed value of \emph{V_p}
#' is computed from this null distribution.
#'
#' In case the parameter \emph{plotHistogram = TRUE}, a multi-plot is generated showing:
#'
#' (1) A Cullen and Frey skewness-kurtosis plot generated by \code{\link[fitdistrplus]{descdist}}).
#'
#' (2) A histogram of V_p combined with the density plot using the Method of Moments estimated parameters returned by the \code{\link[fitdistrplus]{fitdist}} function using a gamma distribution.
#'
#' (3) A plot showing the p-values for N independent runs to verify that a specific p-value is biased by a specific permutation order.
#'
#' The \emph{goodness of fit} for the random vector \emph{V_p} is quantified statistically by an adapted Lilliefors (Kolmogorov-Smirnov) test for gamma distributions.
#' @return a list object containing the list elements:
#' \itemize{
#' \item \code{p.value} the p-value quantifying the statistical significance (deviation from a flat line) of the given phylotranscriptomics pattern.
#' \item \code{std.dev} the standard deviation of the N sampled phylotranscriptomics patterns for each developmental stage S.
#' \item \code{std.dev} the Kolmogorov-Smirnov test satistics for fitting a gamma distribution to the variances of the dataset with permuted phylostrata.
#' }
#' @references
#'
#' Drost HG et al. (2015) Mol Biol Evol. 32 (5): 1221-1231 doi:10.1093/molbev/msv012
#'
#' Quint M et al. (2012). A transcriptomic hourglass in plant embryogenesis. Nature (490): 98-101.
#'
#' M. L. Delignette-Muller, R. Pouillot, J.-B. Denis and C. Dutang (2014), fitdistrplus: help to fit of a parametric distribution to non-censored or censored data.
#'
#' Cullen AC and Frey HC (1999) Probabilistic techniques in exposure assessment. Plenum Press, USA, pp. 81-159.
#'
#' Evans M, Hastings N and Peacock B (2000) Statistical distributions. John Wiley and Sons Inc.
#'
#' Sokal RR and Rohlf FJ (1995) Biometry. W.H. Freeman and Company, USA, pp. 111-115.
#'
#' Juergen Gross and bug fixes by Uwe Ligges (2012). nortest: Tests for Normality. R package version 1.0-2.
#'
#' http://CRAN.R-project.org/package=nortest
#'
#' Dallal, G.E. and Wilkinson, L. (1986): An analytic approximation to the distribution of Lilliefors test for normality. The American Statistician, 40, 294-296.
#'
#' Stephens, M.A. (1974): EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69, 730-737.
#'
#' http://stackoverflow.com/questions/4290081/fitting-data-to-distributions?rq=1
#'
#' http://stats.stackexchange.com/questions/45033/can-i-use-kolmogorov-smirnov-test-and-estimate-distribution-parameters
#'
#' http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
#'
#' http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
#'
#' @author Hajk-Georg Drost
#' @note In case there are extreme outlier expression values stored in the dataset (PhyloExpressionSet or DivergenceExpressionSet),
#'  the internal \code{\link[fitdistrplus]{fitdist}} function that is based on the \code{\link{bootMatrix}} output might return a warning:
#'   "In densfun(x, parm[1], parm[2], ...) : NaNs were produced" which indicates that permutation results caused by extreme outlier expression values
#'   that could not be fitted accordingly. This warning will not be printed out when the corresponding outlier values are extracted from the dataset.
#' @seealso \code{\link{TAI}}, \code{\link{TDI}}, \code{\link{PlotSignature}}, \code{\link{bootMatrix}}
#' @examples
#'
#' # read standard phylotranscriptomics data
#' data(PhyloExpressionSetExample)
#'
#' # example PhyloExpressionSet using 100 permutations
#' FlatLineTest(PhyloExpressionSetExample,
#'              permutations  = 100,
#'              plotHistogram = FALSE)
#'
#' # use your own permutation matrix based on which p-values (FlatLineTest)
#' # shall be computed
#' custom_perm_matrix <- bootMatrix(PhyloExpressionSetExample,100)
#'
#' FlatLineTest(PhyloExpressionSetExample,
#'              custom.perm.matrix = custom_perm_matrix)
#'
#' @import foreach
#' @export
#'
#'
#'

FlatLineTest <- function(ExpressionSet,
                         permutations       = 10000,
                         plotHistogram      = FALSE,
                         runs               = 10,
                         parallel           = FALSE,
                         custom.perm.matrix = NULL)
{
  is.ExpressionSet(ExpressionSet)
  
  if (plotHistogram & is.null(runs))
    stop(
      "Please specify the number of runs to be performed for the goodness of fit computations.",
      call. = FALSE
    )
  
  nCols <- dim(ExpressionSet)[2]
  resMatrix <- matrix(NA_real_, permutations, (nCols - 2))
  var_values <- vector(mode = "numeric", length = permutations)
  sd_values <-vector(mode = "numeric", length = nCols - 2)
  #random_mean_age <- vector(mode = "numeric", length = permutations)
  age.real <- vector(mode = "numeric", length = nCols - 2)
  age.real <-
    cpp_TAI(as.matrix(dplyr::select(ExpressionSet, 3:ncol(ExpressionSet))), as.vector(unlist(dplyr::select(ExpressionSet, 1))))
  ### compute the real variance of TAIs of the observed TAI/TDI-Hourglass pattern
  real.var <- stats::var(age.real)
  ### sample only the phylostrata (row-permutations)
  
  if (is.null(custom.perm.matrix)) {
    start_time = Sys.time()
    resMatrix <-
      cpp_bootMatrix(as.matrix(dplyr::select(
        ExpressionSet, 3:ncol(ExpressionSet)
      )),
      as.vector(unlist(dplyr::select(
        ExpressionSet, 1
      ))),
      as.numeric(permutations))
    end_time = Sys.time()
    cat("\n")
    cat(paste("Total runtime of your permutation test:", round(end_time - start_time, 3), " seconds."))
  }
  
  else if (!is.null(custom.perm.matrix)) {
    resMatrix <- custom.perm.matrix
  }
  
  var_values <- apply(resMatrix, 1, stats::var)
  #print(sum(var_values > real.var)/length(var_values))
  #random_mean_age <- apply(resMatrix,2,mean)
  ### estimate the parameters (shape,rate)
  ### of the gamma distributed variance values
  ### using: method of moments estimation
  gamma_MME = GetGamma(var_values)
  ### estimate shape:
  shape <- gamma_MME[[1]]
  ### estimate the rate:
  rate <- gamma_MME[[2]]
  ks_test = gamma_MME[[3]]
  # in case the fitting fails
  if (shape == 0 & rate == 0) {
    gamma = fitdistrplus::fitdist(var_values, "gamma", method = "mme")
    shape = gamma$estimate[1]
    rate = gamma$estimate[2]
    suppressWarnings(ks_test <- stats::ks.test(var_values, "pgamma", shape = shape, rate = rate))
  }
  if (permutations < 20000) {
    message("\n")
    message(
      "-> We recommended using at least 20000 permutations to achieve a sufficient permutation test."
    )
  }
  
  if (plotHistogram) {
    gammaDensity <- function(x) {
      return(stats::dgamma(
        x = x,
        shape = shape,
        rate = rate
      ))
      
    }
    
    graphics::par(mfrow = c(1, 3))
    # plot a Cullen and Frey graph
    fitdistrplus::descdist(var_values, boot = permutations)
    # plot the histogram and the fitted curve
    graphics::curve(
      expr = gammaDensity,
      xlim = c(min(c(
        var_values, real.var
      )), max(c(
        var_values, real.var
      ))),
      col  = "steelblue",
      lwd  = 5,
      xlab = "Variances",
      ylab = "Frequency",
      main = paste0("permutations = ", permutations)
    )
    
    histogram <- graphics::hist(
      x      = var_values,
      prob   = TRUE,
      add    = TRUE,
      breaks = permutations / (0.01 * permutations)
    )
    graphics::rug(var_values)
    
    graphics::abline(
      v = real.var,
      lty = 1,
      lwd = 4,
      col = "darkred"
    )
    
    p.vals_vec <- vector(mode = "numeric", length = runs)
    
    if (parallel) {
      ### Parallellizing the sampling process using the 'doParallel' and 'parallel' package
      ### register all given cores for parallelization
      par_cores <-
        parallel::makeForkCluster(parallel::detectCores())
      doParallel::registerDoParallel(par_cores)
      
      ### Perform the sampling process in parallel
      p.vals_vec <-
        as.vector(foreach::foreach(
          i              = 1:runs,
          .combine       = "c",
          .errorhandling = "stop"
        ) %dopar%
          {
            FlatLineTest(ExpressionSet = ExpressionSet,
                         permutations = permutations)$p.value
            
          })
      
      parallel::stopCluster(par_cores)
      
    }
    
    if (!parallel) {
      # sequential computations of p-values
      # initializing the progress bar
      #progressBar <- txtProgressBar(min = 1,max = runs,style = 3)
      
      
      for (i in 1:runs) {
        p.vals_vec[i] <-  FlatLineTest(ExpressionSet = ExpressionSet,
                                      permutations = permutations)$p.value
        
        # printing out the progress
        #setTxtProgressBar(progressBar,i)
      }
    }
    
    graphics::plot(
      x    = p.vals_vec,
      type = "l" ,
      lwd  = 6,
      ylim = c(0, 1),
      col  = "darkblue",
      xlab = "Runs",
      ylab = "P-value",
      main = paste0("runs = ", runs)
    )
    
    graphics::abline(
      h = 0.05,
      lty = 2,
      lwd = 3,
      col = "darkred"
    )
    
  }
  pval <-
    stats::pgamma(real.var,
                  shape = shape,
                  rate = rate,
                  lower.tail = FALSE)
  
  sd_values <- apply(resMatrix, 2, stats::sd)
  return(list(
    p.value = pval,
    std.dev = sd_values,
    ks.test = ks_test
  ))
}


GetGamma <- function(var_values)
{
  iterations = 200
  max_cut = 0.25
  step = max_cut/iterations 
  sorted_vars = sort(var_values, decreasing = TRUE)
  max_p_fit_v = 0
  max_p_i = 0
  for (i in 2:iterations) {
    # to avoid indexing from zero
    # Filtered variances
    filtered_vars <-
      sorted_vars[round(length(var_values) * i * step):length(var_values)]
    
    # Estimate parameters using method of moments
    gamma_fit <-
      fitdistrplus::fitdist(filtered_vars, "gamma", method = "mme")
    shape <- gamma_fit$estimate[1]
    rate <- gamma_fit$estimate[2]
    # Perform Kolmogorov-Smirnov test
    suppressWarnings(ks_result <-
      stats::ks.test(filtered_vars,
                     "pgamma",
                     shape = shape,
                     rate = rate))
    if (ks_result$p.value > max_p_fit_v) {
      max_p_i = i
      max_p_fit_v = ks_result$p.value
    }
  }
  if (max_p_i == 0) {
    gamma_fit <-
      fitdistrplus::fitdist(var_values, "gamma", method = "mme")
    return(list(gamma_fit$estimate[1], gamma_fit$estimate[2],suppressWarnings(ks_result <-
                  stats::ks.test(var_values,
                                 "pgamma",
                                 shape = gamma_fit$estimate[1],
                                 rate = gamma_fit$estimate[2]))))
  }
  b_shape = 0
  b_rate = 0
  ks_best = NULL
  max_p_fit_v = 0
  for (i in -10:10) {
    # Filtered variances
    lb = round(length(var_values) * (max_p_i * step + i * step / 10))
    filtered_vars <-
      sorted_vars[lb:length(var_values)]
    
    # Estimate parameters using method of moments
    gamma_fit <-
      fitdistrplus::fitdist(filtered_vars, "gamma", method = "mme")
    shape <- gamma_fit$estimate[1]
    rate <- gamma_fit$estimate[2]
    # Perform Kolmogorov-Smirnov test
    suppressWarnings(ks_result <-
      stats::ks.test(filtered_vars,
                     "pgamma",
                     shape = shape,
                     rate = rate))
    if (ks_result$p.value > max_p_fit_v) {
      max_p_fit_v = ks_result$p.value
      b_shape = shape
      b_rate = rate
      ks_best = ks_result
    }
  }
  return(list(b_shape, b_rate, ks_best))
}