myTAI: Evolutionary Transcriptomics

Documented in ReverseHourglassTest

#' @title Perform the Reverse Hourglass Test
#' @description The \emph{Reverse Hourglass Test} aims to statistically evaluate the
#' existence of a reverse hourglass pattern based on \code{\link{TAI}} or \code{\link{TDI}} computations.
#' The corresponding p-value quantifies the probability that a given TAI or TDI pattern (or any phylotranscriptomics pattern)
#' does follow an hourglass like shape. A p-value < 0.05 indicates that the corresponding phylotranscriptomics pattern does
#' rather follow a reverse hourglass (low-high-low) shape.
#' @param ExpressionSet a standard PhyloExpressionSet or DivergenceExpressionSet object.
#' @param modules a list storing three elements: early, mid, and late. Each element expects a numeric
#' vector specifying the developmental stages or experiments that correspond to each module.
#' For example, \code{module} = list(early = 1:2, mid = 3:5, late = 6:7) devides a dataset
#' storing seven developmental stages into 3 modules.
#' @param permutations a numeric value specifying the number of permutations to be performed for the \code{ReverseHourglassTest}.
#' @param lillie.test a boolean value specifying whether the Lilliefors Kolmogorov-Smirnov Test shall be performed to quantify the goodness of fit.
#' @param plotHistogram a boolean value specifying whether a \emph{Lillifor's Kolmogorov-Smirnov-Test}
#' shall be performed to test the goodness of fit of the approximated distribution, as well as additional plots quantifying the significance
#' of the observed phylotranscriptomic pattern.
#' @param runs specify the number of runs to be performed for goodness of fit computations, in case \code{plotHistogram} = \code{TRUE}.
#' In most cases \code{runs} = 100 is a reasonable choice. Default is \code{runs} = 10 (because it takes less computation time for demonstration purposes).
#' @param parallel performing \code{runs} in parallel (takes all cores of your multicore machine).
#' @param gof.warning a logical value indicating whether non significant goodness of fit results should be printed as warning. Default is \code{gof.warning = FALSE}.
#' @param custom.perm.matrix a custom \code{\link{bootMatrix}} (permutation matrix) to perform the underlying test statistic. Default is \code{custom.perm.matrix = NULL}.
#' @details
#' The reverse hourglass test is a permutation test based on the following test statistic.
#'
#' (1) A set of developmental stages is partitioned into three modules - early, mid, and late - based on prior biological knowledge.
#'
#' (2) The mean \code{\link{TAI}} or \code{\link{TDI}} value for each of the three modules T_early, T_mid, and T_late are computed.
#'
#' (3) The two differences \code{D1 = T_mid - T_early} and \code{D2 = T_mid - T_late} are calculated.
#'
#' (4) The minimum \code{D_min} of \code{D1} and \code{D2} is computed as final test statistic of the reductive hourglass test.
#'
#'
#' In order to determine the statistical significance of an observed minimum difference D_min
#' the following permutation test was performed. Based on the \code{\link{bootMatrix}} D_min
#' is calculated from each of the permuted \code{\link{TAI}} or \code{\link{TDI}} profiles,
#' approximated by a Gaussian distribution with method of moments estimated parameters returned by \code{\link[fitdistrplus]{fitdist}},
#' and the corresponding p-value is computed by \code{\link{pnorm}} given the estimated parameters of the Gaussian distribution.
#' The \emph{goodness of fit} for the random vector \emph{D_min} is statistically quantified by an Lilliefors (Kolmogorov-Smirnov) test for normality.
#'
#'
#' In case the parameter \emph{plotHistogram = TRUE}, a multi-plot is generated showing:
#'
#' (1) A Cullen and Frey skewness-kurtosis plot generated by \code{\link[fitdistrplus]{descdist}}.
#' This plot illustrates which distributions seem plausible to fit the resulting permutation vector D_min.
#' In the case of the \emph{Reverse Hourglass Test} a normal distribution seemed plausible.
#'
#' (2) A histogram of \code{D_min} combined with the density plot is plotted. D_min is then fitted by a normal distribution.
#' The corresponding parameters are estimated by \emph{moment matching estimation} using the \code{\link[fitdistrplus]{fitdist}} function.
#'
#' (3) A plot showing the p-values for N independent runs to verify that a specific p-value is biased by a specific permutation order.
#'
#' (4) A barplot showing the number of cases in which the underlying goodness of fit (returned by Lilliefors (Kolmogorov-Smirnov) test
#' for normality) has shown to be significant (\code{TRUE}) or not significant (\code{FALSE}).
#' This allows to quantify the permutation bias and their implications on the goodness of fit.
#' @return a list object containing the list elements:
#'
#' \code{p.value} : the p-value quantifying the statistical significance (low-high-low pattern) of the given phylotranscriptomics pattern.
#'
#' \code{std.dev} : the standard deviation of the N sampled phylotranscriptomics patterns for each developmental stage S.
#'
#' \code{lillie.test} : a boolean value specifying whether the \emph{Lillifors KS-Test} returned a p-value > 0.05,
#' which indicates that fitting the permuted scores with a normal distribution seems plausible.
#' @references
#'
#' Drost HG et al. (2015) Mol Biol Evol. 32 (5): 1221-1231 doi:10.1093/molbev/msv012
#'
#' Quint M et al. (2012). A transcriptomic hourglass in plant embryogenesis. Nature (490): 98-101.
#'
#' M. L. Delignette-Muller, R. Pouillot, J.-B. Denis and C. Dutang (2014), fitdistrplus: help to fit
#' of a parametric distribution to non-censored or censored data.
#'
#' Cullen AC and Frey HC (1999) Probabilistic techniques in exposure assessment. Plenum Press, USA, pp. 81-159.
#'
#' Evans M, Hastings N and Peacock B (2000) Statistical distributions. John Wiley and Sons Inc.
#'
#' Sokal RR and Rohlf FJ (1995) Biometry. W.H. Freeman and Company, USA, pp. 111-115.
#'
#' Juergen Gross and bug fixes by Uwe Ligges (2012). nortest: Tests for Normality. R package version
#' 1.0-2.
#'
#' http://CRAN.R-project.org/package=nortest
#'
#' Dallal, G.E. and Wilkinson, L. (1986): An analytic approximation to the distribution of Lilliefors' test for normality. The American Statistician, 40, 294-296.
#'
#' Stephens, M.A. (1974): EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69, 730-737.
#'
#' http://stackoverflow.com/questions/4290081/fitting-data-to-distributions?rq=1
#'
#' http://stats.stackexchange.com/questions/45033/can-i-use-kolmogorov-smirnov-test-and-estimate-distribution-parameters
#'
#' http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
#'
#' http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
#'
#' @author Hajk-Georg Drost
#' @seealso \code{\link{reversehourglassScore}}, \code{\link{bootMatrix}}, \code{\link{FlatLineTest}}, \code{\link{EarlyConservationTest}}, \code{\link{PlotSignature}}
#' @examples
#'
#' data(PhyloExpressionSetExample)
#'
#' # perform the reductive hourglass test for a PhyloExpressionSet
#' # here the prior biological knowledge is that stages 1-2 correspond to module 1 = early,
#' # stages 3-5 to module 2 = mid (phylotypic module), and stages 6-7 correspond to
#' # module 3 = late
#' ReverseHourglassTest(PhyloExpressionSetExample,
#'                        modules = list(early = 1:2, mid = 3:5, late = 6:7),
#'                        permutations = 1000)
#'
#'
#' # use your own permutation matrix based on which p-values (ReverseHourglassTest)
#' # shall be computed
#' custom_perm_matrix <- bootMatrix(PhyloExpressionSetExample,100)
#'
#' ReverseHourglassTest(PhyloExpressionSetExample,
#'                      modules = list(early = 1:2, mid = 3:5, late = 6:7),
#'                      custom.perm.matrix = custom_perm_matrix)
#'
#'
#' @import foreach
#' @export

ReverseHourglassTest <- function(ExpressionSet,
                                 modules            = NULL,
                                 permutations       = 1000,
                                 lillie.test        = FALSE,
                                 plotHistogram      = FALSE,
                                 runs               = 10,
                                 parallel           = FALSE,
                                 gof.warning        = FALSE,
                                 custom.perm.matrix = NULL)
{
        is.ExpressionSet(ExpressionSet)
        
        if (is.null(modules))
                stop(
                        "Please specify the three modules: early, mid, and late using the argument 'modules = list(early = ..., mid = ..., late = ...)'.",
                        call. = FALSE
                )
        
        if (any(table(unlist(modules)) > 1))
                stop("Intersecting modules are not defined for the ReductiveHourglassTest.",
                     call. = FALSE)
        
        if (length(modules) != 3)
                stop(
                        "Please specify three modules: early, mid, and late to perform the ReductiveHourglassTest.",
                        call. = FALSE
                )
        
        if (length(unlist(modules)) != (dim(ExpressionSet)[2] - 2))
                stop(
                        "The number of stages classified into the three modules does not match the total number of stages stored in the given ExpressionSet.",
                        call. = FALSE
                )
        
        
        nCols <- dim(ExpressionSet)[2]
        score_vector <-
                vector(mode = "numeric", length = permutations)
        resMatrix <- matrix(NA_real_, permutations, (nCols - 2))
        real_age <- vector(mode = "numeric", length = nCols - 2)
        real_age <-
                cpp_TAI(as.matrix(dplyr::select(
                        ExpressionSet, 3:ncol(ExpressionSet)
                )), as.vector(unlist(dplyr::select(
                        ExpressionSet, 1
                ))))
        ### compute the real reductive hourglass scores of the observed phylotranscriptomics pattern
        real_score <- reversehourglassScore(
                real_age,
                early         = modules[[1]],
                mid           = modules[[2]],
                late          = modules[[3]],
                method        = "min",
                scoringMethod = "mean-mean",
                profile.warn=T
        )
        options(warn=1)
        
        ### compute the bootstrap matrix
        if (is.null(custom.perm.matrix)) {
                resMatrix <-
                        cpp_bootMatrix(
                                as.matrix(dplyr::select(
                                        ExpressionSet, 3:ncol(ExpressionSet)
                                )),
                                as.vector(unlist(
                                        dplyr::select(ExpressionSet, 1)
                                )),
                                as.numeric(permutations)
                        )
        }
        
        else if (!is.null(custom.perm.matrix)) {
                resMatrix <- custom.perm.matrix
        }
        
        ### compute the global phylotranscriptomics destruction scores foe each sampled age vector
        score_vector <-
                apply(
                        resMatrix,
                        1 ,
                        reversehourglassScore,
                        early = modules[[1]],
                        mid = modules[[2]],
                        late = modules[[3]],
                        method = "min",
                        scoringMethod = "mean-mean"
                )
        
        
        
        # parameter estimators using MASS::fitdistr
        param <-
                fitdistrplus::fitdist(score_vector, "norm", method = "mme")
        mu <- param$estimate[1]
        sigma <- param$estimate[2]
        
        if (plotHistogram == TRUE) {
                # plot histogram of scores
                normDensity <- function(x) {
                        return(stats::dnorm(x, mu, sigma))
                        
                }
                
                if (lillie.test)
                        graphics::par(mfrow = c(2, 2))
                if (!lillie.test)
                        graphics::par(mfrow = c(1, 3))
                
                fitdistrplus::descdist(score_vector, boot = permutations)
                
                graphics::curve(
                        expr = normDensity,
                        xlim = c(
                                min(score_vector, real_score),
                                max(score_vector, real_score)
                        ),
                        col  = "steelblue",
                        lwd  = 5,
                        xlab = "Scores",
                        ylab = "Frequency"
                )
                
                graphics::hist(
                        x      = score_vector,
                        prob   = TRUE,
                        add    = TRUE,
                        breaks = permutations / (0.01 * permutations)
                )
                
                graphics::rug(score_vector)
                # plot a red line at the position where we can find the real rh value
                graphics::abline(v = real_score,
                                 lwd = 5,
                                 col = "darkred")
                
                p.vals_vec <-
                        vector(mode = "numeric", length = runs)
                lillie_vec <-
                        vector(mode = "logical", length = runs)
                rht <- vector(mode = "list", length = 3)
                
                cat("\n")
                
                if (parallel) {
                        ### Parallellizing the sampling process using the 'doParallel' and 'parallel' package
                        ### register all given cores for parallelization
                        par_cores <-
                                parallel::makeForkCluster(parallel::detectCores())
                        doParallel::registerDoParallel(par_cores)
                        
                        # perform the sampling process in parallel
                        parallel_results <-
                                foreach::foreach(
                                        i              = 1:runs,
                                        .combine       = "rbind",
                                        .errorhandling = "stop"
                                ) %dopar% {
                                        data.frame(
                                                ReverseHourglassTest(
                                                        ExpressionSet = ExpressionSet,
                                                        permutations  = permutations,
                                                        lillie.test   = TRUE,
                                                        plotHistogram = FALSE,
                                                        modules       = modules
                                                )[c(1, 3)]
                                        )
                                        
                                }
                        
                        parallel::stopCluster(par_cores)
                        
                        colnames(parallel_results) <-
                                c("p.value", "lillie.test")
                        
                        p.vals_vec <- parallel_results[ , "p.value"]
                        lillie_vec <-
                                parallel_results[ , "lillie.test"]
                        
                }
                
                if (!parallel) {
                        # sequential computations of p-values
                        #                         if(runs >= 10){
                        #                                 # initializing the progress bar
                        #                                 progressBar <- txtProgressBar(min = 1,max = runs,style = 3)
                        #
                        #                         }
                        
                        for (i in 1:runs) {
                                if (lillie.test) {
                                        rht <- ReverseHourglassTest(
                                                ExpressionSet = ExpressionSet,
                                                permutations  = permutations,
                                                lillie.test   = TRUE,
                                                plotHistogram = FALSE,
                                                modules       = list(
                                                        early = modules[[1]],
                                                        mid = modules[[2]],
                                                        late = modules[[3]]
                                                ),
                                                runs          = NULL
                                        )
                                }
                                
                                if (!lillie.test) {
                                        rht <- ReverseHourglassTest(
                                                ExpressionSet        = ExpressionSet,
                                                permutations         = permutations,
                                                lillie.test          = FALSE,
                                                plotHistogram        = FALSE,
                                                modules = list(
                                                        early = modules[[1]],
                                                        mid = modules[[2]],
                                                        late = modules[[3]]
                                                ),
                                                runs                 = NULL
                                        )
                                }
                                
                                p.vals_vec[i] <- rht$p.value
                                
                                if (lillie.test)
                                        lillie_vec[i] <-
                                        rht$lillie.test
                                
                                #                                 if(runs >= 10){
                                #                                         # printing out the progress
                                #                                         setTxtProgressBar(progressBar,i)
                                #                                 }
                        }
                }
                
                #cat("\n")
                
                graphics::plot(
                        x    = p.vals_vec,
                        type = "l" ,
                        lwd  = 6,
                        ylim = c(0, 1),
                        col  = "darkblue",
                        xlab = "Runs",
                        ylab = "p-value"
                )
                
                graphics::abline(h = 0.05,
                                 lty = 2,
                                 lwd = 3)
                
                if (lillie.test) {
                        tbl <- table(factor(lillie_vec, levels = c("FALSE", "TRUE")))
                        
                        graphics::barplot(
                                height    = tbl / sum(tbl),
                                beside    = TRUE,
                                names.arg = c("FALSE", "TRUE"),
                                ylab      = "Relative Frequency",
                                main      = paste0("runs = ", runs)
                        )
                }
        }
        
        
        #if(real_score >= 0)
        pval <-
                stats::pnorm(
                        real_score,
                        mean = mu,
                        sd = sigma,
                        lower.tail = FALSE
                )
        
        #if(real_score < 0)
        #pval <- pnorm(real_score,mean=mu,sd=sigma,lower.tail=TRUE)
        ### computing the standard deviation of the sampled TAI values for each stage separately
        sd_vals <- vector(mode = "numeric", length = nCols - 2)
        sd_vals <- apply(resMatrix, 2, stats::sd)
        
        if (lillie.test) {
                # perform Lilliefors K-S-Test
                lillie_p.val <-
                        nortest::lillie.test(score_vector)$p.value
                # does the Lilliefors test pass the criterion
                lillie_bool <- (lillie_p.val > 0.05)
                
                if (gof.warning &
                    (lillie_p.val < 0.05) &
                    (plotHistogram == FALSE)) {
                        warning(
                                "Lilliefors (Kolmogorov-Smirnov) test for normality did not pass the p > 0.05 criterion!"
                        )
                }
        }
        
        if (lillie.test)
                return(list(
                        p.value = pval,
                        std.dev = sd_vals,
                        lillie.test = lillie_bool
                ))
        if (!lillie.test)
                return(list(
                        p.value = pval,
                        std.dev = sd_vals,
                        lillie.test = NA
                ))
}