myTAI: Evolutionary Transcriptomics

#' @title Perform Flat Line Test
#' @description This function quantifies the statistical significance of an observed phylotranscriptomic pattern. In detail, the \emph{Flat Line Test} quantifies any significant deviation of an observed phylotranscriptomic pattern from a flat line.  
#' @param ExpressionSet a standard PhyloExpressionSet or DivergenceExpressionSet object.
#' @param permutations a numeric value specifying the number of permutations that shall be performed for the \emph{FlatLineTest}.
#' @param plotHistogram a logical value indicating whether a detailed statistical analysis concerning the goodness of fit should be performed.
#' @param runs specify the number of runs to be performed for goodness of fit computations. In most cases runs = 100 is a reasonable choice.
#' @param parallel performing \code{runs} in parallel (takes all cores of your multicore machine).
#' @param custom.perm.matrix a custom \code{\link{bootMatrix}} (permutation matrix) to perform the underlying test statistic. Default is \code{custom.perm.matrix = NULL}.
#' @details Internally the function performs N phylotranscriptomics pattern computations (\code{\link{TAI}} or \code{\link{TDI}}) based on sampled PhyloExpressionSets or DivergenceExpressionSets (see \code{\link{bootMatrix}}). 
#' The test statistics is being developed as follows:
#'
#' The variance \emph{V_pattern} of the S phylotranscriptomics values defines the test statistic for the \code{\link{FlatLineTest}}. 
#' The basic assumption is, that the variance of a flat line should be equivalent to zero for a perfect flat line. 
#' Any deviation from a flat line can be measured with a variance value > 0. 
#'
#' To determine the null distribution of \emph{V_p}, all PS or DS values within each developmental stage s are randomly permuted,
#'  S surrogate phylotranscriptomics values are computed from this permuted dataset, and a surrogate value of \emph{V_p} is
#'   computed from these S phylotranscriptomics values. This permutation process is repeated N times, yielding a histogram of \emph{V_p}. 
#'
#' After applying a \emph{Lilliefors Kolmogorov-Smirnov Test for gamma distribution}, \emph{V_p} is approximated by a \emph{gamma distribution}. 
#' The two parameters of the \emph{gamma distribution} are estimated by the function \code{\link[fitdistrplus]{fitdist}} from the \pkg{fitdistrplus} package by \emph{moment matching estimation}. 
#' The fitted \emph{gamma distribution} is considered the null distribution of \emph{V_pattern}, and the p-value of the observed value of \emph{V_p} 
#' is computed from this null distribution.
#'
#' In case the parameter \emph{plotHistogram = TRUE}, a multi-plot is generated showing:
#'        
#' (1) A Cullen and Frey skewness-kurtosis plot generated by \code{\link[fitdistrplus]{descdist}}).
#'
#' (2) A histogram of V_p combined with the density plot using the Method of Moments estimated parameters returned by the \code{\link[fitdistrplus]{fitdist}} function using a gamma distribution.
#'
#' (3) A plot showing the p-values for N independent runs to verify that a specific p-value is biased by a specific permutation order.
#'
#' The \emph{goodness of fit} for the random vector \emph{V_p} is quantified statistically by an adapted Lilliefors (Kolmogorov-Smirnov) test for gamma distributions.
#' @return a list object containing the list elements:
#' \itemize{
#' \item \code{p.value} the p-value quantifying the statistical significance (deviation from a flat line) of the given phylotranscriptomics pattern.
#' \item \code{std.dev} the standard deviation of the N sampled phylotranscriptomics patterns for each developmental stage S.
#' }
#' @references 
#' 
#' Drost HG et al.(2015). \emph{Evidence for Active Maintenance of Phylotranscriptomic Hourglass Patterns in Animal and Plant Embryogenesis}. Mol Biol Evol. 32 (5): 1221-1231 doi:10.1093/molbev/msv012.
#' 
#' Quint M et al. (2012). A transcriptomic hourglass in plant embryogenesis. Nature (490): 98-101.
#'
#' M. L. Delignette-Muller, R. Pouillot, J.-B. Denis and C. Dutang (2014), fitdistrplus: help to fit of a parametric distribution to non-censored or censored data.
#'
#' Cullen AC and Frey HC (1999) Probabilistic techniques in exposure assessment. Plenum Press, USA, pp. 81-159.
#'
#' Evans M, Hastings N and Peacock B (2000) Statistical distributions. John Wiley and Sons Inc.
#'
#' Sokal RR and Rohlf FJ (1995) Biometry. W.H. Freeman and Company, USA, pp. 111-115.
#'
#' Juergen Gross and bug fixes by Uwe Ligges (2012). nortest: Tests for Normality. R package version 1.0-2. 
#' 
#' http://CRAN.R-project.org/package=nortest
#'
#' Dallal, G.E. and Wilkinson, L. (1986): An analytic approximation to the distribution of Lilliefors test for normality. The American Statistician, 40, 294-296.
#'
#' Stephens, M.A. (1974): EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69, 730-737.
#'
#' http://stackoverflow.com/questions/4290081/fitting-data-to-distributions?rq=1
#'
#' http://stats.stackexchange.com/questions/45033/can-i-use-kolmogorov-smirnov-test-and-estimate-distribution-parameters
#' 
#' http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
#'
#' http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
#'
#' @author Hajk-Georg Drost
#' @note In case there are extreme outlier expression values stored in the dataset (PhyloExpressionSet or DivergenceExpressionSet),
#'  the internal \code{\link[fitdistrplus]{fitdist}} function that is based on the \code{\link{bootMatrix}} output might return a warning:
#'   "In densfun(x, parm[1], parm[2], ...) : NaNs were produced" which indicates that permutation results caused by extreme outlier expression values 
#'   that could not be fitted accordingly. This warning will not be printed out when the corresponding outlier values are extracted from the dataset.
#' @seealso \code{\link{TAI}}, \code{\link{TDI}}, \code{\link{PlotPattern}}, \code{\link{bootMatrix}}
#' @examples
#' 
#' # read standard phylotranscriptomics data
#' data(PhyloExpressionSetExample)
#'
#' # example PhyloExpressionSet using 1000 permutations
#' FlatLineTest(PhyloExpressionSetExample, 
#'              permutations  = 1000, 
#'              plotHistogram = FALSE)
#' 
#' # use your own permutation matrix based on which p-values (FlatLineTest)
#' # shall be computed
#' custom_perm_matrix <- bootMatrix(PhyloExpressionSetExample,100)
#' 
#' FlatLineTest(PhyloExpressionSetExample,
#'              custom.perm.matrix = custom_perm_matrix)
#' 
#' @import foreach
#' @export
FlatLineTest <- function(ExpressionSet, 
                         permutations       = 1000, 
                         plotHistogram      = FALSE, 
                         runs               = 10, 
                         parallel           = FALSE,
                         custom.perm.matrix = NULL)
{
        
        is.ExpressionSet(ExpressionSet)
        
        if(plotHistogram & is.null(runs))
                stop("Please specify the number of runs to be performed for the goodness of fit computations.")
        
        nCols <- dim(ExpressionSet)[2]
        resMatrix <- matrix(NA_real_, permutations,(nCols-2))
        var_values <- vector(mode = "numeric", length = permutations)
        sd_values <- vector(mode = "numeric",length = nCols-2)
        #random_mean_age <- vector(mode = "numeric", length = permutations)
        age.real <- vector(mode = "numeric",length = nCols-2)
        age.real <- cpp_TAI(as.matrix(ExpressionSet[ , 3:nCols]),as.vector(ExpressionSet[ , 1]))
        ### compute the real variance of TAIs of the observed TAI/TDI-Hourglass pattern
        real.var <- stats::var(age.real)
        ### sample only the phylostrata (row-permutations) 
        
        if (is.null(custom.perm.matrix)){
                resMatrix <- cpp_bootMatrix(as.matrix(ExpressionSet[ , 3:nCols]),
                                            as.vector(ExpressionSet[ , 1]),
                                            as.numeric(permutations))
                
        }
        
        else if (!is.null(custom.perm.matrix)){
                
                resMatrix <- custom.perm.matrix
        }
        
        var_values <- apply(resMatrix,1,stats::var)
        #random_mean_age <- apply(resMatrix,2,mean)
        ### estimate the parameters (shape,rate) 
        ### of the gamma distributed variance values
        ### using: method of moments estimation
        gamma_MME <- fitdistrplus::fitdist(var_values,"gamma",method = "mme")
        ### estimate shape:
        shape <- gamma_MME$estimate[1]
        ### estimate the rate:
        rate <- gamma_MME$estimate[2]
        
        if(plotHistogram){
                
                gammaDensity <- function(x){
                        
                        return(stats::dgamma(x = x,shape = shape,rate = rate))
                        
                }
                
                graphics::par(mfrow = c(1,3))
                # plot a Cullen and Frey graph
                fitdistrplus::descdist(var_values, boot = permutations)
                # plot the histogram and the fitted curve
                graphics::curve( expr = gammaDensity,
                                 xlim = c(min(var_values),max(c(var_values,real.var))),
                                 col  = "steelblue",
                                 lwd  = 5,
                                 xlab = "Variances",
                                 ylab = "Frequency", 
                                 main = paste0("permutations = ",permutations) )
                
                histogram <- graphics::hist( x      = var_values,
                                             prob   = TRUE,
                                             add    = TRUE, 
                                             breaks = permutations / (0.01 * permutations) )
                graphics::rug(var_values)
                
                graphics::abline(v = real.var, lty = 1, lwd = 4, col = "darkred")
                
                p.vals_vec <- vector(mode = "numeric", length = runs)
                
                if(parallel){
                        
                        ### Parallellizing the sampling process using the 'doParallel' and 'parallel' package
                        ### register all given cores for parallelization
                        par_cores <- parallel::makeForkCluster(parallel::detectCores())
                        doParallel::registerDoParallel(par_cores)
                        
                        ### Perform the sampling process in parallel
                        p.vals_vec <- as.vector(foreach::foreach(i              = 1:runs,
                                                                 .combine       = "c",
                                                                 .errorhandling = "stop") %dopar% 
                                                        {
                                                                FlatLineTest( ExpressionSet = ExpressionSet,
                                                                              permutations = permutations )$p.value
                                                        
                                                        }
                                                )
                        
                        parallel::stopCluster(par_cores)
                        
                }
                
                if(!parallel){
                        # sequential computations of p-values 
                        # initializing the progress bar
                        #progressBar <- txtProgressBar(min = 1,max = runs,style = 3)
                                
                        
                        for(i in 1:runs){
                                
                                p.vals_vec[i] <- FlatLineTest( ExpressionSet = ExpressionSet,
                                                               permutations = permutations )$p.value
                                
                                # printing out the progress
                                #setTxtProgressBar(progressBar,i)
                        }
                }
                
                graphics::plot( x    = p.vals_vec,
                                type = "l" , 
                                lwd  = 6, 
                                ylim = c(0,1), 
                                col  = "darkblue", 
                                xlab = "Runs", 
                                ylab = "P-value", 
                                main = paste0("runs = ",runs) )
                
                graphics::abline(h = 0.05, lty = 2, lwd = 3, col = "darkred")
                
        }
        
        pval <- stats::pgamma(real.var,shape = shape,rate = rate,lower.tail = FALSE)
        sd_values <- apply(resMatrix,2,stats::sd)
        
        #cat("\n")
        return(list(p.value = pval,std.dev = sd_values))
}