#' @title Perform Flat Line Test
#' @description This function quantifies the statistical significance of an observed phylotranscriptomic pattern. In detail, the \emph{Flat Line Test} quantifies any significant deviation of an observed phylotranscriptomic pattern from a flat line.
#' @param ExpressionSet a standard PhyloExpressionSet or DivergenceExpressionSet object.
#' @param permutations a numeric value specifying the number of permutations that shall be performed for the \emph{FlatLineTest}.
#' @param plotHistogram a logical value indicating whether a detailed statistical analysis concerning the goodness of fit should be performed.
#' @param runs specify the number of runs to be performed for goodness of fit computations. In most cases runs = 100 is a reasonable choice.
#' @param parallel performing \code{runs} in parallel (takes all cores of your multicore machine).
#' @param custom.perm.matrix a custom \code{\link{bootMatrix}} (permutation matrix) to perform the underlying test statistic. Default is \code{custom.perm.matrix = NULL}.
#' @details Internally the function performs N phylotranscriptomics pattern computations (\code{\link{TAI}} or \code{\link{TDI}}) based on sampled PhyloExpressionSets or DivergenceExpressionSets (see \code{\link{bootMatrix}}).
#' The test statistics is being developed as follows:
#'
#' The variance \emph{V_pattern} of the S phylotranscriptomics values defines the test statistic for the \code{\link{FlatLineTest}}.
#' The basic assumption is, that the variance of a flat line should be equivalent to zero for a perfect flat line.
#' Any deviation from a flat line can be measured with a variance value > 0.
#'
#' To determine the null distribution of \emph{V_p}, all PS or DS values within each developmental stage s are randomly permuted,
#' S surrogate phylotranscriptomics values are computed from this permuted dataset, and a surrogate value of \emph{V_p} is
#' computed from these S phylotranscriptomics values. This permutation process is repeated N times, yielding a histogram of \emph{V_p}.
#'
#' After applying a \emph{Lilliefors Kolmogorov-Smirnov Test for gamma distribution}, \emph{V_p} is approximated by a \emph{gamma distribution}.
#' The two parameters of the \emph{gamma distribution} are estimated by the function \code{\link[fitdistrplus]{fitdist}} from the \pkg{fitdistrplus} package by \emph{moment matching estimation}.
#' The fitted \emph{gamma distribution} is considered the null distribution of \emph{V_pattern}, and the p-value of the observed value of \emph{V_p}
#' is computed from this null distribution.
#'
#' In case the parameter \emph{plotHistogram = TRUE}, a multi-plot is generated showing:
#'
#' (1) A Cullen and Frey skewness-kurtosis plot generated by \code{\link[fitdistrplus]{descdist}}).
#'
#' (2) A histogram of V_p combined with the density plot using the Method of Moments estimated parameters returned by the \code{\link[fitdistrplus]{fitdist}} function using a gamma distribution.
#'
#' (3) A plot showing the p-values for N independent runs to verify that a specific p-value is biased by a specific permutation order.
#'
#' The \emph{goodness of fit} for the random vector \emph{V_p} is quantified statistically by an adapted Lilliefors (Kolmogorov-Smirnov) test for gamma distributions.
#' @return a list object containing the list elements:
#' \itemize{
#' \item \code{p.value} the p-value quantifying the statistical significance (deviation from a flat line) of the given phylotranscriptomics pattern.
#' \item \code{std.dev} the standard deviation of the N sampled phylotranscriptomics patterns for each developmental stage S.
#' }
#' @references
#'
#' Drost HG et al.(2015). \emph{Evidence for Active Maintenance of Phylotranscriptomic Hourglass Patterns in Animal and Plant Embryogenesis}. Mol Biol Evol. 32 (5): 1221-1231 doi:10.1093/molbev/msv012.
#'
#' Quint M et al. (2012). A transcriptomic hourglass in plant embryogenesis. Nature (490): 98-101.
#'
#' M. L. Delignette-Muller, R. Pouillot, J.-B. Denis and C. Dutang (2014), fitdistrplus: help to fit of a parametric distribution to non-censored or censored data.
#'
#' Cullen AC and Frey HC (1999) Probabilistic techniques in exposure assessment. Plenum Press, USA, pp. 81-159.
#'
#' Evans M, Hastings N and Peacock B (2000) Statistical distributions. John Wiley and Sons Inc.
#'
#' Sokal RR and Rohlf FJ (1995) Biometry. W.H. Freeman and Company, USA, pp. 111-115.
#'
#' Juergen Gross and bug fixes by Uwe Ligges (2012). nortest: Tests for Normality. R package version 1.0-2.
#'
#' http://CRAN.R-project.org/package=nortest
#'
#' Dallal, G.E. and Wilkinson, L. (1986): An analytic approximation to the distribution of Lilliefors test for normality. The American Statistician, 40, 294-296.
#'
#' Stephens, M.A. (1974): EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69, 730-737.
#'
#' http://stackoverflow.com/questions/4290081/fitting-data-to-distributions?rq=1
#'
#' http://stats.stackexchange.com/questions/45033/can-i-use-kolmogorov-smirnov-test-and-estimate-distribution-parameters
#'
#' http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
#'
#' http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
#'
#' @author Hajk-Georg Drost
#' @note In case there are extreme outlier expression values stored in the dataset (PhyloExpressionSet or DivergenceExpressionSet),
#' the internal \code{\link[fitdistrplus]{fitdist}} function that is based on the \code{\link{bootMatrix}} output might return a warning:
#' "In densfun(x, parm[1], parm[2], ...) : NaNs were produced" which indicates that permutation results caused by extreme outlier expression values
#' that could not be fitted accordingly. This warning will not be printed out when the corresponding outlier values are extracted from the dataset.
#' @seealso \code{\link{TAI}}, \code{\link{TDI}}, \code{\link{PlotPattern}}, \code{\link{bootMatrix}}
#' @examples
#'
#' # read standard phylotranscriptomics data
#' data(PhyloExpressionSetExample)
#'
#' # example PhyloExpressionSet using 1000 permutations
#' FlatLineTest(PhyloExpressionSetExample,
#' permutations = 1000,
#' plotHistogram = FALSE)
#'
#' # use your own permutation matrix based on which p-values (FlatLineTest)
#' # shall be computed
#' custom_perm_matrix <- bootMatrix(PhyloExpressionSetExample,100)
#'
#' FlatLineTest(PhyloExpressionSetExample,
#' custom.perm.matrix = custom_perm_matrix)
#'
#' @import foreach
#' @export
FlatLineTest <- function(ExpressionSet,
permutations = 1000,
plotHistogram = FALSE,
runs = 10,
parallel = FALSE,
custom.perm.matrix = NULL)
{
is.ExpressionSet(ExpressionSet)
if(plotHistogram & is.null(runs))
stop("Please specify the number of runs to be performed for the goodness of fit computations.")
nCols <- dim(ExpressionSet)[2]
resMatrix <- matrix(NA_real_, permutations,(nCols-2))
var_values <- vector(mode = "numeric", length = permutations)
sd_values <- vector(mode = "numeric",length = nCols-2)
#random_mean_age <- vector(mode = "numeric", length = permutations)
age.real <- vector(mode = "numeric",length = nCols-2)
age.real <- cpp_TAI(as.matrix(ExpressionSet[ , 3:nCols]),as.vector(ExpressionSet[ , 1]))
### compute the real variance of TAIs of the observed TAI/TDI-Hourglass pattern
real.var <- stats::var(age.real)
### sample only the phylostrata (row-permutations)
if (is.null(custom.perm.matrix)){
resMatrix <- cpp_bootMatrix(as.matrix(ExpressionSet[ , 3:nCols]),
as.vector(ExpressionSet[ , 1]),
as.numeric(permutations))
}
else if (!is.null(custom.perm.matrix)){
resMatrix <- custom.perm.matrix
}
var_values <- apply(resMatrix,1,stats::var)
#random_mean_age <- apply(resMatrix,2,mean)
### estimate the parameters (shape,rate)
### of the gamma distributed variance values
### using: method of moments estimation
gamma_MME <- fitdistrplus::fitdist(var_values,"gamma",method = "mme")
### estimate shape:
shape <- gamma_MME$estimate[1]
### estimate the rate:
rate <- gamma_MME$estimate[2]
if(plotHistogram){
gammaDensity <- function(x){
return(stats::dgamma(x = x,shape = shape,rate = rate))
}
graphics::par(mfrow = c(1,3))
# plot a Cullen and Frey graph
fitdistrplus::descdist(var_values, boot = permutations)
# plot the histogram and the fitted curve
graphics::curve( expr = gammaDensity,
xlim = c(min(var_values),max(c(var_values,real.var))),
col = "steelblue",
lwd = 5,
xlab = "Variances",
ylab = "Frequency",
main = paste0("permutations = ",permutations) )
histogram <- graphics::hist( x = var_values,
prob = TRUE,
add = TRUE,
breaks = permutations / (0.01 * permutations) )
graphics::rug(var_values)
graphics::abline(v = real.var, lty = 1, lwd = 4, col = "darkred")
p.vals_vec <- vector(mode = "numeric", length = runs)
if(parallel){
### Parallellizing the sampling process using the 'doParallel' and 'parallel' package
### register all given cores for parallelization
par_cores <- parallel::makeForkCluster(parallel::detectCores())
doParallel::registerDoParallel(par_cores)
### Perform the sampling process in parallel
p.vals_vec <- as.vector(foreach::foreach(i = 1:runs,
.combine = "c",
.errorhandling = "stop") %dopar%
{
FlatLineTest( ExpressionSet = ExpressionSet,
permutations = permutations )$p.value
}
)
parallel::stopCluster(par_cores)
}
if(!parallel){
# sequential computations of p-values
# initializing the progress bar
#progressBar <- txtProgressBar(min = 1,max = runs,style = 3)
for(i in 1:runs){
p.vals_vec[i] <- FlatLineTest( ExpressionSet = ExpressionSet,
permutations = permutations )$p.value
# printing out the progress
#setTxtProgressBar(progressBar,i)
}
}
graphics::plot( x = p.vals_vec,
type = "l" ,
lwd = 6,
ylim = c(0,1),
col = "darkblue",
xlab = "Runs",
ylab = "P-value",
main = paste0("runs = ",runs) )
graphics::abline(h = 0.05, lty = 2, lwd = 3, col = "darkred")
}
pval <- stats::pgamma(real.var,shape = shape,rate = rate,lower.tail = FALSE)
sd_values <- apply(resMatrix,2,stats::sd)
#cat("\n")
return(list(p.value = pval,std.dev = sd_values))
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.