#' @title Perform Flat Line Test
#' @description This function quantifies the statistical significance of an observed phylotranscriptomic pattern. In detail, the \emph{Flat Line Test} quantifies any significant deviation of an observed phylotranscriptomic pattern from a flat line.
#' @param ExpressionSet a standard PhyloExpressionSet or DivergenceExpressionSet object.
#' @param permutations a numeric value specifying the number of permutations that shall be performed for the \emph{FlatLineTest}.
#' @param plotHistogram a logical value indicating whether a detailed statistical analysis concerning the goodness of fit should be performed.
#' @param runs specify the number of runs to be performed for goodness of fit computations. In most cases runs = 100 is a reasonable choice.
#' @param parallel performing \code{runs} in parallel (takes all cores of your multicore machine).
#' @param custom.perm.matrix a custom \code{\link{bootMatrix}} (permutation matrix) to perform the underlying test statistic. Default is \code{custom.perm.matrix = NULL}.
#' @details Internally the function performs N phylotranscriptomics pattern computations (\code{\link{TAI}} or \code{\link{TDI}}) based on sampled PhyloExpressionSets or DivergenceExpressionSets (see \code{\link{bootMatrix}}).
#' The test statistics is being developed as follows:
#'
#' The variance \emph{V_pattern} of the S phylotranscriptomics values defines the test statistic for the \code{\link{FlatLineTest}}.
#' The basic assumption is, that the variance of a flat line should be equivalent to zero for a perfect flat line.
#' Any deviation from a flat line can be measured with a variance value > 0.
#'
#' To determine the null distribution of \emph{V_p}, all PS or DS values within each developmental stage s are randomly permuted,
#' S surrogate phylotranscriptomics values are computed from this permuted dataset, and a surrogate value of \emph{V_p} is
#' computed from these S phylotranscriptomics values. This permutation process is repeated N times, yielding a histogram of \emph{V_p}.
#'
#' After applying a \emph{Lilliefors Kolmogorov-Smirnov Test for gamma distribution}, \emph{V_p} is approximated by a \emph{gamma distribution}.
#' The two parameters of the \emph{gamma distribution} are estimated by the function \code{\link[fitdistrplus]{fitdist}} from the \pkg{fitdistrplus} package by \emph{moment matching estimation}.
#' The fitted \emph{gamma distribution} is considered the null distribution of \emph{V_pattern}, and the p-value of the observed value of \emph{V_p}
#' is computed from this null distribution.
#'
#' In case the parameter \emph{plotHistogram = TRUE}, a multi-plot is generated showing:
#'
#' (1) A Cullen and Frey skewness-kurtosis plot generated by \code{\link[fitdistrplus]{descdist}}).
#'
#' (2) A histogram of V_p combined with the density plot using the Method of Moments estimated parameters returned by the \code{\link[fitdistrplus]{fitdist}} function using a gamma distribution.
#'
#' (3) A plot showing the p-values for N independent runs to verify that a specific p-value is biased by a specific permutation order.
#'
#' The \emph{goodness of fit} for the random vector \emph{V_p} is quantified statistically by an adapted Lilliefors (Kolmogorov-Smirnov) test for gamma distributions.
#' @return a list object containing the list elements:
#' \itemize{
#' \item \code{p.value} the p-value quantifying the statistical significance (deviation from a flat line) of the given phylotranscriptomics pattern.
#' \item \code{std.dev} the standard deviation of the N sampled phylotranscriptomics patterns for each developmental stage S.
#' \item \code{std.dev} the Kolmogorov-Smirnov test satistics for fitting a gamma distribution to the variances of the dataset with permuted phylostrata.
#' }
#' @references
#'
#' Drost HG et al. (2015) Mol Biol Evol. 32 (5): 1221-1231 doi:10.1093/molbev/msv012
#'
#' Quint M et al. (2012). A transcriptomic hourglass in plant embryogenesis. Nature (490): 98-101.
#'
#' M. L. Delignette-Muller, R. Pouillot, J.-B. Denis and C. Dutang (2014), fitdistrplus: help to fit of a parametric distribution to non-censored or censored data.
#'
#' Cullen AC and Frey HC (1999) Probabilistic techniques in exposure assessment. Plenum Press, USA, pp. 81-159.
#'
#' Evans M, Hastings N and Peacock B (2000) Statistical distributions. John Wiley and Sons Inc.
#'
#' Sokal RR and Rohlf FJ (1995) Biometry. W.H. Freeman and Company, USA, pp. 111-115.
#'
#' Juergen Gross and bug fixes by Uwe Ligges (2012). nortest: Tests for Normality. R package version 1.0-2.
#'
#' http://CRAN.R-project.org/package=nortest
#'
#' Dallal, G.E. and Wilkinson, L. (1986): An analytic approximation to the distribution of Lilliefors test for normality. The American Statistician, 40, 294-296.
#'
#' Stephens, M.A. (1974): EDF statistics for goodness of fit and some comparisons. Journal of the American Statistical Association, 69, 730-737.
#'
#' http://stackoverflow.com/questions/4290081/fitting-data-to-distributions?rq=1
#'
#' http://stats.stackexchange.com/questions/45033/can-i-use-kolmogorov-smirnov-test-and-estimate-distribution-parameters
#'
#' http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
#'
#' http://cran.r-project.org/doc/contrib/Ricci-distributions-en.pdf
#'
#' @author Hajk-Georg Drost
#' @note In case there are extreme outlier expression values stored in the dataset (PhyloExpressionSet or DivergenceExpressionSet),
#' the internal \code{\link[fitdistrplus]{fitdist}} function that is based on the \code{\link{bootMatrix}} output might return a warning:
#' "In densfun(x, parm[1], parm[2], ...) : NaNs were produced" which indicates that permutation results caused by extreme outlier expression values
#' that could not be fitted accordingly. This warning will not be printed out when the corresponding outlier values are extracted from the dataset.
#' @seealso \code{\link{TAI}}, \code{\link{TDI}}, \code{\link{PlotSignature}}, \code{\link{bootMatrix}}
#' @examples
#'
#' # read standard phylotranscriptomics data
#' data(PhyloExpressionSetExample)
#'
#' # example PhyloExpressionSet using 100 permutations
#' FlatLineTest(PhyloExpressionSetExample,
#' permutations = 100,
#' plotHistogram = FALSE)
#'
#' # use your own permutation matrix based on which p-values (FlatLineTest)
#' # shall be computed
#' custom_perm_matrix <- bootMatrix(PhyloExpressionSetExample,100)
#'
#' FlatLineTest(PhyloExpressionSetExample,
#' custom.perm.matrix = custom_perm_matrix)
#'
#' @import foreach
#' @export
#'
#'
#'
FlatLineTest <- function(ExpressionSet,
permutations = 10000,
plotHistogram = FALSE,
runs = 10,
parallel = FALSE,
custom.perm.matrix = NULL)
{
is.ExpressionSet(ExpressionSet)
if (plotHistogram & is.null(runs))
stop(
"Please specify the number of runs to be performed for the goodness of fit computations.",
call. = FALSE
)
nCols <- dim(ExpressionSet)[2]
resMatrix <- matrix(NA_real_, permutations, (nCols - 2))
var_values <- vector(mode = "numeric", length = permutations)
sd_values <-vector(mode = "numeric", length = nCols - 2)
#random_mean_age <- vector(mode = "numeric", length = permutations)
age.real <- vector(mode = "numeric", length = nCols - 2)
age.real <-
cpp_TAI(as.matrix(dplyr::select(ExpressionSet, 3:ncol(ExpressionSet))), as.vector(unlist(dplyr::select(ExpressionSet, 1))))
### compute the real variance of TAIs of the observed TAI/TDI-Hourglass pattern
real.var <- stats::var(age.real)
### sample only the phylostrata (row-permutations)
if (is.null(custom.perm.matrix)) {
start_time = Sys.time()
resMatrix <-
cpp_bootMatrix(as.matrix(dplyr::select(
ExpressionSet, 3:ncol(ExpressionSet)
)),
as.vector(unlist(dplyr::select(
ExpressionSet, 1
))),
as.numeric(permutations))
end_time = Sys.time()
cat("\n")
cat(paste("Total runtime of your permutation test:", round(end_time - start_time, 3), " seconds."))
}
else if (!is.null(custom.perm.matrix)) {
resMatrix <- custom.perm.matrix
}
var_values <- apply(resMatrix, 1, stats::var)
#print(sum(var_values > real.var)/length(var_values))
#random_mean_age <- apply(resMatrix,2,mean)
### estimate the parameters (shape,rate)
### of the gamma distributed variance values
### using: method of moments estimation
gamma_MME = GetGamma(var_values, permutations)
### estimate shape:
shape <- gamma_MME[[1]]
### estimate the rate:
rate <- gamma_MME[[2]]
ks_test = gamma_MME[[3]]
# in case the fitting fails
if (shape == 0 & rate == 0) {
gamma = fitdistrplus::fitdist(var_values, "gamma", method = "mme")
shape = gamma$estimate[1]
rate = gamma$estimate[2]
suppressWarnings(ks_test <- stats::ks.test(var_values, "pgamma", shape = shape, rate = rate))
}
if (permutations < 20000) {
message("\n")
message(
"-> We recommended using at least 20000 permutations to achieve a sufficient permutation test."
)
}
if (plotHistogram) {
gammaDensity <- function(x) {
return(stats::dgamma(
x = x,
shape = shape,
rate = rate
))
}
graphics::par(mfrow = c(1, 3))
# plot a Cullen and Frey graph
fitdistrplus::descdist(var_values, boot = permutations)
# plot the histogram and the fitted curve
graphics::curve(
expr = gammaDensity,
xlim = c(min(c(
var_values, real.var
)), max(c(
var_values, real.var
))),
col = "steelblue",
lwd = 5,
xlab = "Variances",
ylab = "Frequency",
main = paste0("permutations = ", permutations)
)
histogram <- graphics::hist(
x = var_values,
prob = TRUE,
add = TRUE,
breaks = permutations / (0.01 * permutations)
)
graphics::rug(var_values)
graphics::abline(
v = real.var,
lty = 1,
lwd = 4,
col = "darkred"
)
p.vals_vec <- vector(mode = "numeric", length = runs)
if (parallel) {
### Parallellizing the sampling process using the 'doParallel' and 'parallel' package
### register all given cores for parallelization
par_cores <-
parallel::makeForkCluster(parallel::detectCores())
doParallel::registerDoParallel(par_cores)
### Perform the sampling process in parallel
p.vals_vec <-
as.vector(foreach::foreach(
i = 1:runs,
.combine = "c",
.errorhandling = "stop"
) %dopar%
{
FlatLineTest(ExpressionSet = ExpressionSet,
permutations = permutations)$p.value
})
parallel::stopCluster(par_cores)
}
if (!parallel) {
# sequential computations of p-values
# initializing the progress bar
#progressBar <- txtProgressBar(min = 1,max = runs,style = 3)
for (i in 1:runs) {
p.vals_vec[i] <- FlatLineTest(ExpressionSet = ExpressionSet,
permutations = permutations)$p.value
# printing out the progress
#setTxtProgressBar(progressBar,i)
}
}
graphics::plot(
x = p.vals_vec,
type = "l" ,
lwd = 6,
ylim = c(0, 1),
col = "darkblue",
xlab = "Runs",
ylab = "P-value",
main = paste0("runs = ", runs)
)
graphics::abline(
h = 0.05,
lty = 2,
lwd = 3,
col = "darkred"
)
}
pval <-
stats::pgamma(real.var,
shape = shape,
rate = rate,
lower.tail = FALSE)
sd_values <- apply(resMatrix, 2, stats::sd)
return(list(
p.value = pval,
std.dev = sd_values,
ks.test = ks_test
))
}
GetGamma <- function(var_values, permutations)
{
iterations = 200
max_cut = 0.25
step = max_cut/iterations
sorted_vars = sort(var_values, decreasing = TRUE)
max_p_fit_v = 0
max_p_i = 0
for (i in 2:iterations) {
# to avoid indexing from zero
# Filtered variances
filtered_vars <-
sorted_vars[round(length(var_values) * i * step):length(var_values)]
# Estimate parameters using method of moments
gamma_fit <-
fitdistrplus::fitdist(filtered_vars, "gamma", method = "mme")
shape <- gamma_fit$estimate[1]
rate <- gamma_fit$estimate[2]
# Perform Kolmogorov-Smirnov test
suppressWarnings(ks_result <-
stats::ks.test(filtered_vars,
"pgamma",
shape = shape,
rate = rate))
if (ks_result$p.value > max_p_fit_v) {
max_p_i = i
max_p_fit_v = ks_result$p.value
}
}
if (max_p_i == 0) {
gamma_fit <-
fitdistrplus::fitdist(var_values, "gamma", method = "mme")
return(list(gamma_fit$estimate[1], gamma_fit$estimate[2],suppressWarnings(ks_result <-
stats::ks.test(var_values,
"pgamma",
shape = gamma_fit$estimate[1],
rate = gamma_fit$estimate[2]))))
}
b_shape = 0
b_rate = 0
ks_best = NULL
max_p_fit_v = 0
for (i in -10:10) {
# Filtered variances
lb = round(length(var_values) * (max_p_i * step + i * step / 10))
filtered_vars <-
sorted_vars[lb:length(var_values)]
# Estimate parameters using method of moments
gamma_fit <-
fitdistrplus::fitdist(filtered_vars, "gamma", method = "mme")
shape <- gamma_fit$estimate[1]
rate <- gamma_fit$estimate[2]
# Perform Kolmogorov-Smirnov test
suppressWarnings(ks_result <-
stats::ks.test(filtered_vars,
"pgamma",
shape = shape,
rate = rate))
if (ks_result$p.value > max_p_fit_v) {
max_p_fit_v = ks_result$p.value
b_shape = shape
b_rate = rate
ks_best = ks_result
}
}
return(list(b_shape, b_rate, ks_best))
}
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.