knitr::opts_chunk$set(tidy = FALSE, cache = FALSE, autodep = TRUE)
High throughput data is very common in modern science. The main property of these data is high-dimensionality; that is, the number of features is larger than the number of observations. There are many ways to study this kind of data, and multiple hypothesis testing is one of them. In a multiple hypothesis test, generally a list of p-values $(p_i)$ are calculated, one for each hypothesis $(H_i)$ corresponding to one feature; the p-values are then compared with one or more predefined fixed or random thresholds to obtain the number of significant features. The key goal is to control the Family Wise Error Rate ($FWER$) or False Discovery Rate ($FDR$) while maximizing the power of the tests.
Although multiple hypotheses provide platforms to test many features simultaneously, they often require high compensation for doing so [@stephens2016false]. To overcome this shortcoming, BH[@benjamini1997false] shows a way of using the rank of the p-values frequently termed adjusted p-values. However, this method solely depends on the p-values and, therefore, provides only sub optimal power of the tests. An alternative approach is p-value weighting in which external information is used in terms of weight, and the actual p-value is redefined by incorporating the weight, which is usually accomplished by dividing the original p-values by the corresponding weights and creating new p-values called weighted p-values, i.e., $p_i^{w_i} = \frac{p_i}{w_i}; i = 1, ...., m$, where $p_i^{w_i}$ and $w_i$ refer to the weighted p-value and the corresponding weight. This external information is frequently referred to the covariates $(y_i)$. The requirements of the method are that the covariates are independent under the null hypothesis but informative for the power [@bourgon2010independent]. In addition, the weights must be non-negative, and the mean of the weights must be equal to $1$.
Generally, covariates provide different prior probabilities of the null hypotheses being true; therefore, a judiciously chosen covariate can significantly improve the power of the test while maintaining the error rate below the threshold. Such covariates are frequently available from various studies and data sets [@ignatiadis2016natmeth]. In this vignette, we discussed an application of a newly proposed optimal p-value weighting named Covariate Rank Weighting (CRW). In the article, we showed how to compute an optimal weight of the p-values without estimating the effect sizes of the tests. We showed that the weights can be computed via a probabilistic relationship of the ranking of the tests and the effect sizes. With regard to $CRW$, we developed an $R$ package named r Biocpkg("OPWeight")
, and we will discuss the application procedures of the functions of the package. To apply the function, the essential inputs are:
i) a vector of p-values ii) a vector of covariates, where each value corresponds to a p-value
and the auxiliary inputs are
i) the probabilities of the rank of the test statistics given the mean effect size iii) mean of the covariate and/or test effect sizes iv) a significance level of $\alpha$ at which $FWER$ or $FDR$ will be controlled
The proposed $CRW$ method uses covariates to computes the probability of the ranks of the test statistics being higher than any other tests given the mean effect size, $p(r_i \mid \varepsilon_i)$, then compute the weights from the probability corresponding to the p-values. As an example, we will discuss RNA-seq differential gene expression data called r Biocpkg("airway")
from the $R$ library r Biocpkg("airway")
. This data set is also used in the $IHW$ package vignettes [@ignatiadis2016natmeth].
We fist preprocess the r Biocpkg("airway")
RNA-Seq data set using r Biocpkg("DESeq2")
[@love2014moderated] to obtain the p-values, test statistics, and the covariates.
Load required libraries
# install OPWeight package if (!requireNamespace("BiocManager", quietly = TRUE)) install.packages("BiocManager") BiocManager::install("OPWeight") # load packages library(OPWeight) library(ggplot2) library(airway) library(DESeq2) library(cowplot) library(tibble) library(qvalue) library(MASS)
Process RNA-seq data
data("airway", package = "airway") dds <- DESeqDataSet(se = airway, design = ~ cell + dex) dds <- DESeq(dds) de_res_air <- as.data.frame(results(dds))
The output is a r class(de_res_air)
object containing the following columns for each gene in which each gene corresponds to a hypothesis test:
colnames(de_res_air) dim(de_res_air)
From the above columns, we will consider baseMean
as the covariate. In the r Biocpkg("DESeq2")
paper, it was argued that the covariates baseMean
and the test statistics stat
are approximately independent under the null hypothesis, which satisfied the requirements for the baseMean
to be the covariates.
First we will load the proposed package r Biocpkg("OPWeight")
and then show the steps to conduct $CRW$ tests:
# add a small constant to make all values positive covariates = de_res_air$baseMean + .0001 set.seed(123) # one should use more nrep to obtain more acurate results opw_results <- opw(pvalue = de_res_air$pvalue, covariate = covariates, nrep = 1000, alpha = .1, tail = 2, effectType = "continuous", method = "BH")
**Figure 1:** The above plot shows the estimated $\lambda$ of the box-cox transformation and the corresponding $log-likelihood.$
The executed function $opw()$ returns a list of objects:
names(opw_results)
For example, the estimated proportion of the true null hypothesis:
opw_results$nullProp
The number of rejected null hypothesis:
opw_results$rejections
The plots of the probabilities, $P(r_i \mid \varepsilon_i)$, and the corresponding weights, $w_i$:
m = opw_results$totalTests testRanks = 1:m probs = opw_results$dataOut$ranksProb weights = opw_results$dataOut$weight dat = tibble(testRanks, probs, weights) p_prob = ggplot(dat, aes(x = testRanks, y = probs)) + geom_point(size = .5, col="firebrick4") + labs(x = "Test rank" , y = "p(rank | effect)") p_weight = ggplot(dat, aes(x = testRanks, y = weights)) + geom_point(size = .5, col="firebrick4")+ labs(x = "Test rank" , y = "Weight") plot_grid(p_prob, p_weight, labels = c("a", "b"))
**Figure 2:** Rank probability (a) and Weight (b) versus Test rank
If one wants to test $H_0: \varepsilon_i=0$ vs. $H_0: \varepsilon_i > 0$, i.e., the effect sizes follow continuous distribution, then one should use effectType = "continuous"
. Similarly, if one wants to test $H_0: \varepsilon_i=0$ vs. $H_0: \varepsilon_i=\varepsilon$, i.e., effect sizes are binary; $0$ under the null model and a fixed value $\varepsilon$ under the alternative model, one should use effectType = "binary"
. $opw()$ function can provide results based on $FDR$ or $FWER$, and the default method is Benjamini-Hochberg [@benjamini1997false] $FDR$ method.
The main goal of the function is to compute the probabilities of the ranks from
the p-values and the covariates, consequently computing the weights. Although weights
and
ranksProb
are optional, $opw$ has the options so that one can compute the probabilities
and the weights externally if necessary (see the Data analysis section).
Other parameters that we did not use in the $opw()$ function can be discussed here. Internally, the function computes the ranksprob
and consequently the weights, then uses the p-values to formulate conclusions about the hypotheses. Therefore, if ranksprob
is given then mean_covariateEffect
is redundant and should not be provided to the function. Although ranksprob
is not required for the function, one can compute ranksprob
by using the function prob_rank_givenEffect.
The function $opw()$ internally computes mean_covariateEffect
and mean_testEffect
from a simple linear regression with box-cox transformation between the test and covariate statistics, where the covariates are regressed on the test statistics. Then the estimated mean_covariateEffect
and mean_testEffect
are used to obtain the ranksprob
and the weights. Thus, in order to apply the function properly, it is crucial to understand the uses of the parameters mean_covariateEffect
and mean_testEffect
. Note that, covariates need to be positive to apply $box-cox$ from the R library r Biocpkg("MASS")
. If the covariates are less than or equal to zero add small positive values with the covariates or externally supply mean_covariateEffect
and mean_testEffect
(see the data analysis section).
If mean_covariateEffect
and mean_testEffect
are not provided, then the test statistics
computed from the p-values will be used to obtain the relationship between the covariates
and the test statistics to compute the mean effects. If one of the mean effects
is not given, then the missing mean effect will be computed internally.
There are many ways to obtain the mean of the test effects, mean_testEffect
, and the corresponding value of the covariate effects, mean_covariateEffect
; however, in the proposed $R$ function $opw()$, we used a simple linear regression with box-cox transformation. Sometimes the box-cox transformation may not be the optimal choice or one wants to use a different model to obtain the relationship. In that situation, one can acquire the relationship between the covariates and test statistics externally then provide ranksprob
or mean_covariateEffect
and mean_testEffect
to perform the proposed optimal p-value weighting. In the following section, we will show the details analysis procedure.
Before applying the methods, a pre-screening analysis of the data set should be performed. Consider the above data set. First, let us generate some plots based on the covariates and the p-values.
Data <- tibble(pval=de_res_air$pvalue, covariate=de_res_air$baseMean) hist_pval <- ggplot(Data, aes(x = Data$pval)) + geom_histogram(colour = "#1F3552", fill = "#4281AE")+ labs(x = "P-values") hist_covariate <- ggplot(Data, aes(x = Data$covariate)) + geom_histogram( colour = "#1F3552", fill = "#4274AE")+ labs(x = "Covariate statistics") pval_covariate <- ggplot(Data, aes(x = rank(-Data$covariate), y = -log10(pval))) + geom_point()+ labs(x = "Ranks of covariates", y = "-log(pvalue)") p_ecdf <- ggplot(Data, aes(x = pval)) + stat_ecdf(geom = "step")+ labs(x = "P-values", title="Empirical cumulative distribution")+ theme(plot.title = element_text(size = rel(.7))) p <- plot_grid(hist_pval, hist_covariate, pval_covariate, p_ecdf, labels = c("a", "b", "c", "d"), ncol = 2) # now add the title title <- ggdraw() + draw_label("Airway: data summary") plot_grid(title, p, ncol = 1, rel_heights=c(0.1, 1))
Figure 3: (a) Distribution of the p-values, (b) distribution of the covariates, (c) relationship between the p-values and the rank of the covariate statistics, and (d) empirical cumulative distribution of the p-values.
From the pre-analysis plots, we see the distribution of the covariates is rightly skewed; therefore, to perform a linear regression, the covariates need to be transformed. In $opw()$ function, we used box-cox transformation. However, one can perform another transformation, such as $log(covariate)$, and fit a simple linear regression model, or apply other methods to obtain the relationship, for example, a generalized linear model. Although the transformed distribution still may not be suitable, this may not be a severe problem as long as the model fits the center of the data well because the proposed method only requires the center of the distribution.
We also observe from the plot that there is a weak relationship between the test statistics and the covariates; however, the ranked-covariate statistics show a potential relationship with the p-values, i.e., low p-values are enriched at the higher covariates. This relationship might be informative for the p-value-weighting because we are more interested in the ranking of the covariates. Enrichment of the low p-values at higher covariates also indicates that the covariates are correlated with the power under the alternative model. Furthermore, the bimodal distribution of the p-values indicates two-tailed test criteria are necessary because p-values close to $1$ are the cases that are significant in the opposite direction. We also observed the empirical cumulative distribution of the p-values. The empirical cumulative distribution shows that the curve is almost linear for the high p-values, which reveals the lesser importance of the higher p-values, and the size of the low p-values is very small.
We first fit a linear regression with box-cox transformation to obtain the estimated mean_covariateEffect
and mean_testEffect
in the following:
# initial stage-------- pvals = de_res_air$pvalue tests = qnorm(pvals/2, lower.tail = FALSE) covariates = de_res_air$baseMean + .0001 # to ensure covariates are postive for the box-cox # formulate a data set------------- Data = tibble(pvals, covariates) OD <- Data[order(Data$covariates, decreasing=TRUE), ] Ordered.pvalue <- OD$pvals # estimate the true null and alternative test sizes------ m = length(Data$pvals); m nullProp = qvalue(Data$pvals, pi0.method="bootstrap")$pi0; nullProp m0 = ceiling(nullProp*m); m0 m1 = m - m0; m1 # etimated test efects of the true altrnatives------------ # keep top m1 tests for regression --------- tests[which(!is.finite(tests))] <- NA Data2 = add_column(Data, tests) OD2 <- Data2[order(Data2$tests, decreasing=TRUE), ][1:m1, ] test_effect_vec <- OD2$tests # fit box-cox regression #-------------------------------- bc <- MASS::boxcox(covariates ~ tests, data = OD2) lambda <- bc$x[which.max(bc$y)] model <- lm(covariates**lambda ~ tests, data = OD2) # for the continuous effects etimated mean effects mean_testEffect = mean(test_effect_vec, na.rm = TRUE) mean_testEffect mean_covariateEffect = model$coef[[1]] + model$coef[[2]]*mean_testEffect mean_covariateEffect
In order to obtain the mean effects, we compute the predicted value of the covariates corresponding to the mean or median of the test statistics. The mean or median value of the test statistics is then used as the estimated mean_testEffect
, and the corresponding predicted value of the covariates is used as the estimated mean_covariateEffect
.
Let us now compute the probabilities of the ranks of the covariates given the mean effect
size computed above. Note that for the ranks probability, et
and ey
are the same because the mean effect of the covariates is only a single value.
set.seed(123) # compute ranks probability prob_cont <- sapply(1:m, prob_rank_givenEffect, et = mean_covariateEffect, ey = mean_covariateEffect, nrep = 1000, m0 = m0, m1 = m1)
Then applying the probabilities to compute the weights
# compute weihgts wgt = weight_continuous_nwt(alpha = .1, et = mean_testEffect, ranksProb = prob_cont)$w
Finally, obtain the total number and the list of the significant tests
alpha = .1 padj <- p.adjust(Ordered.pvalue/wgt, method = "BH") rejections_list = OD[which((padj <= alpha) == TRUE), ] n_rejections = dim(rejections_list)[1] n_rejections
or apply $opw$ function to compute the final results
opw_results2 <- opw(pvalue = pvals, covariate = covariates, weight = wgt, effectType = "continuous", alpha = .1, method = "BH") opw_results2$rejections
The above procedures are performed by the $opw()$ function internally by default;
however, if one wants to try a different approach such as fitting a generalized
linear model instead of the box-cox transformation, then one can apply a different
model and follow similar procedures to compute the parameters mean_covariateEffect
and
mean_testEffect
, or weight
or ranksProb
, then apply the results in $opw()$ function. Sometime it is necessary to perform parallel computing, especially if the number of the hypothesis tests is large (e.g. millions). In this context, the following data analysis procedure could be convenient instead of using the $opw()$ function directly.
In the package r Biocpkg("OPweight")
, there are other useful functions: 1) prob_rank_givenEffect, 2) weight_binary_nwt, 3) weight_continuous_nwt, 4) weight_binary, 5) weight_continuous, 6) weight_by_delta, 7) prob_rank_givenEffect_approx, 8) prob_rank_givenEffect_exact*, and 9) prob_rank_givenEffect_simu*. The first three functions are used inside the main function $opw()$. Functions $2$ and $3$ are based on the Newton-Raphson algorithm, and hence, are very fast. However, this algorithm depends on a correct guess of the initial value. The function internally adjusts the initial value. One can also provide the initial value externally.
Functions $4$ and $5$ are based on the grid search algorithm; therefore, it is very slow. However, these functions always work and can be used as replacements for functions $2$ and $3$.
The Sixth function is used inside Functions $4$ and $5$ and the remaining functions are provided if anyone wishes to see the behavior of the ranks probability of the test statistic, $p(r_i \mid \varepsilon_i)$. In the original article, we proposed the probability method and showed an exact mathematical formula, then verified the formula by simulations. The exact method requires intensive computation; therefore, we proposed an approximation. By applying the later three functions one can easily observe that all approaches perform similarly. However, for a large number of tests, the simulation and the exact approach are computationally very expensive; therefore, the approximation method is a better option. In fact, we actually need $p(r_i \mid E(\varepsilon_i))$, which is obtained by the function prob_rank_givenEffect. For the details see Hasan and Schliekelman, (2017) and the corresponding supplementary materials.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.