In mshasan/OPWeight: Optimal p-value weighting with independent information

knitr::opts_chunk$set(tidy = FALSE, cache = FALSE, autodep = TRUE)

Introduction

High throughput data is very common in modern science. The main property of these data is high-dimensionality; that is, the number of features is larger than the number of observations. There are many ways to study this kind of data, and multiple hypothesis testing is one of them. In a multiple hypothesis test, generally a list of p-values $(p_i)$ are calculated, one for each hypothesis $(H_i)$ corresponding to one feature; the p-values are then compared with one or more predefined fixed or random thresholds to obtain the number of significant features. The key goal is to control the Family Wise Error Rate ($FWER$) or False Discovery Rate ($FDR$) while maximizing the power of the tests.

Although multiple hypotheses provide platforms to test many features simultaneously, they often require high compensation for doing so [@stephens2016false]. To overcome this shortcoming, BH[@benjamini1997false] shows a way of using the rank of the p-values frequently termed adjusted p-values. However, this method solely depends on the p-values and, therefore, provides only sub optimal power of the tests. An alternative approach is p-value weighting in which external information is used in terms of weight, and the actual p-value is redefined by incorporating the weight, which is usually accomplished by dividing the original p-values by the corresponding weights and creating new p-values called weighted p-values, i.e., $p_i^{w_i} = \frac{p_i}{w_i}; i = 1, ...., m$, where $p_i^{w_i}$ and $w_i$ refer to the weighted p-value and the corresponding weight. This external information is frequently referred to the covariates $(y_i)$. The requirements of the method are that the covariates are independent under the null hypothesis but informative for the power [@bourgon2010independent]. In addition, the weights must be non-negative, and the mean of the weights must be equal to $1$.

Generally, covariates provide different prior probabilities of the null hypotheses being true; therefore, a judiciously chosen covariate can significantly improve the power of the test while maintaining the error rate below the threshold. Such covariates are frequently available from various studies and data sets [@ignatiadis2016natmeth]. In this vignette, we discussed an application of a newly proposed optimal p-value weighting named Covariate Rank Weighting (CRW). In the article, we showed how to compute an optimal weight of the p-values without estimating the effect sizes of the tests. We showed that the weights can be computed via a probabilistic relationship of the ranking of the tests and the effect sizes. With regard to $CRW$, we developed an $R$ package named r Biocpkg("OPWeight"), and we will discuss the application procedures of the functions of the package. To apply the function, the essential inputs are:

i) a vector of p-values ii) a vector of covariates, where each value corresponds to a p-value

and the auxiliary inputs are

i) the probabilities of the rank of the test statistics given the mean effect size iii) mean of the covariate and/or test effect sizes iv) a significance level of $\alpha$ at which $FWER$ or $FDR$ will be controlled

The proposed $CRW$ method uses covariates to computes the probability of the ranks of the test statistics being higher than any other tests given the mean effect size, $p(r_i \mid \varepsilon_i)$, then compute the weights from the probability corresponding to the p-values. As an example, we will discuss RNA-seq differential gene expression data called r Biocpkg("airway") from the $R$ library r Biocpkg("airway"). This data set is also used in the $IHW$ package vignettes [@ignatiadis2016natmeth].

Airway RNA-seq data example

We fist preprocess the r Biocpkg("airway") RNA-Seq data set using r Biocpkg("DESeq2") [@love2014moderated] to obtain the p-values, test statistics, and the covariates.

Load required libraries

# install OPWeight package
if (!requireNamespace("BiocManager", quietly = TRUE))
    install.packages("BiocManager")
BiocManager::install("OPWeight")

# load packages
library(OPWeight)
library(ggplot2)
library(airway)
library(DESeq2)
library(cowplot)
library(tibble)
library(qvalue)
library(MASS)

Process RNA-seq data

data("airway", package = "airway")
dds <- DESeqDataSet(se = airway, design = ~ cell + dex)
dds <- DESeq(dds)
de_res_air <- as.data.frame(results(dds))

The output is a r class(de_res_air) object containing the following columns for each gene in which each gene corresponds to a hypothesis test:

colnames(de_res_air)
dim(de_res_air)

From the above columns, we will consider baseMean as the covariate. In the r Biocpkg("DESeq2") paper, it was argued that the covariates baseMean and the test statistics stat are approximately independent under the null hypothesis, which satisfied the requirements for the baseMean to be the covariates.

First we will load the proposed package r Biocpkg("OPWeight") and then show the steps to conduct $CRW$ tests:

# add a small constant to make all values positive
covariates = de_res_air$baseMean + .0001  
set.seed(123)
# one should use more nrep to obtain more acurate results
opw_results <- opw(pvalue = de_res_air$pvalue, covariate = covariates, nrep = 1000,
                   alpha = .1, tail = 2, effectType = "continuous", method = "BH")

**Figure 1:** The above plot shows the estimated $\lambda$ of the box-cox transformation and the corresponding $log-likelihood.$

The executed function $opw()$ returns a list of objects:

names(opw_results)

For example, the estimated proportion of the true null hypothesis:

opw_results$nullProp

The number of rejected null hypothesis:

opw_results$rejections

The plots of the probabilities, $P(r_i \mid \varepsilon_i)$, and the corresponding weights, $w_i$:

m = opw_results$totalTests
testRanks = 1:m
probs = opw_results$dataOut$ranksProb
weights = opw_results$dataOut$weight
dat = tibble(testRanks, probs, weights)
p_prob = ggplot(dat, aes(x = testRanks, y = probs)) + geom_point(size = .5, col="firebrick4") + 
    labs(x = "Test rank" , y = "p(rank | effect)")
p_weight = ggplot(dat, aes(x = testRanks, y = weights)) + geom_point(size = .5, col="firebrick4")+
    labs(x = "Test rank" , y = "Weight")
plot_grid(p_prob, p_weight, labels = c("a", "b"))

**Figure 2:** Rank probability (a) and Weight (b) versus Test rank

If one wants to test $H_0: \varepsilon_i=0$ vs. $H_0: \varepsilon_i > 0$, i.e., the effect sizes follow continuous distribution, then one should use effectType = "continuous". Similarly, if one wants to test $H_0: \varepsilon_i=0$ vs. $H_0: \varepsilon_i=\varepsilon$, i.e., effect sizes are binary; $0$ under the null model and a fixed value $\varepsilon$ under the alternative model, one should use effectType = "binary". $opw()$ function can provide results based on $FDR$ or $FWER$, and the default method is Benjamini-Hochberg [@benjamini1997false] $FDR$ method.

The main goal of the function is to compute the probabilities of the ranks from the p-values and the covariates, consequently computing the weights. Although weights and ranksProb are optional, $opw$ has the options so that one can compute the probabilities and the weights externally if necessary (see the Data analysis section).

Other parameters that we did not use in the $opw()$ function can be discussed here. Internally, the function computes the ranksprob and consequently the weights, then uses the p-values to formulate conclusions about the hypotheses. Therefore, if ranksprob is given then mean_covariateEffect is redundant and should not be provided to the function. Although ranksprob is not required for the function, one can compute ranksprob by using the function prob_rank_givenEffect.

The function $opw()$ internally computes mean_covariateEffect and mean_testEffect from a simple linear regression with box-cox transformation between the test and covariate statistics, where the covariates are regressed on the test statistics. Then the estimated mean_covariateEffect and mean_testEffect are used to obtain the ranksprob and the weights. Thus, in order to apply the function properly, it is crucial to understand the uses of the parameters mean_covariateEffect and mean_testEffect. Note that, covariates need to be positive to apply $box-cox$ from the R library r Biocpkg("MASS"). If the covariates are less than or equal to zero add small positive values with the covariates or externally supply mean_covariateEffect and mean_testEffect (see the data analysis section).

If mean_covariateEffect and mean_testEffect are not provided, then the test statistics computed from the p-values will be used to obtain the relationship between the covariates and the test statistics to compute the mean effects. If one of the mean effects is not given, then the missing mean effect will be computed internally.

There are many ways to obtain the mean of the test effects, mean_testEffect, and the corresponding value of the covariate effects, mean_covariateEffect; however, in the proposed $R$ function $opw()$, we used a simple linear regression with box-cox transformation. Sometimes the box-cox transformation may not be the optimal choice or one wants to use a different model to obtain the relationship. In that situation, one can acquire the relationship between the covariates and test statistics externally then provide ranksprob or mean_covariateEffect and mean_testEffect to perform the proposed optimal p-value weighting. In the following section, we will show the details analysis procedure.

Data analysis

Before applying the methods, a pre-screening analysis of the data set should be performed. Consider the above data set. First, let us generate some plots based on the covariates and the p-values.

Data <- tibble(pval=de_res_air$pvalue, covariate=de_res_air$baseMean)


hist_pval <- ggplot(Data, aes(x = Data$pval)) +
        geom_histogram(colour = "#1F3552", fill = "#4281AE")+
        labs(x = "P-values")

hist_covariate <- ggplot(Data, aes(x = Data$covariate)) +
        geom_histogram( colour = "#1F3552", fill = "#4274AE")+
        labs(x = "Covariate statistics")

pval_covariate <- ggplot(Data, aes(x = rank(-Data$covariate), y = -log10(pval))) +
        geom_point()+
        labs(x = "Ranks of covariates", y = "-log(pvalue)")

p_ecdf <- ggplot(Data, aes(x = pval)) +
            stat_ecdf(geom = "step")+
            labs(x = "P-values", title="Empirical cumulative distribution")+
            theme(plot.title = element_text(size = rel(.7)))

p <- plot_grid(hist_pval, hist_covariate, pval_covariate, p_ecdf, 
               labels = c("a", "b", "c", "d"), ncol = 2)
# now add the title
title <- ggdraw() + draw_label("Airway: data summary")
plot_grid(title, p, ncol = 1, rel_heights=c(0.1, 1))

Figure 3: (a) Distribution of the p-values, (b) distribution of the covariates, (c) relationship between the p-values and the rank of the covariate statistics, and (d) empirical cumulative distribution of the p-values.

From the pre-analysis plots, we see the distribution of the covariates is rightly skewed; therefore, to perform a linear regression, the covariates need to be transformed. In $opw()$ function, we used box-cox transformation. However, one can perform another transformation, such as $log(covariate)$, and fit a simple linear regression model, or apply other methods to obtain the relationship, for example, a generalized linear model. Although the transformed distribution still may not be suitable, this may not be a severe problem as long as the model fits the center of the data well because the proposed method only requires the center of the distribution.

We also observe from the plot that there is a weak relationship between the test statistics and the covariates; however, the ranked-covariate statistics show a potential relationship with the p-values, i.e., low p-values are enriched at the higher covariates. This relationship might be informative for the p-value-weighting because we are more interested in the ranking of the covariates. Enrichment of the low p-values at higher covariates also indicates that the covariates are correlated with the power under the alternative model. Furthermore, the bimodal distribution of the p-values indicates two-tailed test criteria are necessary because p-values close to $1$ are the cases that are significant in the opposite direction. We also observed the empirical cumulative distribution of the p-values. The empirical cumulative distribution shows that the curve is almost linear for the high p-values, which reveals the lesser importance of the higher p-values, and the size of the low p-values is very small.

We first fit a linear regression with box-cox transformation to obtain the estimated mean_covariateEffect and mean_testEffect in the following:

# initial stage--------
pvals = de_res_air$pvalue
tests = qnorm(pvals/2, lower.tail = FALSE)
covariates = de_res_air$baseMean + .0001    # to ensure covariates are postive for the box-cox

# formulate a data set-------------
Data = tibble(pvals, covariates)
OD <- Data[order(Data$covariates, decreasing=TRUE), ]
Ordered.pvalue <- OD$pvals

# estimate the true null and alternative test sizes------
m = length(Data$pvals); m
nullProp = qvalue(Data$pvals, pi0.method="bootstrap")$pi0; nullProp
m0 = ceiling(nullProp*m); m0
m1 = m - m0; m1

# etimated test efects of the true altrnatives------------
# keep top m1 tests for regression ---------
tests[which(!is.finite(tests))] <- NA
Data2 = add_column(Data, tests)
OD2 <- Data2[order(Data2$tests, decreasing=TRUE), ][1:m1, ]

test_effect_vec <-  OD2$tests

# fit box-cox regression
#--------------------------------
bc <- MASS::boxcox(covariates ~ tests, data = OD2)
lambda <- bc$x[which.max(bc$y)]
model <- lm(covariates**lambda ~ tests, data = OD2)


# for the continuous effects etimated mean effects
mean_testEffect = mean(test_effect_vec, na.rm = TRUE)
mean_testEffect
mean_covariateEffect = model$coef[[1]] + model$coef[[2]]*mean_testEffect
mean_covariateEffect

In order to obtain the mean effects, we compute the predicted value of the covariates corresponding to the mean or median of the test statistics. The mean or median value of the test statistics is then used as the estimated mean_testEffect, and the corresponding predicted value of the covariates is used as the estimated mean_covariateEffect.

Let us now compute the probabilities of the ranks of the covariates given the mean effect size computed above. Note that for the ranks probability, et and ey are the same because the mean effect of the covariates is only a single value.

set.seed(123)
# compute ranks probability
prob_cont <- sapply(1:m, prob_rank_givenEffect, et = mean_covariateEffect,
                   ey = mean_covariateEffect, nrep = 1000, m0 = m0, m1 = m1)

Then applying the probabilities to compute the weights

# compute weihgts
wgt = weight_continuous_nwt(alpha = .1, et = mean_testEffect, ranksProb = prob_cont)$w

Finally, obtain the total number and the list of the significant tests

alpha = .1
padj <- p.adjust(Ordered.pvalue/wgt, method = "BH")
rejections_list = OD[which((padj <= alpha) == TRUE), ]
n_rejections = dim(rejections_list)[1]
n_rejections

or apply $opw$ function to compute the final results

opw_results2 <- opw(pvalue = pvals, covariate = covariates, weight = wgt, 
                     effectType = "continuous", alpha = .1, method = "BH")
opw_results2$rejections

The above procedures are performed by the $opw()$ function internally by default; however, if one wants to try a different approach such as fitting a generalized linear model instead of the box-cox transformation, then one can apply a different model and follow similar procedures to compute the parameters mean_covariateEffect and mean_testEffect, or weight or ranksProb, then apply the results in $opw()$ function. Sometime it is necessary to perform parallel computing, especially if the number of the hypothesis tests is large (e.g. millions). In this context, the following data analysis procedure could be convenient instead of using the $opw()$ function directly.

Other functions

In the package r Biocpkg("OPweight"), there are other useful functions: 1) prob_rank_givenEffect, 2) weight_binary_nwt, 3) weight_continuous_nwt, 4) weight_binary, 5) weight_continuous, 6) weight_by_delta, 7) prob_rank_givenEffect_approx, 8) prob_rank_givenEffect_exact*, and 9) prob_rank_givenEffect_simu*. The first three functions are used inside the main function $opw()$. Functions $2$ and $3$ are based on the Newton-Raphson algorithm, and hence, are very fast. However, this algorithm depends on a correct guess of the initial value. The function internally adjusts the initial value. One can also provide the initial value externally.

Functions $4$ and $5$ are based on the grid search algorithm; therefore, it is very slow. However, these functions always work and can be used as replacements for functions $2$ and $3$.

The Sixth function is used inside Functions $4$ and $5$ and the remaining functions are provided if anyone wishes to see the behavior of the ranks probability of the test statistic, $p(r_i \mid \varepsilon_i)$. In the original article, we proposed the probability method and showed an exact mathematical formula, then verified the formula by simulations. The exact method requires intensive computation; therefore, we proposed an approximation. By applying the later three functions one can easily observe that all approaches perform similarly. However, for a large number of tests, the simulation and the exact approach are computationally very expensive; therefore, the approximation method is a better option. In fact, we actually need $p(r_i \mid E(\varepsilon_i))$, which is obtained by the function prob_rank_givenEffect. For the details see Hasan and Schliekelman, (2017) and the corresponding supplementary materials.