DQ: Improved q-values for discrete uniform and homogeneous tests

Description Usage Arguments Details Value Author(s) References Examples

View source: R/DQ.R

Description

Performs the five versions of the q-value method considered in Cousido-Rocha et al. (2019). The q-value method is based on the false discovery rate (FDR), and the versions differ in the estimator of the proportion of true null hypotheses, π_0, which is plugged in the FDR estimator. Specifically, we consider as possible estimators for π_0: two usual estimators for continuous and possibly heterogeneous P-values; an estimator for discrete P-values defined in two steps: firstly the P-values are randomized, and then the usual π_0 estimator for continuous P-values is applied; and the estimators recently proposed for discrete P-values by Liang (2016) and Chen et al. (2014).

Usage

1
2
3
4
5
6
7
DQ(
  pv,
  ss = NULL,
  ss_inf = FALSE,
  method = c("ST", "SS", "Liang", "Chen", "Rand"),
  lambda = seq(0.05, 0.95, 0.05)
)

Arguments

pv

A vector of P-values.

ss

Support of the discrete distribution of the P-values. Only required for “Liang”, “Chen” and “Rand” methods which are specifically proposed for discrete P-values. If the P-values are continuous the methods “ST” and “SS” do not need this argument, hence “ss=NULL” by default.

ss_inf

Logical. Default is FALSE. A variable indicating whether the support of the discrete distribution of the P-values is finite or infinite. See details.

method

The q-value method. By default the “Chen” method is computed.

lambda

The value of the tuning parameter to estimate π_0 when the method is “ST”. See details.

Details

The function implements the five different versions of the q-value method in Cousido-Rocha et al. (2019). Three versions are novel adaptions for the case of homogeneous discrete uniform P-values, whereas the other two are classical versions of the q-value method for P-values which follow a continuous uniform distribution under the global null hypothesis. The classical versions are the q-value method based on the π_0 estimator proposed in Storey (2002) with tunning parameter lambda = 0.5, and the q-value procedure which uses the π_0 estimator in Storey and Tibshirani (2003) who proposed an automatic method to estimate π_0. We refer to these methods as “SS” and “ST”, respectively. On the other hand the three adaptations of the q-value method for homogeneous discrete uniform P-values are: “Liang” which considers the π_0 estimator proposed in Liang (2016); “Chen” which uses a simplification for homogeneous discrete P-values of the algorithm for the estimation of π_0 proposed in Chen et al. (2014); and “Rand” which employs the standard q-value procedure but applied to randomized P-values. For details of the different q-value versions, in particular for the novel adaptations for homogeneous discrete uniform P-values, see Cousido-Rocha et al. (2019).

As we mentioned above the novel adaptations of the q-value method are developed for homogeneous discrete uniform P-values. Specifically, suppose that we test a large number of null hypothesis, p, and that the P-values {pv_1, ... , pv_p} are observations of the random variables PV_i, i = 1, ... , p. Homogeneous means that all the P-values share an identical support S with 0 < t_1 < ··· < t_s < t_{s+1} ≡ 1. On the other hand, making an abuse of language, we say that the P-values follow a discrete uniform distribution if it holds Pr(PV_i ≤ t) = t for t in S, i = 1, ..., p. The classical discrete uniform distribution is a particular case.

The argument “lambda” must be a sequence of values in [0, 1), for details of this parameter see Storey (2002) or Storey and Tibshirani (2003). The latter paper recommends the default value “lambda=seq(0.05,0.95,0.05)”.

The support of the discrete distribution of the P-values can be finite or infinte. Hence the parameter “ss inf” must be “FALSE” if the support is finite and “TRUE” if the support is infinite. See examples where a poisson setting is considered.

Finally, it is relevant to mention that Cousido-Rocha et al. (2019) verified (via simulations) that the results of the different q-values methods for dependent P-values are very similar to the ones corresponding to the independent setting.

Value

A list containing the following components:

pi0

An estimate of the proportion of null P-values.

q.values

A vector of the estimated q-values (the main quantities of interest).

Author(s)

References

Examples

  1
  2
  3
  4
  5
  6
  7
  8
  9
 10
 11
 12
 13
 14
 15
 16
 17
 18
 19
 20
 21
 22
 23
 24
 25
 26
 27
 28
 29
 30
 31
 32
 33
 34
 35
 36
 37
 38
 39
 40
 41
 42
 43
 44
 45
 46
 47
 48
 49
 50
 51
 52
 53
 54
 55
 56
 57
 58
 59
 60
 61
 62
 63
 64
 65
 66
 67
 68
 69
 70
 71
 72
 73
 74
 75
 76
 77
 78
 79
 80
 81
 82
 83
 84
 85
 86
 87
 88
 89
 90
 91
 92
 93
 94
 95
 96
 97
 98
 99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
# We consider a simple simulated data set to illustrate the use of the DQ function.
# We have simulated the following situation.
# We have two groups, for example, 5 patients with tumor 1 and 5 patients
# with tumor 2. For each patient 100 variables are measured, for example,
# gene expression levels. Besides, the distributions of 30 of the variables
# are different in the two groups, and the differences are in location.


# We consider a collection of densities {f1=N(0,1), f2=N(0,1/4), f3=N(2,1), f4=N(2,1/4)}. In
# the first group (tumor 1) the sample of each variable (gene) comes from one of
# the four densities with the same probability 1/4. On the other hand, in the second
# group the sample of each variable comes from the same density as in the first
# group except for 30 randomly selected variables for which the density changes
# producing location differences. Specifically, if the variable follows
# f1 in the first group, its density, in the second group, is f3  producing a
# change on its location parameter.The situation for the other cases is as follows:
# the density f2 (f3 or f4) in group 1 leads to density f4 (f1 or f2, respectively)
# in the second one.

set.seed(123)
p <- 100
n = m = 5
inds <- sample(1:4, p, replace = TRUE)
X <- matrix(rep(0, n * p), ncol = n)

for (j in 1:p){
  if (inds[j] == 1){
    X[j, ] <- rnorm(n)}
  if (inds[j] == 2){
    X[j, ] <- rnorm(n, sd = sqrt(1/4))
  }
  if (inds[j] == 3){
    X[j, ]<-rnorm(n, mean = 2)
  }
  if (inds[j] == 4){
    X[j, ]<-rnorm(n, mean = 2, sd = sqrt(1/4))
  }
}

rho <- 0.3

ind <- sample(1:p, rho * p)
li <- length(ind)
indsy <- inds

for (l in 1:li){
  if (indsy[ind[l]] == 1){indsy[ind[l]] = 3} else{
    if (indsy[ind[l]] == 2){indsy[ind[l]] = 4} else {
      if (indsy[ind[l]] == 3){indsy[ind[l]] = 1}
      else{indsy[ind[l]] = 2}}}}

Y <- matrix(rep(0, m * p), ncol = m)

for (j in 1:p){
  if (indsy[j] == 1){
    Y[j, ] <- rnorm(m)}
  if (indsy[j] == 2){
    Y[j, ] <- rnorm(m, sd = sqrt(1/4))
  }
  if (indsy[j] == 3){
    Y[j, ] <- rnorm(m, mean = 2)
  }
  if (indsy[j] == 4){
    Y[j, ]<- rnorm(m, mean = 2, sd = sqrt(1/4))
  }
}


# We can see which are the variables with different distributions in the two data sets.

dif <- which(inds != indsy)

# Cross table for (X,Y) density indexes:

table(inds,indsy)

# Our interest is to identify which variables have a different distribution in the two groups.
# Hence, since the differences between the distributions are in location, we applied
# Wilcoxon-Mann-Whitney test to verify for each variable the equality of distribution
# in the two groups.



library(exactRankTests)
library(coin)

# We compute the P-values

p <- nrow(X)
pv <- 1:p
for (i in 1:p){
  pv[i] <- wilcox.exact(X[i, ], Y[i, ])$p.value
}

# When the sample size is small, in this case n=m=5, the distribution of
# the Wilcoxon's statistic is calibrated using an exact permutation test. Hence,
# the P-values are homogeneous discrete uniform distributed with support points
# of such distribution:

ss <- c(1, 2, 4, 7, 12, 19, 28, 39, 53,69, 87, 106, 126) / 126

# When the number of P-values is large enough "ss" is equal to:
sort(unique(pv))

# For details about the Wilcoxon-Mann-Whitney test and its exact distribution
# see Section 9.2 of Gibbons and Chakraborti (1992).

# We apply Chen method:

R <- DQ(pv, ss = ss, method = "Chen")

# The estimate of the proportion of null P-values:

R$pi0

# Summary of the vector of the estimated q-values:

summary(R$q.values)

# How many null hypotheses are rejected?
alpha <- 0.05
sum(R$q.values < alpha)

# Which variables correspond to such null hypotheses?

which(R$q.values < alpha)

# Classification table (Decision at nominal level alpha vs. reality):

table(R$q.values < alpha,inds != indsy)

# The conclusion from the previous table is that Chen method reports
# 21 true positives and 9 false negatives.


# We can also apply Liang and SS methods as follows.
RLiang <- DQ(pv, ss = ss, method = "Liang")
RSS <- DQ(pv, ss = ss, method = "SS")

# The next graphic help us to see that Liang method (for discrete P-values)
# is more powerful than SS method (only suitable for continuous P-values).

plot(RSS$q.values,RLiang$q.values)
abline(a = 0, b = 1, col = 2, lty = 2)

# We consider a poisson setting to show a case where the support of the discrete
# distribution of the P-values is infinte.
# We generate 100 values of a poisson distribution with event rate 10.
# Then we compute the probability that each of the values come from a
# a poisson distribution with event rate 10. This set of probabilities
# are considered as our set of P-values.

p <- 100
N <- rpois(p, lambda = 10)
pv <- 1:p
for(i in 1:p){
  pv[i] <- ppois(N[i], lambda = 10)
}

# It is well know that the support of a poisson distribution is infinite
# and equal to the natural numbers. Hence to know the support of the P-values
# defined above, we compute for 1,..., 50 their corresponding P-value.
# We only considered 50 values because for large values than 50 the P-value is 1 again.

nn_0 <- 50
ss <- 1:(nn_0 + 1)
for (i in 1:(nn_0 + 1)){
 ss[i] <- ppois(i - 1, lambda = 10)
}

# We eliminate repeated values.
ss <- unique(ss)
# For Chen method the relevant support points are only the values below tau[100] = 0.5.
# We define the support ss as such values. Then, we can apply Chen method. Of
# course s_inf = TRUE.

indicator <- which(ss <= 0.5)
ssi <- ss[indicator]
R <- DQ(pv, ss = ssi, ss_inf = "TRUE", method = "Chen")
# For Liang method the relevant support values are also the values below 0.5, hence
# ss defined above is suitable.
R <- DQ(pv, ss = ssi, ss_inf = "TRUE", method = "Liang")

# For Rand method the relevant support values are those ones with match with
# the P-values, and also the largest support points smaller than each one of such P-values.
# Hence, ss only includes the relevant points, as we can see below.
pv_unique <- unique(pv)
p_u <- length(pv_unique)
ind <- 1:p_u
for(i in 1:p_u){
  ind[i] <- which(ss == pv_unique[i])
}
p_u == length(ind)
ind_minus <- ind - 1
ind_final <- unique(sort(c(ind, ind_minus)))
ss <- ss[ind_final]
# Now, we can apply Rand method.
R <- DQ(pv, ss = ss, ss_inf = "TRUE", method = "Rand")

DiscreteQvalue documentation built on April 2, 2020, 1:06 a.m.