wmwTest: Wilcoxon-Mann-Whitney rank sum test for high-throughput...


Description

We have implemented a highly efficient Wilcoxon-Mann-Whitney rank sum test for high-throughput expression profiling data. For datasets with more than 100 features (genes), the function can be more than 1,000 times faster than R implementations such as wilcox.test in stats or rankSumTestWithCorrelation in limma.

Usage

wmwTest(
  x,
  indexList,
  col = "GeneSymbol",
  valType = c("p.greater", "p.less", "p.two.sided", "U", "abs.log10p.greater",
    "log10p.less", "abs.log10p.two.sided", "Q"),
  simplify = TRUE
)

## S4 method for signature 'matrix,IndexList'
wmwTest(x, indexList, valType, simplify = TRUE)

## S4 method for signature 'numeric,IndexList'
wmwTest(x, indexList, valType, simplify = TRUE)

## S4 method for signature 'matrix,GmtList'
wmwTest(x, indexList, valType, simplify = TRUE)

## S4 method for signature 'eSet,GmtList'
wmwTest(
  x,
  indexList,
  col = "GeneSymbol",
  valType = c("p.greater", "p.less", "p.two.sided", "U", "abs.log10p.greater",
    "log10p.less", "abs.log10p.two.sided", "Q"),
  simplify = TRUE
)

## S4 method for signature 'eSet,numeric'
wmwTest(
  x,
  indexList,
  col = "GeneSymbol",
  valType = c("p.greater", "p.less", "p.two.sided", "U", "abs.log10p.greater",
    "log10p.less", "abs.log10p.two.sided", "Q"),
  simplify = TRUE
)

## S4 method for signature 'eSet,logical'
wmwTest(
  x,
  indexList,
  col = "GeneSymbol",
  valType = c("p.greater", "p.less", "p.two.sided", "U", "abs.log10p.greater",
    "log10p.less", "abs.log10p.two.sided", "Q"),
  simplify = TRUE
)

## S4 method for signature 'eSet,list'
wmwTest(
  x,
  indexList,
  col = "GeneSymbol",
  valType = c("p.greater", "p.less", "p.two.sided", "U", "abs.log10p.greater",
    "log10p.less", "abs.log10p.two.sided", "Q"),
  simplify = TRUE
)

## S4 method for signature 'ANY,numeric'
wmwTest(x, indexList, valType, simplify = TRUE)

## S4 method for signature 'ANY,logical'
wmwTest(x, indexList, valType, simplify = TRUE)

## S4 method for signature 'ANY,list'
wmwTest(x, indexList, valType, simplify = TRUE)

## S4 method for signature 'matrix,SignedIndexList'
wmwTest(x, indexList, valType, simplify = TRUE)

## S4 method for signature 'matrix,SignedGenesets'
wmwTest(x, indexList, valType, simplify = TRUE)

## S4 method for signature 'numeric,SignedIndexList'
wmwTest(x, indexList, valType, simplify = TRUE)

## S4 method for signature 'eSet,SignedIndexList'
wmwTest(x, indexList, valType, simplify = TRUE)

## S4 method for signature 'eSet,SignedGenesets'
wmwTest(
  x,
  indexList,
  col = "GeneSymbol",
  valType = c("p.greater", "p.less", "p.two.sided", "U", "abs.log10p.greater",
    "log10p.less", "abs.log10p.two.sided", "Q"),
  simplify = TRUE
)

Arguments

x

A numeric matrix. All other data types (e.g. numeric vectors or ExpressionSet objects) are coerced into a matrix.

indexList

A list of integer indices (starting from 1) indicating signature genes. Can be of length zero. Other data types (e.g. a list of numeric or logical vectors, or a numeric or logical vector) are coerced into such a list. See details below for a special case using GMT files.

col

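A string, used only when x is an eSet: the name of the featureData column that contains gene symbols (default "GeneSymbol"). See details below.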

valType

The type of value to be returned. Allowed values are p.greater and p.less (one-sided tests), p.two.sided, U (the U statistic), the log10-transformed variants abs.log10p.greater, log10p.less and abs.log10p.two.sided, and Q. See details below.

simplify

Logical. If FALSE, the return value is always a matrix; if TRUE (default), the results are simplified into vectors when possible.

Details

The basic application of the function is to test the enrichment of gene sets in expression profiling data or differentially expressed data (the matrix with feature/gene in rows and samples in columns).

A special case is when x is an eSet object (e.g. ExpressionSet) and indexList is a list returned by the readGmt function. In this case, the only requirement is that a column named GeneSymbol in the featureData contains the gene symbols used in the GMT file. The same applies to signed GMT files. See the Examples section.
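The following base R sketch illustrates the idea behind this special case: gene sets given as symbols have to be translated into row indices of the expression data. It is an illustration only, not the package's internal code; the symbols, the `gmtLike` list and the matching strategy (first match only, unmatched symbols dropped) are assumptions made for the example.

```r
## Made-up stand-in for the GeneSymbol column of the featureData
geneSymbols <- c("AKT1", "TP53", "EGFR", "MYC", "GAPDH")

## Made-up gene sets given as symbols, as they would appear in a GMT file
gmtLike <- list(setA = c("TP53", "MYC", "BRCA1"),
                setB = c("GAPDH"))

## Translate symbols into 1-based row indices; symbols that are absent
## from the annotation (here BRCA1) are dropped
indexList <- lapply(gmtLike, function(genes) {
  ind <- match(genes, geneSymbols)
  ind[!is.na(ind)]
})

print(indexList)
```

The resulting list of integer vectors has the shape described for the indexList argument.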

Besides the conventional value types ‘p.greater’, ‘p.less’, ‘p.two.sided’ and ‘U’ (the U-statistic), wmwTest (from version 0.99-1) provides further value types. abs.log10p.greater and log10p.less apply a log10 transformation to the respective p-values and give the transformed value a sign (positive for greater than, negative for less than). abs.log10p.two.sided transforms two-sided p-values to non-negative values. The Q score is the absolute log10-transformed two-sided p-value with a sign attached: positive if the effect is in the ‘greater’ direction, negative if it is in the ‘less’ direction.
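The relationship between these value types can be sketched with stats::wilcox.test alone. This is a hedged illustration, not the package's C implementation: the variable names are hypothetical, and the rule used for the sign of Q (it follows the smaller of the two one-sided p-values) is an assumption chosen to match the description above.

```r
set.seed(1887)
x <- rnorm(200)                                   # one simulated sample
inSet <- sample(c(TRUE, FALSE), 200, replace = TRUE)

## Conventional p-values from the reference R implementation
pGreater  <- wilcox.test(x[inSet], x[!inSet], alternative = "greater")$p.value
pLess     <- wilcox.test(x[inSet], x[!inSet], alternative = "less")$p.value
pTwoSided <- wilcox.test(x[inSet], x[!inSet], alternative = "two.sided")$p.value

## Transformed variants as described in the text
absLog10pGreater  <- -log10(pGreater)    # non-negative
log10pLess        <- log10(pLess)        # non-positive
absLog10pTwoSided <- -log10(pTwoSided)   # non-negative

## Assumption: Q is signed by the direction with the smaller p-value
qScore <- ifelse(pGreater <= pLess, 1, -1) * absLog10pTwoSided

print(c(absLog10pGreater, log10pLess, absLog10pTwoSided, qScore))
```

Signed values of this kind are convenient for plotting, because enrichment and depletion fall on opposite sides of zero.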

Value

A numeric matrix or vector containing the statistic.

Methods (by class)

Note

The function has been optimized for expression profiling data. It avoids the repetitive ranking of data done by native R implementations and uses efficient C code to increase performance and control memory use. Simulation studies using expression profiles of 22000 genes in 2000 samples and 200 gene sets suggested that the C implementation can be >1000 times faster than the R implementation. The function can be accelerated further by calling it in parallel with mclapply (from the multicore package, now available in the parallel package).
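The two ideas in this note, ranking each sample only once and distributing work with mclapply, can be sketched in plain R as follows. This is a simplified illustration under stated assumptions, not the package's C code: `rankOnceU` is a hypothetical helper, the data and gene sets are simulated, and the statistic follows the W convention of stats::wilcox.test, which may be the complement of the U value reported by wmwTest.

```r
library(parallel)

## Hypothetical helper: statistics for all gene sets in indexList,
## ranking the sample x only once and reusing the ranks for every set
rankOnceU <- function(x, indexList) {
  r <- rank(x)
  vapply(indexList, function(ind) {
    n1 <- length(ind)
    sum(r[ind]) - n1 * (n1 + 1) / 2   # rank sum of the set minus its minimum
  }, numeric(1))
}

set.seed(1887)
mat <- matrix(rnorm(1000 * 4), nrow = 1000)          # 1000 genes, 4 samples
sets <- list(setA = 1:50, setB = sample(1000, 80))   # two made-up gene sets

## Forking is not available on Windows, so fall back to one core there
cores <- if (.Platform$OS.type == "windows") 1L else 2L
resList <- mclapply(seq_len(ncol(mat)),
                    function(j) rankOnceU(mat[, j], sets),
                    mc.cores = cores)
uMat <- do.call(cbind, resList)   # gene sets in rows, samples in columns
print(uMat)
```

With many gene sets the saving comes from computing rank(x) once per sample instead of once per gene set and sample.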

Author(s)

Jitao David Zhang <jitao_david.zhang@roche.com>

References

Barry, W.T., Nobel, A.B., and Wright, F.A. (2008). A statistical framework for testing functional categories in microarray data. _Annals of Applied Statistics_ 2, 286-315.

Wu, D, and Smyth, GK (2012). Camera: a competitive gene set test accounting for inter-gene correlation. _Nucleic Acids Research_ 40(17):e133

Zar, JH (1999). _Biostatistical Analysis 4th Edition_. Prentice-Hall International, Upper Saddle River, New Jersey.

See Also

wilcox.test in the stats package, and rankSumTestWithCorrelation in the limma package.

Examples

library(BioQC)

## R-native data structures
set.seed(1887)
rd <- rnorm(1000)
rl <- sample(c(TRUE, FALSE), 1000, replace=TRUE)
wmwTest(rd, rl, valType="p.two.sided")
wmwTest(rd, which(rl), valType="p.two.sided")
rd1 <- rd + ifelse(rl, 0.5, 0)
wmwTest(rd1, rl, valType="p.greater")
wmwTest(rd1, rl, valType="U")
rd2 <- rd - ifelse(rl, 0.2, 0)
wmwTest(rd2, rl, valType="p.greater")
wmwTest(rd2, rl, valType="p.two.sided")
wmwTest(rd2, rl, valType="p.less")

## matrix forms
rmat <- matrix(c(rd, rd1, rd2), ncol=3, byrow=FALSE)
wmwTest(rmat, rl, valType="p.two.sided")
wmwTest(rmat, rl, valType="p.greater")

wmwTest(rmat, which(rl), valType="p.two.sided")
wmwTest(rmat, which(rl), valType="p.greater")

## other valTypes
wmwTest(rmat, which(rl), valType="U")
wmwTest(rmat, which(rl), valType="abs.log10p.greater")
wmwTest(rmat, which(rl), valType="log10p.less")
wmwTest(rmat, which(rl), valType="abs.log10p.two.sided")
wmwTest(rmat, which(rl), valType="Q")

## using ExpressionSet
data(sample.ExpressionSet)
testSet <- sample.ExpressionSet
fData(testSet)$GeneSymbol <- paste("GENE_",1:nrow(testSet), sep="")
mySig1 <- sample(c(TRUE, FALSE), nrow(testSet), prob=c(0.25, 0.75), replace=TRUE)
wmwTest(testSet, which(mySig1), valType="p.greater")

## add a signal to the first sample, then test again with integer indices
exprs(testSet)[,1L] <- exprs(testSet)[,1L] + ifelse(mySig1, 50, 0)
wmwTest(testSet, which(mySig1), valType="p.greater")

## using lists
mySig2 <- sample(c(TRUE, FALSE), nrow(testSet), prob=c(0.6, 0.4), replace=TRUE)
wmwTest(testSet, list(first=mySig1, second=mySig2))

## using GMT file
gmt_file <- system.file("extdata/exp.tissuemark.affy.roche.symbols.gmt", package="BioQC")
gmt_list <- readGmt(gmt_file)

gss <- sample(unlist(sapply(gmt_list, function(x) x$genes)), 1000)
eset <- new("ExpressionSet",
            exprs=matrix(rnorm(10000), nrow=1000L),
            phenoData=new("AnnotatedDataFrame", data.frame(Sample=LETTERS[1:10])),
            featureData=new("AnnotatedDataFrame", data.frame(GeneSymbol=gss)))
esetWmwRes <- wmwTest(eset, gmt_list, valType="p.greater")
summary(esetWmwRes)

## using signed GMT file
signed_gmt_file <- system.file("extdata/test.gmt", package="BioQC")
signed_gmt <- readSignedGmt(signed_gmt_file)
esetSignedWmwRes <- wmwTest(eset, signed_gmt, valType="p.greater")

esetMat <- exprs(eset)
rownames(esetMat) <- fData(eset)$GeneSymbol
esetSignedWmwRes2 <- wmwTest(esetMat, signed_gmt, valType="p.greater")

Example output

Loading required package: Biobase
Loading required package: BiocGenerics
Loading required package: parallel

Attaching package: 'BiocGenerics'

The following objects are masked from 'package:parallel':

    clusterApply, clusterApplyLB, clusterCall, clusterEvalQ,
    clusterExport, clusterMap, parApply, parCapply, parLapply,
    parLapplyLB, parRapply, parSapply, parSapplyLB

The following objects are masked from 'package:stats':

    IQR, mad, sd, var, xtabs

The following objects are masked from 'package:base':

    anyDuplicated, append, as.data.frame, basename, cbind, colnames,
    dirname, do.call, duplicated, eval, evalq, Filter, Find, get, grep,
    grepl, intersect, is.unsorted, lapply, Map, mapply, match, mget,
    order, paste, pmax, pmax.int, pmin, pmin.int, Position, rank,
    rbind, Reduce, rownames, sapply, setdiff, sort, table, tapply,
    union, unique, unsplit, which.max, which.min

Welcome to Bioconductor

    Vignettes contain introductory material; view with
    'browseVignettes()'. To cite Bioconductor, see
    'citation("Biobase")', and for packages 'citation("pkgname")'.

[1] 0.8535404
[1] 0.8535404
[1] 1.949087e-14
[1] 90452
[1] 0.9976222
[1] 0.004758899
[1] 0.002379449
[1] 8.535404e-01 3.898175e-14 4.758899e-03
[1] 4.267702e-01 1.949087e-14 9.976222e-01
[1] 8.535404e-01 3.898175e-14 4.758899e-03
[1] 4.267702e-01 1.949087e-14 9.976222e-01
[1] 124152  90452 137887
[1]  0.369805934 13.710168669  0.001033906
[1] -2.416062e-01 -8.437865e-15 -2.623524e+00
[1]  0.06877594 13.40913867  2.32249352
[1]  0.06877594 13.40913867 -2.32249352
        A         B         C         D         E         F         G         H 
0.3968848 0.4750041 0.4090897 0.5238279 0.5757461 0.4790929 0.4343313 0.6157803 
        I         J         K         L         M         N         O         P 
0.6872272 0.5165239 0.4499611 0.5820556 0.4755881 0.4934181 0.4666891 0.4785086 
        Q         R         S         T         U         V         W         X 
0.6280457 0.2976157 0.3232340 0.6152199 0.6149396 0.3576255 0.3540736 0.4102298 
        Y         Z 
0.5086293 0.4288550 
           A            B            C            D            E            F 
4.427725e-06 4.750041e-01 4.090897e-01 5.238279e-01 5.757461e-01 4.790929e-01 
           G            H            I            J            K            L 
4.343313e-01 6.157803e-01 6.872272e-01 5.165239e-01 4.499611e-01 5.820556e-01 
           M            N            O            P            Q            R 
4.755881e-01 4.934181e-01 4.666891e-01 4.785086e-01 6.280457e-01 2.976157e-01 
           S            T            U            V            W            X 
3.232340e-01 6.152199e-01 6.149396e-01 3.576255e-01 3.540736e-01 4.102298e-01 
           Y            Z 
5.086293e-01 4.288550e-01 
                  A         B         C         D         E         F         G
first  4.427725e-06 0.4750041 0.4090897 0.5238279 0.5757461 0.4790929 0.4343313
second 3.322462e-01 0.5830970 0.4508703 0.4503638 0.4788279 0.2795935 0.5079108
               H         I         J         K         L         M         N
first  0.6157803 0.6872272 0.5165239 0.4499611 0.5820556 0.4755881 0.4934181
second 0.5554548 0.6009788 0.5637822 0.4024829 0.4129130 0.5557076 0.4844365
               O         P         Q         R         S         T         U
first  0.4666891 0.4785086 0.6280457 0.2976157 0.3232340 0.6152199 0.6149396
second 0.5477364 0.5394920 0.4508703 0.5673067 0.4577155 0.4478326 0.4821416
               V         W         X         Y         Z
first  0.3576255 0.3540736 0.4102298 0.5086293 0.4288550
second 0.3385415 0.6612245 0.5985069 0.6346372 0.3837895
       1                 2                 3                  4           
 Min.   :0.02198   Min.   :0.01316   Min.   :0.005433   Min.   :0.007333  
 1st Qu.:0.28609   1st Qu.:0.24582   1st Qu.:0.194880   1st Qu.:0.233636  
 Median :0.55866   Median :0.51048   Median :0.460976   Median :0.598480  
 Mean   :0.55896   Mean   :0.50834   Mean   :0.470506   Mean   :0.547411  
 3rd Qu.:0.79758   3rd Qu.:0.72584   3rd Qu.:0.706622   3rd Qu.:0.848243  
 Max.   :1.00000   Max.   :1.00000   Max.   :1.000000   Max.   :1.000000  
       5                  6                 7                 8            
 Min.   :0.002327   Min.   :0.01225   Min.   :0.00879   Min.   :0.0002731  
 1st Qu.:0.262446   1st Qu.:0.27261   1st Qu.:0.23966   1st Qu.:0.2535412  
 Median :0.531550   Median :0.58203   Median :0.52827   Median :0.5691982  
 Mean   :0.523777   Mean   :0.55246   Mean   :0.52780   Mean   :0.5251984  
 3rd Qu.:0.773279   3rd Qu.:0.79958   3rd Qu.:0.80507   3rd Qu.:0.7977362  
 Max.   :1.000000   Max.   :1.00000   Max.   :1.00000   Max.   :1.0000000  
       9                  10          
 Min.   :0.001639   Min.   :0.009709  
 1st Qu.:0.265712   1st Qu.:0.273843  
 Median :0.503677   Median :0.517334  
 Mean   :0.513945   Mean   :0.531123  
 3rd Qu.:0.738335   3rd Qu.:0.817494  
 Max.   :1.000000   Max.   :1.000000  

BioQC documentation built on Nov. 8, 2020, 7:16 p.m.