title: "BioQC: Detect tissue heterogeneity in gene expression data" author: "Jitao David Zhang, Klas Hatje, Gregor Sturm, Clemens Broger, Martin Ebeling, and Laura Badi" date: "2017-07-27" vignette: > %\VignetteEngine{knitr::rmarkdown} %\VignetteIndexEntry{BioQC: Detect tissue heterogeneity in gene expression data} %\usepackage[utf8]{inputenc} output: md_document: variant: markdown_phpextra preserve_yaml: TRUE html_document: toc: true

The Wilcoxon-Mann-Whitney rank sum test is one of the most commonly used
algorithms for non-parametric tests. The current implementations in R
however are not optimal for multiple hypothesis testing. The *BioQC*
package implements the algorithm in a native C routine to allow fast
computation in scenarios where a numeric vector (*e.g.* expression
values of all genes) is compared against a collection of subsets of the
vector (*e.g.* expression values of genes belonging to a collection of
gene sets). We demonstrate the use of the package with the *BioQC*
algorithm, which uses the Wilcoxon-Mann-Whitney rank sum test and a
collection of tissue-preferentially gene signatures to detect tissue
heterogeneity in high-throughput gene expression studies.

In this vignette we demonstrate the use of *BioQC* by a simulated
expression dataset, and compare the performance of Wilcoxon-Mann-Whitney
rank sum test algorithm implemented in *BioQC* to other implementations
available in R. To use *BioQC*, the users only need to provide an
expression dataset, in the form of a numeric matrix, or an
*ExpressionSet* object. The *BioQC* package provides tissue-specific
genes that can be used directly with the algorithm. The output is one
score for each tissue type and each sample. The ranks of the score
within each sample can be compared with prior knowledge about the sample
to infer tissue heterogeneity. The hypotheses generated by *BioQC*
should be further tested in follow-up experiments.

We demonstrate the basic use of the package with a dummy example. First,
we load *BioQC* library and the tissue signatures into the R session.

```
library(Biobase)
library(BioQC)
gmtFile <- system.file("extdata/exp.tissuemark.affy.roche.symbols.gmt", package="BioQC")
gmt <- readGmt(gmtFile)
```

Next, we synthesize an ExpressionSet object, using randomly generated data.

```
Nrow <- 2000L
Nsample <- 5L
gss <- unique(unlist(sapply(gmt, function(x) x$genes)))
myEset <- new("ExpressionSet",
exprs=matrix(rnorm(Nrow*Nsample), nrow=Nrow),
featureData=new("AnnotatedDataFrame",
data.frame(GeneSymbol=sample(gss, Nrow))))
```

Finally we run the *BioQC* algorithm and print the summary of the
results. As expected, no single tissue scored significantly after
multiple correction.

```
dummyRes <- wmwTest(myEset, gmt, valType="p.greater", simplify=TRUE)
summary(p.adjust(dummyRes, "BH"))
```

```
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.4536 0.9498 0.9888 0.9744 1.0000 1.0000
```

The dummy example above shows how to run *BioQC* algorithm with an
*ExpressionSet* and a list read from a GMT file (a file format capturing
gene sets). Users can also use more basic data structures (*e.g.*
matrices and list of integer indexes) to run the algorithm as shown by
the following example. Other data structures will be coerced into these
basic data structures; we refer the interested user to the documentation
of the *wmwTest* function.

```
myMatrix <- matrix(rnorm(Nrow*Nsample),
ncol=Nsample,
dimnames=list(NULL, LETTERS[1:Nsample]))
myList <- list(signature1=sample(1:Nrow, 100),
signature2=sample(1:Nrow, 50),
signature3=sample(1:Nrow, 200))
wmwTest(myMatrix, myList, valType="p.greater", simplify=TRUE)
```

```
## A B C D E
## signature1 0.4282743 0.3868482 0.7977667 0.8877019 0.15461514
## signature2 0.5128106 0.8417263 0.3876947 0.3321845 0.90683395
## signature3 0.8421740 0.6157008 0.2904893 0.5908780 0.08307253
```

We have applied *BioQC* to a real gene expression profiling dataset.
*BioQC* helped to generated hypotheses about potential contamination of
rat kidney examples by pancreas tissues, which could be confirmed with
qRT-PCR.

The data and the script used to perform the analysis can be found in the github repository BioQC-example.

In the core of *wmwTest*, an efficient C-implementation makes it
feasible to run a large number of Wilcoxon-Mann-Whitney tests on
large-scale expression profiling datasets and with many signature lists.
Compared to native R implementations in *stats* (*wilcox.text*) and
*limma* (*rankSumTestWithCorrelation*) packages, the *BioQC*
implementation requires less memory and avoids repetitive statistical
ranking of data.

The following code, though in a small scale, demonstrates the difference between the performances of two implementations.

```
bm.Nrow <- 22000
bw.Nsample <- 5
bm.Ngs <- 5
bm.Ngssize <- sapply(1:bm.Ngs, function(x) sample(1:bm.Nrow/2, replace=TRUE))
ind <- lapply(1:bm.Ngs, function(i) sample(1:bm.Nrow, bm.Ngssize[i]))
exprs <- matrix(round(rnorm(bm.Nrow*bw.Nsample),4), nrow=bm.Nrow)
system.time(Cres <- wmwTest(exprs, ind, valType="p.less", simplify=TRUE))
```

```
## user system elapsed
## 0.152 0.000 0.325
```

```
system.time(Rres <- apply(exprs, 2, function(x)
sapply(ind, function(y)
wmwTestInR(x, y, valType="p.less"))))
```

```
## user system elapsed
## 2.560 0.008 5.364
```

With 22000 genes, five samples, and five gene sets, the *BioQC*
implementation is about 20x faster than the R implementation (dependent
on individual machines and settings). Our benchmark shows that with the
same number of genes, 2000 samples and 200 gene sets (similar to the
total number of tissues collected in the *BioQC* signature list), the
*BioQC* implementation can be about 1000x faster than the R
implementation.

```
sessionInfo()
```

```
## R version 3.4.0 (2017-04-21)
## Platform: i686-pc-linux-gnu (32-bit)
## Running under: Linux Mint 18
##
## Matrix products: default
## BLAS: /usr/lib/libblas/libblas.so.3.6.0
## LAPACK: /usr/lib/lapack/liblapack.so.3.6.0
##
## locale:
## [1] LC_CTYPE=de_CH.UTF-8 LC_NUMERIC=C
## [3] LC_TIME=de_CH.UTF-8 LC_COLLATE=de_CH.UTF-8
## [5] LC_MONETARY=de_DE.UTF-8 LC_MESSAGES=de_CH.UTF-8
## [7] LC_PAPER=de_DE.UTF-8 LC_NAME=C
## [9] LC_ADDRESS=C LC_TELEPHONE=C
## [11] LC_MEASUREMENT=de_DE.UTF-8 LC_IDENTIFICATION=C
##
## attached base packages:
## [1] parallel methods stats graphics grDevices utils datasets
## [8] base
##
## other attached packages:
## [1] BioQC_1.5.1 Rcpp_0.12.12 Biobase_2.34.0
## [4] BiocGenerics_0.20.0 knitr_1.16.6
##
## loaded via a namespace (and not attached):
## [1] digest_0.6.12 rprojroot_1.2 backports_1.1.0 magrittr_1.5
## [5] evaluate_0.10.1 stringi_1.1.5 rmarkdown_1.6 tools_3.4.0
## [9] stringr_1.2.0 yaml_2.1.14 compiler_3.4.0 htmltools_0.3.6
```

**Any scripts or data that you put into this service are public.**

Embedding an R snippet on your website

Add the following code to your website.

For more information on customizing the embed code, read Embedding Snippets.