knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
OVESEG-test (One Versus Everyone Subtype Exclusively-expressed Genes test) is a statistically-principled multiple-group comparison method that can detect tissue/cell-specific marker genes (MGs) among many subtypes (e.g. tissue/cell types) [@oveseg]. To assess the statistical significance of MGs, OVESEG-test uses a specifically designed test statistics that mathematically matches the definition of MGs, and employs a novel permutation scheme to estimate the corresponding distribution under null hypothesis where the expression patterns of non-MGs can be highly complex.
OVESEG
package provides functions to compute OVESEG-test statistics, derive component weights in the mixture null distribution model and estimate p-values from weightedly aggregated permutations. While OVESEG-test p-values can be used to detect MGs, component weights of hypotheses also help portrait all kinds of upregulation patterns among tissue/cell types.
The function OVESEGtest()
includes all necessary steps to obtain OVESEG-test p-values. You need to specify the expression matrix, tissue/cell type labels, number of permutations, and parallel option. For microarray data, the expressions need to be log2-transformed prior to the test. For RNAseq count data, the counts need to be transformed to logCPM (log2-counts per million) prior to the test and use limma::voom()
to incorporate the mean-variance relationship. Parallel option is set according to r Biocpkg("BiocParallel")
package.
#Microarray rtest <- OVESEGtest(y, group, NumPerm=999, BPPARAM=BiocParallel::SnowParam()) #RNAseq yvoom <- limma::voom(count, model.matrix(~0+factor(group))) rtest <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999, BPPARAM=BiocParallel::SnowParam())
Theoretically, OVESEG-test can be applied to any molecular expression data types as long as they can be modeled by normal distribution after a transformation, such as log2-transformation for microarray and proteomics data, logCPM for RNAseq counts, logit2-transformation for DNA methylation beta values. If mean-variance relationship exists after transformation, limma::vooma()/voom()
need to be performed firstly and the resulted weight matrix is used as the input of OVESEG
.
Requirements for the input expression data:
The input expression data should be stored in a matrix. Data frame, or SummarizedExperiment object is also accepted and will be internally coerced into a matrix format before analysis. Rows correspond to probes and columns to samples. Tissue/cell labels must match each column in expression matrix. Weight matrix, if provided, must match each spot in expression matrix.
We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from GSE28490) as an example to show OVESEG-test on microarray data.
library(OVESEG) #Import data data(RocheBT) y <- RocheBT$y #5000*15 matrix group <- RocheBT$group #15 labels #filter low-expressed probes groupMean <- sapply(levels(group), function(x) rowMeans(y[,group == x])) groupMeanMax <- apply(groupMean, 1, max) keep <- (groupMeanMax > log2(64)) y <- y[keep,] #OVESEG-test rtest1 <- OVESEGtest(y, group, NumPerm = 999, BPPARAM=BiocParallel::SnowParam())
Note there are many low-expressed probe filtering methods. Here we filtered the probes whose mean expression is less than a threshold even in the highest expressed group. If mean-variance relationship is obvious, we can consider to apply limma::vooma()
firstly.
#OVESEG-test yvooma <- limma::vooma(y, model.matrix(~0+factor(group))) rtest2 <- OVESEGtest(yvooma$E, group, weights=yvooma$weights, NumPerm = 999, BPPARAM=BiocParallel::SnowParam())
We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from GSE60424) as an example to show OVESEG-test on RNAseq count data.
library(OVESEG) #Import data data(countBT) count <- countBT$count #10000*60 matrix group <- countBT$group #60 labels #filter low-expressed probes groupMean <- sapply(levels(group), function(x) rowMeans(count[,group == x])) groupMeanMax <- apply(groupMean, 1, max) keep <- (groupMeanMax > 2) count <- count[keep,] #OVESEG-test lib.size <- max(colSums(count)) yvoom <- limma::voom(count, model.matrix(~0+factor(group)), lib.size = lib.size) rtest3 <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999, BPPARAM=BiocParallel::SnowParam())
Note there are many low-expressed probe filtering methods. Here we filtered the probes whose mean count is less than a threshold even in the highest expressed group. lib.size
and other parameters in limma::voom()
can be set manually according to r Biocpkg("limma")
package.
OVESEGtest
returns three p-values: pv.overall
, pv.oneside
and pv.multiside
. pv.overall
is calculated by all permutations regardless of upregulated subtypes. pv.oneside
is subtype-specific p-values calculated specifically for the upregulated subtype of each probe. pv.multiside
is pv.oneside
multiplied by K (K comparison correction) and truncated at 1. More details can be found in the paper [@oveseg].
OVESEG-test statistics are defined as $$\mathbf{\max_{k=1,...,K}\left{min_{l \neq k}\left{ \frac{\mu_k(j)-\mu_l(j)}{\sigma(j)\sqrt{\frac{1}{N_k}+\frac{1}{N_l}}} \right}\right}}$$ where $\mathbf{\mu_k(j)}$ is the mean of logarithmic expressions of gene j in subtype k. While it takes a long time to execute permutations for p-value estimation, OVESEGtstat()
is useful if users only need test statistics for ranking genes:
#OVESEG-test statistics rtstat1 <- OVESEGtstat(y, RocheBT$group) rtstat2 <- OVESEGtstat(yvooma$E, RocheBT$group, weights=yvooma$weights) rtstat3 <- OVESEGtstat(yvoom$E, countBT$group, weights=yvoom$weights)
Null hypothesis of OVESEG-test is modeled as a mixture of multiple components, where the weights of components are estimated from posterior probabilities over all probes. Those posterior probabilities are also returned by OVESEGtest()
. If users only want to observe probewise posterior probabilities, postProbNull()
is helpful.
ppnull1 <- postProbNull(y, RocheBT$group) ppnull2 <- postProbNull(yvooma$E, RocheBT$group, weights=yvooma$weights) ppnull3 <- postProbNull(yvoom$E, countBT$group, weights=yvoom$weights)
By accumulating and normalizing probewise posterior probability with the function nullDistri()
, we can also obtain the probability of one subtype being upregulated conditioned on null hypotheses. The subtype with higher probability tends to get more False Positive MGs.
nullDistri(ppnull1) nullDistri(ppnull2) nullDistri(ppnull3)
There are totally $\mathbf{2^K-1}$ types of expression patterns in a real dataset: probes exclusively expressed in 1,2,…, or K of subtypes. Probe unexpressed in any of subtypes should have been filtered during pre-processing. Accumulating and normalizing probewise posterior probability of null hypotheses and of alternative hypotheses using the function patternDistri()
can present us the ratios of all possible upregulation patterns among subtypes:
patterns <- patternDistri(ppnull1) #or patterns <- patternDistri(rtest1)
The ratios of patterns can be illustrated as following.
library(gridExtra) library(grid) library(reshape2) library(ggplot2) gridpatterns <- function (ppnull) { K <- length(ppnull$label) cellcomp <- patternDistri(ppnull) cellcomp <- cellcomp[order(cellcomp[,K+1], decreasing = TRUE),] #Bar Chart to show pattern percentage DF1 <- data.frame(Rank = seq_len(2^K - 1), cellcomp) p1 <- ggplot(DF1, aes(x = Rank, y = Wpattern)) + geom_bar(stat = "identity") + scale_y_continuous(labels=scales::percent) + theme_bw(base_size = 16) + theme(axis.text.x = element_blank(), axis.title.x = element_blank(), plot.margin=unit(c(1,1,-0.2,1), "cm"), panel.grid.minor = element_line(size = 1), panel.grid.major = element_line(size = 1), panel.border = element_blank(), axis.ticks.x = element_blank()) + ylab("Percentage of up in certain subtype(s)") #patterns as X-axis DF2 <- data.frame(Rank = seq_len(2^K-1), cellcomp[,-(K+1)]) DF2 <- melt(DF2, id.var="Rank") p2 <- ggplot(DF2, aes(x = Rank, y = value, fill = variable)) + geom_bar(stat = "identity") + scale_fill_brewer(palette="Set2") + theme(line = element_blank(), axis.text.x = element_blank(), axis.text.y = element_blank(), title = element_blank(), panel.background = element_rect(fill = "white"), legend.position="bottom", legend.key.size = unit(1.2,"line"), plot.margin=unit(c(-0.2,1,1,1), "cm")) + scale_y_reverse() + guides(fill = guide_legend(nrow = 1, byrow = TRUE)) g1 <- ggplotGrob(p1) g2 <- ggplotGrob(p2) colnames(g1) <- paste0(seq_len(ncol(g1))) colnames(g2) <- paste0(seq_len(ncol(g2))) g <- gtable_combine(g1, g2, along=2) panels <- g$layout$t[grepl("panel", g$layout$name)] n1 <- length(g$heights[panels[1]]) n2 <- length(g$heights[panels[2]]) # assign new (relative) heights to the panels # notice the *4 here to get different heights g$heights[panels[1]] <- unit(n1*4,"null") g$heights[panels[2]] <- unit(n2,"null") return(g) } grid.newpage() grid.draw(gridpatterns(ppnull1)) #or grid.draw(gridpatterns(rtest1))
Default variance estimator in OVESEG
is empirical Bayes moderated variance estimator used in r Biocpkg("limma")
(argument alpha="moderated"
). Another option is adding a constant $\alpha$ to pooled variance estimator (argument alpha=
$\alpha$). Setting argument alpha=NULL
will treat all variances as the same constant.
Before MG detection, OVESEG-test p-values still need multiple comparison correction or false discovery rate control.
##multiple comparison correction pv.overall.adj <- stats::p.adjust(rtest1$pv.overall, method="bonferroni") pv.multiside.adj <- stats::p.adjust(rtest1$pv.multiside, method="bonferroni") ##fdr control qv.overall <- fdrtool::fdrtool(rtest1$pv.overall, statistic="pvalue", plot=FALSE, verbose=FALSE)$qval qv.multiside <- fdrtool::fdrtool(rtest1$pv.multiside, statistic="pvalue", plot=FALSE, verbose=FALSE)$qval
sessionInfo()
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.