OVESEG User Manual

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"   
)

Introduction

OVESEG-test (One Versus Everyone Subtype Exclusively-expressed Genes test) is a statistically-principled multiple-group comparison method that can detect tissue/cell-specific marker genes (MGs) among many subtypes (e.g. tissue/cell types) [@oveseg]. To assess the statistical significance of MGs, OVESEG-test uses a specifically designed test statistics that mathematically matches the definition of MGs, and employs a novel permutation scheme to estimate the corresponding distribution under null hypothesis where the expression patterns of non-MGs can be highly complex.

OVESEG package provides functions to compute OVESEG-test statistics, derive component weights in the mixture null distribution model and estimate p-values from weightedly aggregated permutations. While OVESEG-test p-values can be used to detect MGs, component weights of hypotheses also help portrait all kinds of upregulation patterns among tissue/cell types.

Quick Start

The function OVESEGtest() includes all necessary steps to obtain OVESEG-test p-values. You need to specify the expression matrix, tissue/cell type labels, number of permutations, and parallel option. For microarray data, the expressions need to be log2-transformed prior to the test. For RNAseq count data, the counts need to be transformed to logCPM (log2-counts per million) prior to the test and use limma::voom() to incorporate the mean-variance relationship. Parallel option is set according to r Biocpkg("BiocParallel") package.

#Microarray
rtest <- OVESEGtest(y, group, NumPerm=999, 
                    BPPARAM=BiocParallel::SnowParam())

#RNAseq
yvoom <- limma::voom(count, model.matrix(~0+factor(group)))
rtest <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999, 
                    BPPARAM=BiocParallel::SnowParam())

OVESEG-test

Datatypes and Input Format

Theoretically, OVESEG-test can be applied to any molecular expression data types as long as they can be modeled by normal distribution after a transformation, such as log2-transformation for microarray and proteomics data, logCPM for RNAseq counts, logit2-transformation for DNA methylation beta values. If mean-variance relationship exists after transformation, limma::vooma()/voom() need to be performed firstly and the resulted weight matrix is used as the input of OVESEG.

Requirements for the input expression data:

The input expression data should be stored in a matrix. Data frame, or SummarizedExperiment object is also accepted and will be internally coerced into a matrix format before analysis. Rows correspond to probes and columns to samples. Tissue/cell labels must match each column in expression matrix. Weight matrix, if provided, must match each spot in expression matrix.

Microarray data

We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from GSE28490) as an example to show OVESEG-test on microarray data.

library(OVESEG)
#Import data
data(RocheBT)
y <- RocheBT$y #5000*15 matrix
group <- RocheBT$group #15 labels

#filter low-expressed probes
groupMean <- sapply(levels(group), function(x) rowMeans(y[,group == x]))
groupMeanMax <- apply(groupMean, 1, max)
keep <- (groupMeanMax > log2(64))
y <- y[keep,]

#OVESEG-test
rtest1 <- OVESEGtest(y, group, NumPerm = 999, 
                     BPPARAM=BiocParallel::SnowParam())

Note there are many low-expressed probe filtering methods. Here we filtered the probes whose mean expression is less than a threshold even in the highest expressed group. If mean-variance relationship is obvious, we can consider to apply limma::vooma() firstly.

#OVESEG-test
yvooma <- limma::vooma(y, model.matrix(~0+factor(group)))
rtest2 <- OVESEGtest(yvooma$E, group, weights=yvooma$weights, NumPerm = 999,
                     BPPARAM=BiocParallel::SnowParam())

RNAseq count data

We use a data set of purified B cells, CD4+ T cells and CD8+ T cells (downsampled from GSE60424) as an example to show OVESEG-test on RNAseq count data.

library(OVESEG)
#Import data
data(countBT)
count <- countBT$count #10000*60 matrix
group <- countBT$group #60 labels

#filter low-expressed probes
groupMean <- sapply(levels(group), function(x) rowMeans(count[,group == x]))
groupMeanMax <- apply(groupMean, 1, max)
keep <- (groupMeanMax > 2)
count <- count[keep,]

#OVESEG-test
lib.size <- max(colSums(count))
yvoom <- limma::voom(count, model.matrix(~0+factor(group)), 
                     lib.size = lib.size)
rtest3 <- OVESEGtest(yvoom$E, group, weights=yvoom$weights, NumPerm = 999,
                     BPPARAM=BiocParallel::SnowParam())

Note there are many low-expressed probe filtering methods. Here we filtered the probes whose mean count is less than a threshold even in the highest expressed group. lib.size and other parameters in limma::voom() can be set manually according to r Biocpkg("limma") package.

p-values

OVESEGtest returns three p-values: pv.overall, pv.oneside and pv.multiside. pv.overall is calculated by all permutations regardless of upregulated subtypes. pv.oneside is subtype-specific p-values calculated specifically for the upregulated subtype of each probe. pv.multiside is pv.oneside multiplied by K (K comparison correction) and truncated at 1. More details can be found in the paper [@oveseg].

Useful intermediate results

Test statistics

OVESEG-test statistics are defined as $$\mathbf{\max_{k=1,...,K}\left{min_{l \neq k}\left{ \frac{\mu_k(j)-\mu_l(j)}{\sigma(j)\sqrt{\frac{1}{N_k}+\frac{1}{N_l}}} \right}\right}}$$ where $\mathbf{\mu_k(j)}$ is the mean of logarithmic expressions of gene j in subtype k. While it takes a long time to execute permutations for p-value estimation, OVESEGtstat() is useful if users only need test statistics for ranking genes:

#OVESEG-test statistics
rtstat1 <- OVESEGtstat(y, RocheBT$group)
rtstat2 <- OVESEGtstat(yvooma$E, RocheBT$group, weights=yvooma$weights)
rtstat3 <- OVESEGtstat(yvoom$E, countBT$group, weights=yvoom$weights)

Posterior probabilities of null hypothesis components

Null hypothesis of OVESEG-test is modeled as a mixture of multiple components, where the weights of components are estimated from posterior probabilities over all probes. Those posterior probabilities are also returned by OVESEGtest(). If users only want to observe probewise posterior probabilities, postProbNull() is helpful.

ppnull1 <- postProbNull(y, RocheBT$group)
ppnull2 <- postProbNull(yvooma$E, RocheBT$group, weights=yvooma$weights)
ppnull3 <- postProbNull(yvoom$E, countBT$group, weights=yvoom$weights)

By accumulating and normalizing probewise posterior probability with the function nullDistri(), we can also obtain the probability of one subtype being upregulated conditioned on null hypotheses. The subtype with higher probability tends to get more False Positive MGs.

nullDistri(ppnull1)
nullDistri(ppnull2)
nullDistri(ppnull3)

There are totally $\mathbf{2^K-1}$ types of expression patterns in a real dataset: probes exclusively expressed in 1,2,…, or K of subtypes. Probe unexpressed in any of subtypes should have been filtered during pre-processing. Accumulating and normalizing probewise posterior probability of null hypotheses and of alternative hypotheses using the function patternDistri() can present us the ratios of all possible upregulation patterns among subtypes:

patterns <- patternDistri(ppnull1)
#or
patterns <- patternDistri(rtest1)

The ratios of patterns can be illustrated as following.

library(gridExtra)
library(grid)
library(reshape2)
library(ggplot2)

gridpatterns <- function (ppnull) {
    K <- length(ppnull$label)
    cellcomp <- patternDistri(ppnull)
    cellcomp <- cellcomp[order(cellcomp[,K+1], decreasing = TRUE),]

    #Bar Chart to show pattern percentage
    DF1 <- data.frame(Rank = seq_len(2^K - 1), cellcomp)
    p1 <- ggplot(DF1, aes(x = Rank, y = Wpattern)) +
        geom_bar(stat = "identity") +
        scale_y_continuous(labels=scales::percent) +
        theme_bw(base_size = 16) +
        theme(axis.text.x = element_blank(),
              axis.title.x = element_blank(),
              plot.margin=unit(c(1,1,-0.2,1), "cm"),
              panel.grid.minor = element_line(size = 1),
              panel.grid.major = element_line(size = 1),
              panel.border = element_blank(),
              axis.ticks.x = element_blank()) +
        ylab("Percentage of up in certain subtype(s)")

    #patterns as X-axis
    DF2 <- data.frame(Rank = seq_len(2^K-1), cellcomp[,-(K+1)])
    DF2 <- melt(DF2, id.var="Rank")
    p2 <- ggplot(DF2, aes(x = Rank, y = value, fill = variable)) +
        geom_bar(stat = "identity") +
        scale_fill_brewer(palette="Set2") +
        theme(line = element_blank(),
              axis.text.x = element_blank(),
              axis.text.y = element_blank(),
              title = element_blank(),
              panel.background = element_rect(fill = "white"),
              legend.position="bottom",
              legend.key.size = unit(1.2,"line"),
              plot.margin=unit(c(-0.2,1,1,1), "cm")) +
        scale_y_reverse() +
        guides(fill = guide_legend(nrow = 1, byrow = TRUE))

    g1 <- ggplotGrob(p1)
    g2 <- ggplotGrob(p2)
    colnames(g1) <- paste0(seq_len(ncol(g1)))
    colnames(g2) <- paste0(seq_len(ncol(g2)))
    g <- gtable_combine(g1, g2, along=2)
    panels <- g$layout$t[grepl("panel", g$layout$name)]
    n1 <- length(g$heights[panels[1]])
    n2 <- length(g$heights[panels[2]])
    # assign new (relative) heights to the panels
    # notice the *4 here to get different heights
    g$heights[panels[1]] <- unit(n1*4,"null")
    g$heights[panels[2]] <- unit(n2,"null")
    return(g)
} 

grid.newpage()
grid.draw(gridpatterns(ppnull1))
#or grid.draw(gridpatterns(rtest1))

Notes

Default variance estimator in OVESEG is empirical Bayes moderated variance estimator used in r Biocpkg("limma") (argument alpha="moderated"). Another option is adding a constant $\alpha$ to pooled variance estimator (argument alpha=$\alpha$). Setting argument alpha=NULL will treat all variances as the same constant.

Before MG detection, OVESEG-test p-values still need multiple comparison correction or false discovery rate control.

##multiple comparison correction 
pv.overall.adj <- stats::p.adjust(rtest1$pv.overall, method="bonferroni")
pv.multiside.adj <- stats::p.adjust(rtest1$pv.multiside, method="bonferroni")

##fdr control
qv.overall <- fdrtool::fdrtool(rtest1$pv.overall, statistic="pvalue",
                               plot=FALSE, verbose=FALSE)$qval
qv.multiside <- fdrtool::fdrtool(rtest1$pv.multiside, statistic="pvalue",
                                 plot=FALSE, verbose=FALSE)$qval

sessionInfo

sessionInfo()

References



Try the OVESEG package in your browser

Any scripts or data that you put into this service are public.

OVESEG documentation built on Nov. 8, 2020, 4:51 p.m.