knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

library(PIMseq)

Introduction

We propose a semi-parametric method based on Probabilistic Index Models (PIM, @thas2012 and @de2015) for testing differential expression (DE) in single cell RNA-seq data. PIM entails a large class of semi-parametric models that can generate many of the traditional rank tests, such as the Wilcoxon-Mann-Whitney test and the Kruskal-Wallis test [@de2015]. These models can be seen as the rank-equivalent of the generalized linear models (GLM). PIM methods do not rely on strong distributional assumptions and they inherit the robustness of rank methods. Moreover, PIM comes with parameters that possess an informative interpretation as effect sizes. PIM can be used for complex experimental designs involving multiple groups of cells, and multiple discrete or continuous factors. PIM can also account for the different known sources of variation such as library size difference, batch, and cell cycle stage. Moreover, PIM can be applied on the raw count data without the need for a normalization preprocessing step, and it can also be integrated with other data preprocessing methods, such normalization, imputation, and removal of unwanted variation.

To illustrate how PIM can be implemented for testing DE in scRNA-seq data, we first focus on a simple scenario with two groups of cells. Let $Y_{gi}$ denote the normalized gene expression of gene $g=1,2,\dots,G$ in cell $i=1,2,\dots,n$. Let covariate $X=A$ be the treatment indicator, such that $A_i=1$ if cell i is in one group and $A_i=0$ if the cell is in the other group.

In a GLM the conditional mean $\mu_{Y_g|A} = \textrm{E}(Y_g|A)$ is modelled as a function of the factor $A$ through an appropriate link function $g(.)$. In contrast, a PIM models the conditional probability \begin{eqnarray} \textrm P(Y_{gi}\preccurlyeq Y_{gj} | A_{i}, A_{j}) = \textrm P(Y_{gi}< Y_{gj}| A_{i}, A_{j})+\frac{1}{2}\textrm P(Y_{gi}= Y_{gj}| A_{i}, A_{j}),\nonumber \label{eq:pim_prob} \end{eqnarray}

where the indices $i$ and $j$ refer cell $i$ and $j$ with associated grouping factor $A_i$ and $A_j$, respectively. This probability is known as the probabilistic index (PI).

We specify a PIM using the logit link function as \begin{eqnarray} \textrm{logit}\left{\textrm P(Y_{gi}\preccurlyeq Y_{gj} | A_{i}, A_{j})\right} = \beta_g(A_{j}-A_{i}).\nonumber \label{eq:pim_example} \end{eqnarray} The parameter $\beta_g\in \mathbb{R}$ represents the effect of the treatment on the PI of the outcome. This effect is expressed as $P(Y_{gi}\preccurlyeq Y_{gj} | A_{i}=0, A_{j}=1) = \exp(\beta_g)/(1+\exp(\beta_g))$. Once its estimate ($\hat\beta_g$) is obtained, the PI can be estimated as \begin{eqnarray} \hat {\textrm P}(Y_{gi}\preccurlyeq Y_{gj} | A_{i}, A_{j}) = \frac{e^{\hat\beta_g(A_{j}-A_{i})}}{1+e^{\hat\beta_g(A_{j}-A_{i})}} \in [0, 1]. \nonumber \end{eqnarray}

If there is a strong evidence that gene $g$ is DE, then the estimated PI (PI= $\hat {\textrm P}(Y_{gi}\preccurlyeq Y_{gj} | A_{i}=0, A_{j}=1)$) becomes close to

Under the null hypothesis (no DE), the estimated PI is expected to be 0.5, indicating that there is a 50\% chance that the expression of gene $g$ in a randomly selected cell from the $A=0$ group is lower than that of a randomly selected cell from the $A=1$ group (and vice versa).

Statistical hypothesis testing can be performed with the Wald-type test of @thas2012. For example, for the above model, the null hypothesis of no DE can be formulated as $H_0: \beta_g=0$. This is equivalent to testing $\textrm P(Y_{gi}\preccurlyeq Y_{gj} | A_{i}=0, A_{j}=1)= 0.5$. Applying this test to every candidate gene results in a vector of raw $p$-values, to which the usual FDR control procedures, such as the Benjamin and Hochberg [@benjamini1995controlling] approach, can be applied.

Installing PIMseq

PIM can be implemented for testing DE in gene expression data using the PIMseq package, available at \url(https://github.com/CenterForStatistics-UGent/PIMseq). The package can be installed and loaded using the following commands:

remotes::install_github("CenterForStatistics-UGent/PIMseq")

Data

To illustrate PIMseq using real scRNA-seq data, we use the NGP-nutlin-3 neuroblastoma scRNA-seq data accessed from @Verboom430090's study (GEO accession GSE119984). This data is a cellular perturbation experiment on the C1 instrument. This total RNA seq dataset contains 83 NGP neuroblastoma cells of which 31 were treated with 8$\mu$M of nutlin-3 and the other 52 cells were treated with vehicle (controls).

Examples

Example 1

In this particular example, we test for DE between nutlin-3 and vehicle group of cells with the log-library size as anuisance variable to acount for library size differences between cells.

# load required package
library(SingleCellExperiment) 

#load NGP neuroblastoma scRNA-seq data 
data("scNGP.data") 
scNGP.data2 <- scNGP.data[rowSums(counts(scNGP.data)>0)>10, ] # filter genes
colData(scNGP.data2)$treatment <- 
ifelse(colData(scNGP.data2)$characteristics..treatment=="nutlin",1,0)
knitr::kable(table(colData(scNGP.data2)$treatment)) # cells distribution
colData(scNGP.data2)$logLS <- log(colSums(counts(scNGP.data2))) # calculate log library size

# apply PIMseq. We use available cores for parallel computation using BiocParallel package. 
res.PIM <- PIMSeq(SCExp = scNGP.data2, condition = "treatment",
                  assay.name = "counts", nuisance.vars = "logLS",
                  BPPARAM = BiocParallel::SnowParam())

# View and explore results
knitr::kable(head(res.PIM$test.contrasts))  # result based on the standard PIM
knitr::kable(head(res.PIM$all.coefficients)) # (optional) estimated coefficients 
# and their standard error

#number of detected DE genes at 5% FDR
knitr::kable(table(res.PIM$test.contrasts$p.adjusted <0.05))

# number of detected DE genes at 5% FDR and PI <0.4 or PI >0.6
knitr::kable(table(res.PIM$test.contrasts$p.adjusted <0.05 & 
        (res.PIM$test.contrasts$PI<0.4 | res.PIM$test.contrasts$PI>0.6)))

# number of down-regulated DE genes at 5% FDR and PI <0.4
down.reg <- sum(res.PIM$test.contrasts$p.adjusted <0.05 & res.PIM$test.contrasts$PI<0.4)

# number of up-regulated DE genes at 5% FDR and PI > 0.6
up.reg <- sum(res.PIM$test.contrasts$p.adjusted <0.05 & res.PIM$test.contrasts$PI>0.6)

# rank genes based on their estimated PI  
par(mfrow=c(1,2))
plotPIrank(res.PIM, contrast = 1)
text(2500, 0.1, paste("#down.reg genes =", down.reg), col="blue")
text(9500, 0.9, paste("#up.reg genes =", up.reg), col="blue")
title(main="Gene ranking based on PI")

# alternative plots to volcano plot
plot(res.PIM$test.contrasts$PI, -log10(res.PIM$test.contrasts$p.value), 
     col=factor(res.PIM$test.contrasts$p.adjusted<0.05), las=1,
     xlab="PI", ylab="-log10(p value)", main="volcano plot")
legend("bottomright", c("FDR<0.05", "FDR>=0.05"), col=c(2,1), pch=19)

Example 2

In this example, we demonstrate PIM with a normalized scRNA-seq data. In particular, we use the counts per millions of reads (CPM) as a normalized data.

# load data
data("scNGP.data")
dim(scNGP.data)
scNGP.data2 <- scNGP.data[rowSums(counts(scNGP.data)>0)>10, ] 

CPM <- counts(scNGP.data2) %*% diag(1e6/colSums(counts(scNGP.data2)))
names(assays(scNGP.data2))
assays(scNGP.data2)$CPM <- CPM # there are two assay names now
names(assays(scNGP.data2))

colData(scNGP.data2)$treatment <- 
 ifelse(colData(scNGP.data2)$characteristics..treatment=="nutlin",1,0)
table(colData(scNGP.data2)$treatment)

res.PIM <- PIMSeq(SCExp = scNGP.data2, condition = "treatment",
                  assay.name = "CPM", nuisance.vars = NULL,
                  BPPARAM = BiocParallel::SnowParam()) # note that the assay name is changed


head(res.PIM$test.contrasts)  # results

# number of detected DE genes at 5% FDR
table(res.PIM$test.contrasts$p.adjusted <0.05) 

Example 3

In this example, we demonstrate PIM for complex experimental design. For this purpose, let us simulate artificial gene expression data for a 2x2 factorial design, i.e. two factors each with two levels. The analysis involves testing for the main effect for each factor and their interaction effect. Therefore, the demonstration includes applying PIM and test for a global DE, and afterwards, we test for a particular set of contrasts using testPIMcontrast() function. We use make.example.data() function to simulate data from negative binomial distribution as a quick approach. Moreover, we use the log library size a nuisance factor to normalize the data.

# simulate artificial data (1000 genes (with 20% of them are DE), 4x50 cells, 2 biological 
# groups, and 2 batch groups). Note that the simulation generates 2 biological groups of 
# cells and 2 batchs. We treat the batch as a second main factor. Each combination  
# of the two factors contain 50 replicates.

example.data <- make.example.data(n.gene = 1000, n.sample = 50, n.group = 2, 
                                  n.batch = 2, p.DEgenes = 0.2)
table(example.data$Group, example.data$Batch)

colData(example.data)$X1 <- as.factor(example.data$Group)
colData(example.data)$X2 <- as.factor(example.data$Batch)
colData(example.data)$logLS <- log(example.data$E.LS)

keep <- rowSums(counts(example.data)>0)>10 # filter genes
table(keep)
example.data <- example.data[keep, ]  # filter genes
dim(example.data)

# apply PIMseq
res.PIM2 <- PIMSeq(SCExp = example.data, condition = c("X1", "X2", "X1*X2"), 
                   nuisance.vars = "logLS", BPPARAM = BiocParallel::SnowParam())
head(res.PIM2$test.contrasts)
table(res.PIM2$test.contrasts$p.adjusted<0.05)  # number of DE genes, but b/n which group?

#Contrast 1: DE between levels of factor 1 when the level of factor 2 is 1
contrast1 <- testPIMcontrast(res.PIM2, contrasts = c(1, 0, 0)) 

#Contrast 2: DE between levels of factor 1 when the level of factor 2 is 2
contrast2 <- testPIMcontrast(res.PIM2, contrasts = c(1, 0, 1))

#Contrast 3: DE between levels of factor 2 when the level of factor 1 is 1 
contrast3 <- testPIMcontrast(res.PIM2, contrasts = c(0, 1, 0)) 

#Contrast 4: DE between levels of factor 2 when the level of factor 1 is 2
contrast4 <- testPIMcontrast(res.PIM2, contrasts = c(0, 1, 1))

table(contrast1$p.adjusted<0.05)
table(contrast2$p.adjusted<0.05)
table(contrast3$p.adjusted<0.05)
table(contrast4$p.adjusted<0.05)

References



CenterForStatistics-UGent/PIMseq documentation built on Jan. 15, 2024, 3:49 a.m.