In SONGDONGYUAN1994/PseudotimeDE: Inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data

knitr::opts_chunk$set(tidy = FALSE, cache = FALSE, dev = "png",
                      message = FALSE, error = FALSE, warning = TRUE)

Introduction

In the investigation of molecular mechanisms underlying cell state changes, a crucial analysis is to identify differentially expressed (DE) genes along a continuous cell trajectory, which can be estimated by pseudotime inference (also callsed trajectory inference) from single-cell RNA-sequencing (scRNA-seq) data. However, the uncertainty in pseudotime inference is ignored in existing methods. PseudotimeDE is designed to generate well-calibrated $p$-values. PseudotimeDE is flexible in allowing users to specify the pseudotime inference method and to choose the appropriate model for scRNA-seq data.

suppressPackageStartupMessages(library(PseudotimeDE))
suppressPackageStartupMessages(library(SingleCellExperiment))
suppressPackageStartupMessages(library(slingshot))
suppressPackageStartupMessages(library(tibble))
suppressPackageStartupMessages(library(dplyr))
suppressPackageStartupMessages(library(scales))
suppressPackageStartupMessages(library(irlba))

In this quickstart guide, we demonstrate the basic functionality of the PseudotimeDE package. PseudotimeDE package allows users to specify the pseudotime inference method, but we will use r Biocpkg("slingshot") as the example in our analysis.

LPS dataset

data(LPS_sce, package = "PseudotimeDE")

This Smart-seq dataset contains primary mouse dendritic cells (DCs) stimulated with lipopolysaccharide (LPS). The original data is available at Gene Expression Omnibus (GEO) under accession ID GSE45719. Here the dataset has been stored as a SingleCellExperiment object. For more information on how to construct it, please check r Biocpkg("SingleCellExperiment").

Perform pseudotime inference on the original dataset

We use slingshot to infer the pseudotime on the original LPS_sce. Since we have the prior knowledge that this dataset should be a single lineage, we let slingshot to generate only one lineage.

rd <- irlba::prcomp_irlba(t(logcounts(LPS_sce)), scale. = FALSE)$x[, 1:2]

reducedDims(LPS_sce) <- SimpleList(PCA = rd)
colData(LPS_sce)$cl <- 1

fit_ori <- slingshot(LPS_sce, reducedDim = 'PCA', clusterLabels = "cl")
LPS_ori_tbl <- tibble(cell = colnames(LPS_sce), pseudotime = rescale(colData(fit_ori)$slingPseudotime_1))

Check the input format.

head(LPS_ori_tbl)

Perform pseudotime inference on subsamples

The key step of PseudotimeDE is to use subsampling to get the uncertainty of inferred pseudotime. The default is using 80% cells. To save time, here we only generate 2 subsamples. The defulat is using 1000 subsamples, although 100 subsamples show similar performance in our empirical study. Of course, more subsamples will lead to more accurate results.

set.seed(123)
## Set the cores for parallelization. Note that mclapply does not work on Windows.
options(mc.cores = 2)
BPPARAM <- BiocParallel::bpparam()
BPPARAM$workers <- 2
LPS_index <- BiocParallel::bplapply(seq_len(2) , function(x, LPS_sce) {
  suppressPackageStartupMessages(library(SingleCellExperiment))
  sample(x = c(1:dim(LPS_sce)[2]), size = 0.8*dim(LPS_sce)[2], replace = FALSE)
}, LPS_sce = LPS_sce, BPPARAM = BPPARAM)

Users should use the exact same procedure as they used in pseudotime inference on the original dataset.

LPS_sub_tbl <- BiocParallel::bplapply(LPS_index, function(x, sce, LPS_ori_tbl) {
  suppressPackageStartupMessages(library(SingleCellExperiment))
  suppressPackageStartupMessages(library(slingshot))
  suppressPackageStartupMessages(library(tibble))
  suppressPackageStartupMessages(library(scales))
  suppressPackageStartupMessages(library(dplyr))
  sce <- sce[, x]
  rd <- irlba::prcomp_irlba(t(logcounts(sce)), scale. = FALSE)$x[, 1:2]
  reducedDims(sce) <- SimpleList(PCA = rd)

  fit <- slingshot(sce, reducedDim = 'PCA', clusterLabels = "cl")
  tbl <- tibble(cell = colnames(sce), pseudotime = rescale(colData(fit)$slingPseudotime_1))

  ## Make sure the direction of pseudotime is the same as the original pseudotime
  merge.tbl <- left_join(tbl, LPS_ori_tbl, by = "cell")

  if(cor(merge.tbl$pseudotime.x, merge.tbl$pseudotime.y) < 0) {
    tbl <- dplyr::mutate(tbl, pseudotime = 1-pseudotime)
  }
  tbl
}, sce = LPS_sce, LPS_ori_tbl = LPS_ori_tbl ,BPPARAM = BPPARAM)

We load the example LPS_ori_tbl and LPS_sub_tbl for the next DE test. The LPS_sub_tbl contains the 1000 subsample-pseudotimes.

data(LPS_ori_tbl, package = "PseudotimeDE")
data(LPS_sub_tbl, package = "PseudotimeDE")

Users can check the spreadness of the inferred pseudotime on subsamples.

PseudotimeDE::plotUncertainty(LPS_ori_tbl, LPS_sub_tbl)

Perform DE test

To save time, we only run the DE test on two example genes (CCL5, CXCL10) and 100 subsamples. Note that the DE test can be time-consuming since it is a permutation test; we strongly encourage users to allocate at least 10 cores. We specify the distribution as Negative Binomial (nb) here.

system.time(res <- PseudotimeDE::runPseudotimeDE(gene.vec = c("CCL5", "CXCL10"),
                                     ori.tbl = LPS_ori_tbl,
                                     sub.tbl = LPS_sub_tbl[1:100], ## To save time, use 100 subsamples
                                     mat = LPS_sce, ## You can also use a matrix or SeuratObj as the input
                                     model = "nb",
                                     mc.cores = 2))

We now check the output.

print(res)

para.pv is the most important output - the $p$-values of DE test. rank measures how wiggling the gene trajectory is. gam.fit is the fitted Generalized Additive Model (GAM).

Although we do not encourage this, users may choose completely ignore the uncertainty of inferred pseudotime, and use the $p$-values based on an asymptotic distribution (fix.pv). Note that fix.pv usually do not behave correctly under the null (i.g., following a $\operatorname{Uniform}(0, 1)$ distribution). The good thing is that only calculating fix.pv makes the DE test very fast.

system.time(res_fix <- PseudotimeDE::runPseudotimeDE(gene.vec = c("CCL5", "CXCL10"),
                                     ori.tbl = LPS_ori_tbl,
                                     sub.tbl = NULL, # Set as NULL to only get fix.pv
                                     mat = LPS_sce,
                                     model = "nb",
                                     mc.cores = 2))

print(res_fix)

We can visualize the gene trajectories estimated by gam.fit.

PseudotimeDE::plotCurve(gene.vec = res$gene,
                                        ori.tbl = LPS_ori_tbl,
                                        mat = LPS_sce,
                                        model.fit = res$gam.fit)

The "dropout" problem (extra zeros) can be captured by using Zero-Inflated Negative-Binomial model (ZINB). In general, we believe model = 'nb' should be used in most cases since using ZINB may cause lower power (see this paper Naught all zeros in sequence count data are the same). Another disadvantage is that ZINB is more time-consuming than NB. However, if needed, users may choose the auto decision (model = 'auto') or force to use ZINB (model = 'zinb').

system.time(res_zinb <- PseudotimeDE::runPseudotimeDE(gene.vec = c("CCL5", "CXCL10"),
                                     ori.tbl = LPS_ori_tbl,
                                     sub.tbl = LPS_sub_tbl[1:100],
                                     mat = LPS_sce,
                                     model = "zinb",
                                     mc.cores = 2))

print(res_zinb)

We can check the fitted curve by zinb.

PseudotimeDE::plotCurve(gene.vec = res_zinb$gene,
                                        ori.tbl = LPS_ori_tbl,
                                        mat = LPS_sce,
                                        model.fit = res_zinb$gam.fit)

We also added Gaussian (Normal) distribution as an distribution option into PseudotimeDE. If your input is log-transformed counts (so they are close to Gaussian), you can set model = 'gaussian'. Gaussian is usually faster than NB.

# first generate the logcount and added it to LPS_sce
the_counts <- counts(LPS_sce) 
logcounts(LPS_sce) <- log1p(the_counts)
system.time(res_gaussian <- PseudotimeDE::runPseudotimeDE(gene.vec = c("CCL5", "CXCL10"),
                                     ori.tbl = LPS_ori_tbl,
                                     sub.tbl = LPS_sub_tbl[1:100],
                                     mat = LPS_sce,
                                     model = "gaussian", 
                                     assay.use = "logcounts",
                                     mc.cores = 2))

print(res_gaussian)

We can check the fitted curve by gaussian.

PseudotimeDE::plotCurve(gene.vec = res_gaussian$gene,
                                        ori.tbl = LPS_ori_tbl,
                                        mat = LPS_sce,
                                        model.fit = res_gaussian$gam.fit, assay.use = "logcounts")

Session Info

sessionInfo()

SONGDONGYUAN1994/PseudotimeDE documentation built on March 29, 2025, 12:09 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

SONGDONGYUAN1994/PseudotimeDE
Inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data

In SONGDONGYUAN1994/PseudotimeDE: Inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data

Introduction

LPS dataset

Perform pseudotime inference on the original dataset

Perform pseudotime inference on subsamples

Perform DE test

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

SONGDONGYUAN1994/PseudotimeDE Inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data

In SONGDONGYUAN1994/PseudotimeDE: Inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data

Introduction

LPS dataset

Perform pseudotime inference on the original dataset

Perform pseudotime inference on subsamples

Perform DE test

Session Info

R Package Documentation

Browse R Packages

We want your feedback!

SONGDONGYUAN1994/PseudotimeDE
Inference of differential gene expression along cell pseudotime with well-calibrated p-values from single-cell RNA sequencing data