knitr::opts_chunk$set(tidy = FALSE, cache = FALSE, dev = "png", message = FALSE, error = FALSE, warning = TRUE)
In the investigation of molecular mechanisms underlying cell state changes, a crucial analysis is to identify differentially expressed (DE) genes along a continuous cell trajectory, which can be estimated by pseudotime inference (also callsed trajectory inference) from single-cell RNA-sequencing (scRNA-seq) data. However, the uncertainty in pseudotime inference is ignored in existing methods. PseudotimeDE is designed to generate well-calibrated $p$-values. PseudotimeDE is flexible in allowing users to specify the pseudotime inference method and to choose the appropriate model for scRNA-seq data.
suppressPackageStartupMessages(library(PseudotimeDE)) suppressPackageStartupMessages(library(SingleCellExperiment)) suppressPackageStartupMessages(library(slingshot)) suppressPackageStartupMessages(library(tibble)) suppressPackageStartupMessages(library(dplyr)) suppressPackageStartupMessages(library(scales)) suppressPackageStartupMessages(library(irlba))
In this quickstart guide, we demonstrate the basic functionality of the PseudotimeDE package. PseudotimeDE package allows users to specify the pseudotime inference method, but we will use r Biocpkg("slingshot")
as the example in our analysis.
data(LPS_sce, package = "PseudotimeDE")
This Smart-seq dataset contains primary mouse dendritic cells (DCs) stimulated with lipopolysaccharide (LPS). The original data is available at Gene Expression Omnibus (GEO) under accession ID GSE45719. Here the dataset has been stored as a SingleCellExperiment
object. For more information on how to construct it, please check r Biocpkg("SingleCellExperiment")
.
We use slingshot to infer the pseudotime on the original LPS_sce
. Since we have the prior knowledge that this dataset should be a single lineage, we let slingshot to generate only one lineage.
rd <- irlba::prcomp_irlba(t(logcounts(LPS_sce)), scale. = FALSE)$x[, 1:2] reducedDims(LPS_sce) <- SimpleList(PCA = rd) colData(LPS_sce)$cl <- 1 fit_ori <- slingshot(LPS_sce, reducedDim = 'PCA', clusterLabels = "cl") LPS_ori_tbl <- tibble(cell = colnames(LPS_sce), pseudotime = rescale(colData(fit_ori)$slingPseudotime_1))
Check the input format.
head(LPS_ori_tbl)
The key step of PseudotimeDE is to use subsampling to get the uncertainty of inferred pseudotime. The default is using 80% cells. To save time, here we only generate 2 subsamples. The defulat is using 1000 subsamples, although 100 subsamples show similar performance in our empirical study. Of course, more subsamples will lead to more accurate results.
set.seed(123) ## Set the cores for parallelization. Note that mclapply does not work on Windows. options(mc.cores = 2) BPPARAM <- BiocParallel::bpparam() BPPARAM$workers <- 2 LPS_index <- BiocParallel::bplapply(seq_len(2) , function(x, LPS_sce) { suppressPackageStartupMessages(library(SingleCellExperiment)) sample(x = c(1:dim(LPS_sce)[2]), size = 0.8*dim(LPS_sce)[2], replace = FALSE) }, LPS_sce = LPS_sce, BPPARAM = BPPARAM)
Users should use the exact same procedure as they used in pseudotime inference on the original dataset.
LPS_sub_tbl <- BiocParallel::bplapply(LPS_index, function(x, sce, LPS_ori_tbl) { suppressPackageStartupMessages(library(SingleCellExperiment)) suppressPackageStartupMessages(library(slingshot)) suppressPackageStartupMessages(library(tibble)) suppressPackageStartupMessages(library(scales)) suppressPackageStartupMessages(library(dplyr)) sce <- sce[, x] rd <- irlba::prcomp_irlba(t(logcounts(sce)), scale. = FALSE)$x[, 1:2] reducedDims(sce) <- SimpleList(PCA = rd) fit <- slingshot(sce, reducedDim = 'PCA', clusterLabels = "cl") tbl <- tibble(cell = colnames(sce), pseudotime = rescale(colData(fit)$slingPseudotime_1)) ## Make sure the direction of pseudotime is the same as the original pseudotime merge.tbl <- left_join(tbl, LPS_ori_tbl, by = "cell") if(cor(merge.tbl$pseudotime.x, merge.tbl$pseudotime.y) < 0) { tbl <- dplyr::mutate(tbl, pseudotime = 1-pseudotime) } tbl }, sce = LPS_sce, LPS_ori_tbl = LPS_ori_tbl ,BPPARAM = BPPARAM)
We load the example LPS_ori_tbl
and LPS_sub_tbl
for the next DE test. The LPS_sub_tbl
contains the 1000 subsample-pseudotimes.
data(LPS_ori_tbl, package = "PseudotimeDE") data(LPS_sub_tbl, package = "PseudotimeDE")
Users can check the spreadness of the inferred pseudotime on subsamples.
PseudotimeDE::plotUncertainty(LPS_ori_tbl, LPS_sub_tbl)
To save time, we only run the DE test on two example genes (CCL5, CXCL10) and 100 subsamples. Note that the DE test can be time-consuming since it is a permutation test; we strongly encourage users to allocate at least 10 cores. We specify the distribution as Negative Binomial (nb
) here.
system.time(res <- PseudotimeDE::runPseudotimeDE(gene.vec = c("CCL5", "CXCL10"), ori.tbl = LPS_ori_tbl, sub.tbl = LPS_sub_tbl[1:100], ## To save time, use 100 subsamples mat = LPS_sce, ## You can also use a matrix or SeuratObj as the input model = "nb", mc.cores = 2))
We now check the output.
print(res)
para.pv
is the most important output - the $p$-values of DE test. rank
measures how wiggling the gene trajectory is. gam.fit
is the fitted Generalized Additive Model (GAM).
Although we do not encourage this, users may choose completely ignore the uncertainty of inferred pseudotime, and use the $p$-values based on an asymptotic distribution (fix.pv
). Note that fix.pv
usually do not behave correctly under the null (i.g., following a $\operatorname{Uniform}(0, 1)$ distribution). The good thing is that only calculating fix.pv
makes the DE test very fast.
system.time(res_fix <- PseudotimeDE::runPseudotimeDE(gene.vec = c("CCL5", "CXCL10"), ori.tbl = LPS_ori_tbl, sub.tbl = NULL, # Set as NULL to only get fix.pv mat = LPS_sce, model = "nb", mc.cores = 2))
print(res_fix)
We can visualize the gene trajectories estimated by gam.fit
.
PseudotimeDE::plotCurve(gene.vec = res$gene, ori.tbl = LPS_ori_tbl, mat = LPS_sce, model.fit = res$gam.fit)
The "dropout" problem (extra zeros) can be captured by using Zero-Inflated Negative-Binomial model (ZINB). In general, we believe model = 'nb'
should be used in most cases since using ZINB may cause lower power (see this paper Naught all zeros in sequence count data are the same). Another disadvantage is that ZINB is more time-consuming than NB. However, if needed, users may choose the auto decision (model = 'auto'
) or force to use ZINB (model = 'zinb'
).
system.time(res_zinb <- PseudotimeDE::runPseudotimeDE(gene.vec = c("CCL5", "CXCL10"), ori.tbl = LPS_ori_tbl, sub.tbl = LPS_sub_tbl[1:100], mat = LPS_sce, model = "zinb", mc.cores = 2)) print(res_zinb)
We can check the fitted curve by zinb
.
PseudotimeDE::plotCurve(gene.vec = res_zinb$gene, ori.tbl = LPS_ori_tbl, mat = LPS_sce, model.fit = res_zinb$gam.fit)
We also added Gaussian (Normal) distribution as an distribution option into PseudotimeDE. If your input is log-transformed counts (so they are close to Gaussian), you can set model = 'gaussian'
. Gaussian is usually faster than NB.
# first generate the logcount and added it to LPS_sce the_counts <- counts(LPS_sce) logcounts(LPS_sce) <- log1p(the_counts) system.time(res_gaussian <- PseudotimeDE::runPseudotimeDE(gene.vec = c("CCL5", "CXCL10"), ori.tbl = LPS_ori_tbl, sub.tbl = LPS_sub_tbl[1:100], mat = LPS_sce, model = "gaussian", assay.use = "logcounts", mc.cores = 2))
print(res_gaussian)
We can check the fitted curve by gaussian
.
PseudotimeDE::plotCurve(gene.vec = res_gaussian$gene, ori.tbl = LPS_ori_tbl, mat = LPS_sce, model.fit = res_gaussian$gam.fit, assay.use = "logcounts")
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.