In wmacnair/psupertime: Psupertime is supervised pseudotime for single cell RNAseq data

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  package.startup.message = FALSE
)

`psupertime` overview

psupertime is an R package for analysing single cell RNA-seq data where groups of the cells have labels with a known or expected sequence (for example, samples from a time series experiment day1, day2, ..., day5). It uses ordinal logistic regression to identify a set of genes which recapitulate the group-level sequence for individual cells.

In this short vignette, we show a simple example of psupertime, corresponding to the data in Figure 1D of the biorxiv manuscript. As single cell RNA-seq datasets are typically large, we have only included one small dataset in the core psupertime package. To allow replication of all the examples in the psupertime paper, we have also made the psupplementary package, here.

Basic usage

psupertime requires two inputs:

x, containing single cell log pseudocount data, either as a SingleCellExperiment object or a matrix with rows as genes and columns as cells; and
y, a factor with labels for each cell, where the sequence of the factor levels is the required sequential label ordering.

We demonstrate psupertime on a small example dataset, comprising pancreas cells taken from donors of varying ages. The most straightforward way to run psupertime is very simple:

# load psupertime package
suppressPackageStartupMessages({
  library('psupertime')
  library('SingleCellExperiment')
  })

# load the data
data(acinar_hvg_sce)

# run psupertime
y           = acinar_hvg_sce$donor_age
psuper_obj  = psupertime(acinar_hvg_sce, y, sel_genes='all')
psuper_obj

(Calling psuper_obj or print(psuper_obj) gives a quick summary of the fitted model.)

Here, we ran psupertime using all genes, by specifying sel_genes='all'. Typically we would increase speed by restricting the analysis to only interesting genes, for example the default setting is to restrict to only highly varying genes (HVGs), as described in scran. To keep this main package light, we pre-selected the highly variable genes (in acinar_hvg_sce). The standard way to call psupertime, with a full-size dataset (i.e. without pre-selection of genes) is like this:

## not run
# psuper_obj  = psupertime(x, y)

`psupertime` outputs

Once you have run psupertime, you can produce a range of plots to check the outputs, for example:

a diagnostic plot of the psupertime fitting process, to check how accurately psupertime was able to recapitulate the sequence, and the level of regularization selected;
the distribution of the label sequence along the learned pseudotime;
the genes with the largest absolute coefficients learned by psupertime; and
the expression profiles over the individual cells for these genes.

Model diagnostics

The plot below shows how several measures of performance are affected by the extent of regularization, $\lambda$. The x-axis shows $\lambda$, indicating how strongly the model tries to set coefficients to zero. The optimal value of $\lambda$ is the one which gives the best mean performance over the training data, based on one of two possible measures of performance.

g       = plot_train_results(psuper_obj)
(g)

The first row shows classification error, namely the proportion of cells for which psupertime predicted the wrong label (equivalent to 1 - accuracy). The second row is cross-entropy, which quantifies how confidently the psupertime classifier predicts the correct label (so predicting the correct label with probability $p=0.9$ results in a lower cross-entropy than with probability $p=0.5$). Accuracy is a 'lumpy' measurement of performance (something is either correct or not), whereas cross-entropy is continuous; this means that selecting $\lambda$ on the basis of cross-entropy results in less noisy selection of the $\lambda$ value.

The third row shows the number of genes with non-zero coefficients, for each given value of $\lambda$ (this is effectively the inverse of sparsity, which is the proportion of zero coefficients).

The solid vertical grey line shows the value of $\lambda$ resulting in the best performance. The dashed vertical grey line shows the largest value of $\lambda$ with performance within one standard error of this. By default psupertime selects this value, giving increased sparsity at a minimal cost to performance. We show lines for selection using both classification error and cross-entropy; the thicker lines indicate which measure was actually used to select $\lambda$. In this case we used the $\lambda$ value within 1 s.e. of the best performance on cross-entropy. Reading down to the plot of non-zero genes, we can see that this resulted in just under 100 genes with non-zero coefficients.

`psupertime` ordering of cells

Like other pseudotime methods, one output from psupertime is an ordering for the individual cells (shown below). In this case of psupertime, this ordering should broadly follow the group-level labels given as inputs.

The x-axis shows the one-dimensional projection learned by psupertime. The different colours are the sequential labels used as input to psupertime, with the y-axis showing their densities over the pseudotime. The vertical lines indicate the point with equal probability of prediction between each pair of successive labels. For example, the first vertical line (blue, x=$\sim$-6) shows the value of pseudotime at which psupertime predicts the labels 1 year vs {5,6,21,22,38,44,54} years with equal probability.

g       = plot_labels_over_psupertime(psuper_obj, label_name='Donor age')
(g)

Interesting things you might observe:

Individual cells may have earlier or later values than others with the same label, possibly suggesting interesting subpopulations within a group label.
The thresholds learned by psupertime indicate how easy it is to distinguish between the different labels: where thresholds are close together, these labels are hard to separate, and where they are distant this task is easier.

Genes identified by `psupertime`

psupertime identifies a small set of genes which place the individual cells approximately in the order of the group-level labels. This list can be the most relevant output from psupertime. The plot below shows the 20 genes with the largest absolute coefficient values (subject to the absolute value being $>0.05$). Genes with positive coefficients will have expression positively correlated with the group-level labels, and vice versa for negative coefficients.

g       = plot_identified_gene_coefficients(psuper_obj)
(g)

Another way of examining these genes is to plot their expression values against the learned pseudotime values. The plot below shows the same set of genes, with the (z-scored log) expression values for all individual cells. This can show different profiles of expression, e.g. initially on, then switched off (ITM2A); and increasing or decreasing relatively constantly (CLU).

g       = plot_identified_genes_over_psupertime(psuper_obj, label_name='Donor age')
(g)

Such gene plots can also potentially identify branching, for example where expression of a given gene is initially unimodal, but later becomes bimodal.

`psupertime` as a classifier

psupertime is a classifier, in the sense that once trained, it can predict a label for any cell given as input. Comparing the predicted classes of cells against their known classes can identify interesting subpopulations of cells.

In the plot below, the x-axis shows the labels used to train psupertime; the y-axis shows the labels of the data used as input for this instance of psupertime (which in this case are the same as the predicted labels). The value in each box shows the number of cells with the known label for the row, which were predicted to have the column label. The colour corresponds to the proportions of the known label across the different possible predictions; within each row, the colours 'add up to 1'.

We can use this to identify groups of cells whose predicted labels differ from their true labels. For example, considering the cells with true label 6 years (third row from the bottom), two thirds have predicted donor age 5, while the remaining third have predicted donor age 21. [For this example dataset, this analysis doesn't seem super interesting, but there are others where it is useful! Look at the vignettes for the psupplementary package for more interesting examples.]

g       = plot_predictions_against_classes(psuper_obj)
(g)

psupertime can also be applied to data with unknown or different labels. In that case, the x-axis would remain the same, with the labels used to train the psupertime, but the y-axis would be different. Using it on the data used for training means we can check how accurate its labelling is (when psupertime is accurate, all the values should be on the diagonal), and in particular check whether it is less accurate for some labels.

Alternative ways to run `psupertime`

Above, we ran psupertime with the default settings. Here are some obvious settings you could consider changing:

Selection of genes The default setting for psupertime is to restrict the analysis to highly varying genes, using the method described in scran (see here). Here are some alternative methods for selecting genes for running psupertime.

# Option 1 (default): Select highly variable genes, using default settings for `scran`.
psuper_hvg      = psupertime(acinar_hvg_sce, y)
psuper_hvg      = psupertime(acinar_hvg_sce, y, sel_genes='hvg')

# Option 2: Select highly variable genes, using your own settings.
psuper_hvg_custom1  = psupertime(acinar_hvg_sce, y, sel_genes=list(hvg_cutoff=0.1, bio_cutoff=0.5))
psuper_hvg_custom2  = psupertime(acinar_hvg_sce, y, sel_genes=list(hvg_cutoff=0.1, bio_cutoff=0.5, span=0.1))

# Option 3: Use all genes
psuper_all      = psupertime(acinar_hvg_sce, y, sel_genes='all')

# Option 4: Use transcription factors
psuper_tf       = psupertime(acinar_hvg_sce, y, sel_genes='tf_human')

# Option 5: Use user-defined list of genes
psuper_sel      = psupertime(
  acinar_hvg_sce, y, 
  sel_genes='list', 
  gene_list=c('ITM2A', 'CLU', 'HSPH1', 'ADH1C', 'AMY2B')
  )

Performance characteristics By default, psupertime uses the largest regularization which results in performance within one standard error of the best performance (penalization='1se'). You can choose to have the maximum performance (typically resulting in a larger set of non-zero genes), by using penalization='best'. You can also change the measure of performance used.

# run psupertime with different settings
psuper_1se  = psupertime(acinar_hvg_sce, y, sel_genes='all', penalization='1se')
psuper_best = psupertime(acinar_hvg_sce, y, sel_genes='all', penalization='best')
psuper_acc  = psupertime(acinar_hvg_sce, y, sel_genes='all', score='class_error')

# display results
psuper_1se
psuper_best
psuper_acc

To see the full details of how psupertime can be used, read the documentation in the package:

?psupertime

wmacnair/psupertime documentation built on July 10, 2020, 8:12 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

wmacnair/psupertime
Psupertime is supervised pseudotime for single cell RNAseq data

In wmacnair/psupertime: Psupertime is supervised pseudotime for single cell RNAseq data

`psupertime` overview

Basic usage

`psupertime` outputs

Model diagnostics

`psupertime` ordering of cells

Genes identified by `psupertime`

`psupertime` as a classifier

Alternative ways to run `psupertime`

R Package Documentation

Browse R Packages

We want your feedback!

wmacnair/psupertime Psupertime is supervised pseudotime for single cell RNAseq data

In wmacnair/psupertime: Psupertime is supervised pseudotime for single cell RNAseq data

psupertime overview

Basic usage

psupertime outputs

Model diagnostics

psupertime ordering of cells

Genes identified by psupertime

psupertime as a classifier

Alternative ways to run psupertime

R Package Documentation

Browse R Packages

We want your feedback!

wmacnair/psupertime
Psupertime is supervised pseudotime for single cell RNAseq data

`psupertime` overview

`psupertime` outputs

`psupertime` ordering of cells

Genes identified by `psupertime`

`psupertime` as a classifier

Alternative ways to run `psupertime`