knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  package.startup.message = FALSE
)

psupertime overview

psupertime is an R package for analysing single cell RNA-seq data where groups of the cells have labels with a known or expected sequence (for example, samples from a time series experiment day1, day2, ..., day5). It uses ordinal logistic regression to identify a set of genes which recapitulate the group-level sequence for individual cells.

In this short vignette, we show a simple example of psupertime, corresponding to the data in Figure 1D of the biorxiv manuscript. As single cell RNA-seq datasets are typically large, we have only included one small dataset in the core psupertime package. To allow replication of all the examples in the psupertime paper, we have also made the psupplementary package, here.

Basic usage

psupertime requires two inputs:

We demonstrate psupertime on a small example dataset, comprising pancreas cells taken from donors of varying ages. The most straightforward way to run psupertime is very simple:

# load psupertime package
suppressPackageStartupMessages({
  library('psupertime')
  library('SingleCellExperiment')
  })

# load the data
data(acinar_hvg_sce)

# run psupertime
y           = acinar_hvg_sce$donor_age
psuper_obj  = psupertime(acinar_hvg_sce, y, sel_genes='all')
psuper_obj

(Calling psuper_obj or print(psuper_obj) gives a quick summary of the fitted model.)

Here, we ran psupertime using all genes, by specifying sel_genes='all'. Typically we would increase speed by restricting the analysis to only interesting genes, for example the default setting is to restrict to only highly varying genes (HVGs), as described in scran. To keep this main package light, we pre-selected the highly variable genes (in acinar_hvg_sce). The standard way to call psupertime, with a full-size dataset (i.e. without pre-selection of genes) is like this:

## not run
# psuper_obj  = psupertime(x, y)

psupertime outputs

Once you have run psupertime, you can produce a range of plots to check the outputs, for example:

Model diagnostics

The plot below shows how several measures of performance are affected by the extent of regularization, $\lambda$. The x-axis shows $\lambda$, indicating how strongly the model tries to set coefficients to zero. The optimal value of $\lambda$ is the one which gives the best mean performance over the training data, based on one of two possible measures of performance.

g       = plot_train_results(psuper_obj)
(g)

The first row shows classification error, namely the proportion of cells for which psupertime predicted the wrong label (equivalent to 1 - accuracy). The second row is cross-entropy, which quantifies how confidently the psupertime classifier predicts the correct label (so predicting the correct label with probability $p=0.9$ results in a lower cross-entropy than with probability $p=0.5$). Accuracy is a 'lumpy' measurement of performance (something is either correct or not), whereas cross-entropy is continuous; this means that selecting $\lambda$ on the basis of cross-entropy results in less noisy selection of the $\lambda$ value.

The third row shows the number of genes with non-zero coefficients, for each given value of $\lambda$ (this is effectively the inverse of sparsity, which is the proportion of zero coefficients).

The solid vertical grey line shows the value of $\lambda$ resulting in the best performance. The dashed vertical grey line shows the largest value of $\lambda$ with performance within one standard error of this. By default psupertime selects this value, giving increased sparsity at a minimal cost to performance. We show lines for selection using both classification error and cross-entropy; the thicker lines indicate which measure was actually used to select $\lambda$. In this case we used the $\lambda$ value within 1 s.e. of the best performance on cross-entropy. Reading down to the plot of non-zero genes, we can see that this resulted in just under 100 genes with non-zero coefficients.

psupertime ordering of cells

Like other pseudotime methods, one output from psupertime is an ordering for the individual cells (shown below). In this case of psupertime, this ordering should broadly follow the group-level labels given as inputs.

The x-axis shows the one-dimensional projection learned by psupertime. The different colours are the sequential labels used as input to psupertime, with the y-axis showing their densities over the pseudotime. The vertical lines indicate the point with equal probability of prediction between each pair of successive labels. For example, the first vertical line (blue, x=$\sim$-6) shows the value of pseudotime at which psupertime predicts the labels 1 year vs {5,6,21,22,38,44,54} years with equal probability.

g       = plot_labels_over_psupertime(psuper_obj, label_name='Donor age')
(g)

Interesting things you might observe:

Genes identified by psupertime

psupertime identifies a small set of genes which place the individual cells approximately in the order of the group-level labels. This list can be the most relevant output from psupertime. The plot below shows the 20 genes with the largest absolute coefficient values (subject to the absolute value being $>0.05$). Genes with positive coefficients will have expression positively correlated with the group-level labels, and vice versa for negative coefficients.

g       = plot_identified_gene_coefficients(psuper_obj)
(g)

Another way of examining these genes is to plot their expression values against the learned pseudotime values. The plot below shows the same set of genes, with the (z-scored log) expression values for all individual cells. This can show different profiles of expression, e.g. initially on, then switched off (ITM2A); and increasing or decreasing relatively constantly (CLU).

g       = plot_identified_genes_over_psupertime(psuper_obj, label_name='Donor age')
(g)

Such gene plots can also potentially identify branching, for example where expression of a given gene is initially unimodal, but later becomes bimodal.

psupertime as a classifier

psupertime is a classifier, in the sense that once trained, it can predict a label for any cell given as input. Comparing the predicted classes of cells against their known classes can identify interesting subpopulations of cells.

In the plot below, the x-axis shows the labels used to train psupertime; the y-axis shows the labels of the data used as input for this instance of psupertime (which in this case are the same as the predicted labels). The value in each box shows the number of cells with the known label for the row, which were predicted to have the column label. The colour corresponds to the proportions of the known label across the different possible predictions; within each row, the colours 'add up to 1'.

We can use this to identify groups of cells whose predicted labels differ from their true labels. For example, considering the cells with true label 6 years (third row from the bottom), two thirds have predicted donor age 5, while the remaining third have predicted donor age 21. [For this example dataset, this analysis doesn't seem super interesting, but there are others where it is useful! Look at the vignettes for the psupplementary package for more interesting examples.]

g       = plot_predictions_against_classes(psuper_obj)
(g)

psupertime can also be applied to data with unknown or different labels. In that case, the x-axis would remain the same, with the labels used to train the psupertime, but the y-axis would be different. Using it on the data used for training means we can check how accurate its labelling is (when psupertime is accurate, all the values should be on the diagonal), and in particular check whether it is less accurate for some labels.

Alternative ways to run psupertime

Above, we ran psupertime with the default settings. Here are some obvious settings you could consider changing:

Selection of genes The default setting for psupertime is to restrict the analysis to highly varying genes, using the method described in scran (see here). Here are some alternative methods for selecting genes for running psupertime.

# Option 1 (default): Select highly variable genes, using default settings for `scran`.
psuper_hvg      = psupertime(acinar_hvg_sce, y)
psuper_hvg      = psupertime(acinar_hvg_sce, y, sel_genes='hvg')

# Option 2: Select highly variable genes, using your own settings.
psuper_hvg_custom1  = psupertime(acinar_hvg_sce, y, sel_genes=list(hvg_cutoff=0.1, bio_cutoff=0.5))
psuper_hvg_custom2  = psupertime(acinar_hvg_sce, y, sel_genes=list(hvg_cutoff=0.1, bio_cutoff=0.5, span=0.1))

# Option 3: Use all genes
psuper_all      = psupertime(acinar_hvg_sce, y, sel_genes='all')

# Option 4: Use transcription factors
psuper_tf       = psupertime(acinar_hvg_sce, y, sel_genes='tf_human')

# Option 5: Use user-defined list of genes
psuper_sel      = psupertime(
  acinar_hvg_sce, y, 
  sel_genes='list', 
  gene_list=c('ITM2A', 'CLU', 'HSPH1', 'ADH1C', 'AMY2B')
  )

Performance characteristics By default, psupertime uses the largest regularization which results in performance within one standard error of the best performance (penalization='1se'). You can choose to have the maximum performance (typically resulting in a larger set of non-zero genes), by using penalization='best'. You can also change the measure of performance used.

# run psupertime with different settings
psuper_1se  = psupertime(acinar_hvg_sce, y, sel_genes='all', penalization='1se')
psuper_best = psupertime(acinar_hvg_sce, y, sel_genes='all', penalization='best')
psuper_acc  = psupertime(acinar_hvg_sce, y, sel_genes='all', score='class_error')

# display results
psuper_1se
psuper_best
psuper_acc

To see the full details of how psupertime can be used, read the documentation in the package:

?psupertime


wmacnair/psupertime documentation built on July 10, 2020, 8:12 p.m.