In powellgenomicslab/scPred: scPred - accurate supervised method for cell-type classification from single-cell RNA-seq data

knitr::opts_chunk$set(
  fig.width = 6, 
  fig.height = 4.8, 
  cache = TRUE
)

Summary

scPred is a general method to classify cells based on a low-dimensional representation of gene expression (e.g. PCA).

For more details see our paper in Genome Biology:

scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data

Application of `scPred`

First, we'll load the scPred package and Seurat.

library("scPred")
library("Seurat")
library("magrittr")

We will work with Peripheral Mononuclear Blood Cells (PBMCs) from two different individuals. The libraries were processed using the Chromium system -10× Genomics- and sequenced with an Illumina NovaSeq 6000 platform. See Comparative performance of the BGI and Illumina sequencing technology for single-cell RNA-sequencing for more details on the samples.

For this tutorial, we'll use the PBMCs from one individual to build cell classifiers for the populations of interest. Then, we'll apply these models to an indepent dataset of PBMCs from another independent individual.

reference <- scPred::pbmc_1
query <- scPred::pbmc_2

scPred is now built to be incorporated withing the Seurat framework. Similar to clustering in Seurat, scPred uses the cell embeddings from a principal component analysis to make inferences about cell-type identity. However —unlike clustering—, scPred trains classifiers for each cell type of interest in a supervised manner by using the known cell identity from a reference dataset to guide the classification of cells in a different data set.

The following code:

Normalizes the gene expression data from the reference data set weighting by the counts by the "sequencing depth" of each cell and applying a natural logarithmic transformation
Finds a set of variable features by modeling the mean and variance log-transformed expression
Scales the expression of each gene bu subtracting the mean expression across all cells and dividing by the standard deviation
Runs a PCA
Runs a UMAP using the top 30 most variable PCs

reference <- reference %>% 
  NormalizeData() %>% 
  FindVariableFeatures() %>% 
  ScaleData() %>% 
  RunPCA() %>% 
  RunUMAP(dims = 1:30)

The column cell_type contains the identity of each cell in the meta data slot. Let's plot the UMAP and grouping the cells by cell type.

DimPlot(reference, group.by = "cell_type", label = TRUE, repel = TRUE)

Training classifiers with `scPred`

Firstly, let's get the feature space to train the classifiers. By default, scPred will use all principal components. The reference labels of each cells are specified as the second parameter value of the function (in this case the cell_type column.

getFeatureSpace will create a scPred object stored in the @misc slot. This object will contained all required information to classify cells. See ?getFeatureSpace help documentation.

reference <- getFeatureSpace(reference, "cell_type")

Secondly, we train the classifiers for each cell using the trainModel function. By default, scPred will use a support vector machine with a radial kernel.

reference <- trainModel(reference)

Training probabilities for each cell in the reference data can be accessed using the get_probabilities method:

get_probabilities(reference) %>% head()

We can use the get_scpred method to retrieve the scPred object from the Seurat object. Printing a scPred object will show for each cell type:

The number of cells
Number of features used to train the model
Prediction model (method)
Performance metrics

get_scpred(reference)

To visualize the performance for each cell type we can use the plot_probabilities function:

plot_probabilities(reference)

From the previous plot we can observe an overall lower performance for classical monocytes (cMono) and non-classical monocytes (ncMono).

Depending on the data, other models may show an better performance. scPred is built on top of the caret package and allows using a large set of prediction models (e.g. logistic regression, decision trees, bagging, neural networks, etc). To see the list of available models see available models in caret.

A different model can be specified using the model parameter and providing the method value from caret (e.g. mda for a mixture discriminant analysis using the mda package). Additionally, if only an mda model wants to be applied to a subset of cells, we can specify this using the reclassify parameter. In this case, we want to train different models for "cMono" and "ncMono" to improve their classification performance:

reference <- trainModel(reference, model = "mda", reclassify = c("cMono", "ncMono"))

The code above trains a mixture discriminant analysis for two cell types and preserves the previous support vector machines for the remaing cell types.

We can observe a change in the sensitivity for "cMono" and "ncMono":

get_scpred(reference)

and also verify that higher probabilities for these cell types were obtained by plotting the training probabilities again:

plot_probabilities(reference)

Cell classification

An important requirement for classifying cells is using the same normalization method for both the reference and the query datasets.

First, let's normalize the query dataset (cells to be classfied).

query <- NormalizeData(query)

Finally, we ca classify the cells from the query data using the scPredict function. The first argument corresponds to the query object and the second to the reference object (with a scPred model trained already).

scPred now uses Harmony to align the query data onto the training low-dimensional space used as reference. Once the data is aligned, cells are classified using the pre-trained models.

scPredict will return the query dataset. Make sure the left-side value of the <- operator corresponds to the query data.

query <- scPredict(query, reference)

scPred will store the final classifications in the scpred_prediction column of the Seurat meta data. Likewise, it will store a the aligned data and store it as a scpred reduction.

Let's plot the classifications over the aligned data.

DimPlot(query, group.by = "scpred_prediction", reduction = "scpred")

We can also run UMAP using the aligned data as an input

query <- RunUMAP(query, reduction = "scpred", dims = 1:30)

and plot the predicted labels for each cell type over the UMAP:

DimPlot(query, group.by = "scpred_prediction", label = TRUE, repel = TRUE)

We can compare the results with the original labels:

DimPlot(query, group.by = "cell_type", label = TRUE, repel = TRUE)

Additionally, scPred stores the probabilities of each cell in the @meta.data slot of the query Seurat object. We can visualize the probabilities over the UMAP plot:

FeaturePlot(query, c("scpred_B.cell", "scpred_CD4.T.cell", "scpred_CD8.T.cell", 
                     "scpred_cMono", "scpred_ncMono", "scpred_Plasma.cell", 
                     "scpred_cDC", "scpred_pDC"))

To verify the performance of the models in the query dataset, we can use the crossTab to create a contingency table using two colums from the metadata. In this example, the cell type info is sotred in the cell_type columns and the predicted labels for each cell in the scpred_prediction column.

crossTab(query, "cell_type", "scpred_prediction")

The proportion of cells can be obtained using output = "prop

crossTab(query, "cell_type", "scpred_prediction", output = "prop")

Advanced options

Accessing classifiers

The raw models for each cell type can be retrieved using the get_classifiers() function. This will return a list of train objects.

get_classifiers(reference)

Each model can be normally treated using the caret enviroment. For example, we can plot the performance resamples using the plot.train:

caret::plot.train(get_classifiers(reference)[["NK cell"]])

Using a different prediction model approach

As shown before, a different classification method can be used using the model parameter by providing a distinct method value of type classification as handled by caret (see available models in caret). Most available models will require the user to install new packages.

The following code trains a logistic regression via glm() for each cell type:

reference <- trainModel(reference, model = "glm")
get_scpred(reference)

Avoid aligning the data multiple times

Training and alignning the data are separate processes. Therefore, if a query dataset has already being aligned to a reference data via scPred/harmony and the prediction models have changed, then we can use recompute_alignment = FALSE to avoid aligning step (as the alignment is already stored in the query object)

query <- scPredict(query, reference, recompute_alignment = FALSE)

The code above will only apply the classificatio models.

Using a different probability threshold

By default, scPred now uses a relaxed probability theshold of 0.55 to label cells. If none of the classifiers provide a probability higher than the threshold for a given cell, then it is labelled as "unassigned". This value can be changed using the threshold parameter:

query <- scPredict(query, reference, recompute_alignment = FALSE, threshold = 0.9)

In the case of a binary classification (only two cell types), a threshold equals 0.5 implies no "unassigned labeling".

Parallel training

Depending on the sample size of the reference dataset and the number of cell types, training models can be computatinally expensive. The resampling performed for each model can be parallelized via doParallel to speed-up the training step as follows:

library(doParallel)
cl <- makePSOCKcluster(2)
registerDoParallel(cl)
reference <- trainModel(reference, model = "mda", allowParallel = TRUE)
stopCluster(cl)

allowParallel = TRUE has to be set in order for trainModel to be able to run the resamplings in parallel

The previous code uses 2 cores.

See Caret parallel processing for more details

Applying `scPred` classifiers without `Seurat` object

Once final scPred models have been obtained, we can extract the scPred object from the Seurat object and apply the classifiers in other datasets.

scpred <- get_scpred(reference)
query <- scPredict(query, scpred)

From now on, only the scPred models can be imported and applied to other query Seurat objects/datasets.

Session information

options(width = 120)
devtools::session_info()

powellgenomicslab/scPred documentation built on July 16, 2021, 12:14 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

powellgenomicslab/scPred
scPred - accurate supervised method for cell-type classification from single-cell RNA-seq data

In powellgenomicslab/scPred: scPred - accurate supervised method for cell-type classification from single-cell RNA-seq data

Summary

Application of `scPred`

Training classifiers with `scPred`

Cell classification

Advanced options

Accessing classifiers

Using a different prediction model approach

Avoid aligning the data multiple times

Using a different probability threshold

Parallel training

Applying `scPred` classifiers without `Seurat` object

Session information

R Package Documentation

Browse R Packages

We want your feedback!

powellgenomicslab/scPred scPred - accurate supervised method for cell-type classification from single-cell RNA-seq data

In powellgenomicslab/scPred: scPred - accurate supervised method for cell-type classification from single-cell RNA-seq data

Summary

Application of scPred

Training classifiers with scPred

Cell classification

Advanced options

Accessing classifiers

Using a different prediction model approach

Avoid aligning the data multiple times

Using a different probability threshold

Parallel training

Applying scPred classifiers without Seurat object

Session information

R Package Documentation

Browse R Packages

We want your feedback!

powellgenomicslab/scPred
scPred - accurate supervised method for cell-type classification from single-cell RNA-seq data

Application of `scPred`

Training classifiers with `scPred`

Applying `scPred` classifiers without `Seurat` object