knitr::opts_chunk$set( fig.width = 6, fig.height = 4.8, cache = TRUE )
scPred
is a general method to classify cells based on a low-dimensional
representation of gene expression (e.g. PCA).
For more details see our paper in Genome Biology:
scPred: accurate supervised method for cell-type classification from single-cell RNA-seq data
scPred
First, we'll load the scPred
package and Seurat
.
library("scPred") library("Seurat") library("magrittr")
We will work with Peripheral Mononuclear Blood Cells (PBMCs) from two different individuals. The libraries were processed using the Chromium system -10× Genomics- and sequenced with an Illumina NovaSeq 6000 platform. See Comparative performance of the BGI and Illumina sequencing technology for single-cell RNA-sequencing for more details on the samples.
For this tutorial, we'll use the PBMCs from one individual to build cell classifiers for the populations of interest. Then, we'll apply these models to an indepent dataset of PBMCs from another independent individual.
reference <- scPred::pbmc_1 query <- scPred::pbmc_2
scPred
is now built to be incorporated withing the Seurat framework.
Similar to clustering in Seurat
, scPred
uses the cell embeddings from a
principal component analysis to make inferences about cell-type identity.
However —unlike clustering—, scPred
trains classifiers for each cell type of
interest in a supervised manner by using the known cell identity from a
reference dataset to guide the classification of cells in a different data set.
The following code:
reference <- reference %>% NormalizeData() %>% FindVariableFeatures() %>% ScaleData() %>% RunPCA() %>% RunUMAP(dims = 1:30)
The column cell_type
contains the identity of each cell in the meta data slot.
Let's plot the UMAP and grouping the cells by cell type.
DimPlot(reference, group.by = "cell_type", label = TRUE, repel = TRUE)
scPred
Firstly, let's get the feature space to train the classifiers. By default, scPred
will use all principal components. The reference labels of each cells are
specified as the second parameter value of the function (in this case the
cell_type
column.
getFeatureSpace
will create a scPred
object stored in the @misc
slot. This
object will contained all required information to classify cells.
See ?getFeatureSpace
help documentation.
reference <- getFeatureSpace(reference, "cell_type")
Secondly, we train the classifiers for each cell using the trainModel
function.
By default, scPred
will use a support vector machine with a radial kernel.
reference <- trainModel(reference)
Training probabilities for each cell in the reference data can be accessed using
the get_probabilities
method:
get_probabilities(reference) %>% head()
We can use the get_scpred
method to retrieve the scPred
object from the
Seurat
object. Printing a scPred
object will show for each cell type:
get_scpred(reference)
To visualize the performance for each cell type we can use the plot_probabilities
function:
plot_probabilities(reference)
From the previous plot we can observe an overall lower performance for classical monocytes (cMono) and non-classical monocytes (ncMono).
Depending on the data, other models may show an better performance.
scPred
is built on top of the caret
package and allows using a large
set of prediction models (e.g. logistic regression, decision trees,
bagging, neural networks, etc). To see the list of available models see
available models in caret.
A different model can be specified using the model
parameter and providing the
method value from caret
(e.g. mda
for a mixture discriminant analysis
using the mda
package). Additionally, if only an mda
model wants to be applied
to a subset of cells, we can specify this using the reclassify
parameter. In this
case, we want to train different models for "cMono" and "ncMono" to improve their
classification performance:
reference <- trainModel(reference, model = "mda", reclassify = c("cMono", "ncMono"))
The code above trains a mixture discriminant analysis for two cell types and preserves the previous support vector machines for the remaing cell types.
We can observe a change in the sensitivity for "cMono" and "ncMono":
get_scpred(reference)
and also verify that higher probabilities for these cell types were obtained by plotting the training probabilities again:
plot_probabilities(reference)
An important requirement for classifying cells is using the
same normalization method for both the reference
and the query
datasets.
First, let's normalize the query dataset (cells to be classfied).
query <- NormalizeData(query)
Finally, we ca classify the cells from the query
data using the scPredict
function. The first argument corresponds to the query
object and the second to
the reference
object (with a scPred model trained already).
scPred
now uses Harmony
to align the query data onto the training low-dimensional
space used as reference. Once the data is aligned, cells are classified using
the pre-trained models.
scPredict
will return the query dataset. Make sure the left-side value of the<-
operator corresponds to the query data.
query <- scPredict(query, reference)
scPred
will store the final classifications in the scpred_prediction
column
of the Seurat meta data. Likewise, it will store a the aligned data and store
it as a scpred
reduction.
Let's plot the classifications over the aligned data.
DimPlot(query, group.by = "scpred_prediction", reduction = "scpred")
We can also run UMAP using the aligned data as an input
query <- RunUMAP(query, reduction = "scpred", dims = 1:30)
and plot the predicted labels for each cell type over the UMAP:
DimPlot(query, group.by = "scpred_prediction", label = TRUE, repel = TRUE)
We can compare the results with the original labels:
DimPlot(query, group.by = "cell_type", label = TRUE, repel = TRUE)
Additionally, scPred
stores the probabilities of each cell in the @meta.data
slot of the query Seurat object. We can visualize the probabilities over the
UMAP plot:
FeaturePlot(query, c("scpred_B.cell", "scpred_CD4.T.cell", "scpred_CD8.T.cell", "scpred_cMono", "scpred_ncMono", "scpred_Plasma.cell", "scpred_cDC", "scpred_pDC"))
To verify the performance of the models in the query
dataset, we can use the
crossTab
to create a contingency table using two colums from the metadata. In
this example, the cell type info is sotred in the cell_type
columns and the
predicted labels for each cell in the scpred_prediction
column.
crossTab(query, "cell_type", "scpred_prediction")
The proportion of cells can be obtained using output = "prop
crossTab(query, "cell_type", "scpred_prediction", output = "prop")
The raw models for each cell type can be retrieved using the get_classifiers()
function. This will return a list of train
objects.
get_classifiers(reference)
Each model can be normally treated using the caret
enviroment. For example,
we can plot the performance resamples using the plot.train
:
caret::plot.train(get_classifiers(reference)[["NK cell"]])
As shown before, a different classification method can be used using the model
parameter by providing a distinct method value of type classification as
handled by caret (see available models in caret).
Most available models will require the user to install new packages.
The following code trains a logistic regression via glm()
for each cell type:
reference <- trainModel(reference, model = "glm") get_scpred(reference)
Training and alignning the data are separate processes. Therefore, if a query
dataset has already being aligned to a reference
data via scPred/harmony and
the prediction models have changed, then we can use recompute_alignment = FALSE
to avoid aligning step (as the alignment is already stored in the query
object)
query <- scPredict(query, reference, recompute_alignment = FALSE)
The code above will only apply the classificatio models.
By default, scPred
now uses a relaxed probability theshold of 0.55
to label
cells. If none of the classifiers provide a probability higher than the threshold
for a given cell, then it is labelled as "unassigned". This value can be changed
using the threshold
parameter:
query <- scPredict(query, reference, recompute_alignment = FALSE, threshold = 0.9)
In the case of a binary classification (only two cell types), a threshold equals
0.5
implies no "unassigned labeling".
Depending on the sample size of the reference dataset and the number of cell types,
training models can be computatinally expensive. The resampling performed for
each model can be parallelized via doParallel
to speed-up the training step as
follows:
library(doParallel) cl <- makePSOCKcluster(2) registerDoParallel(cl) reference <- trainModel(reference, model = "mda", allowParallel = TRUE) stopCluster(cl)
allowParallel = TRUE
has to be set in order fortrainModel
to be able to run the resamplings in parallel
The previous code uses 2 cores.
See Caret parallel processing for more details
scPred
classifiers without Seurat
objectOnce final scPred
models have been obtained, we can extract the scPred
object
from the Seurat
object and apply the classifiers in other datasets.
scpred <- get_scpred(reference) query <- scPredict(query, scpred)
From now on, only the scPred
models can be imported and applied to other query
Seurat
objects/datasets.
options(width = 120) devtools::session_info()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.