knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Summary

scPred is a general method to predict cell types based on variance structure decomposition. It selects the most cell type-informative principal components from a dataset and trains a prediction model for each cell type. The principal training axes are projected onto the test dataset to obtain the PCs scores for the test dataset and the trained model(s) is/are used to classify single cells.

For more details see our pre-print on bioRxiv:

scPred: Single cell prediction using singular value decomposition and machine learning classification

Application of scPred

First, we load the scPred package and tidyverse.

library("scPred")
library("tidyverse")

We will work with single cell data of pluripotent, blood, skin and neural cells sequenced at low coverage. For more details about the study, see Low-coverage single-cell mRNA sequencing reveals cellular heterogeneity and activated signaling pathways in developing cerebral cortex.

The count matrix and metadata may be obtained here from Hemberg's lab.

Read the gene expression data (SingleCellExperiment object), calculate CPM values and extract metadata.

download.file("https://scrnaseq-public-datasets.s3.amazonaws.com/scater-objects/pollen.rds", destfile = "~/Downloads/pollen.rds")
require(SingleCellExperiment)
pollen <- readRDS("~/Downloads/pollen.rds")
pollen_counts <- normcounts(pollen)
pollen_cpm  <- apply(pollen_counts, 2, function(x) (x/sum(x))*1000000)
pollen_metadata <- as.data.frame(colData(pollen))

Let's explore the cell type information

table(pollen_metadata$cell_type2)

A total of r length(pollen_metadata$cell_type2) cells are included in the dataset.

For demostration purposes, we split a gene expression matrix into two groups (train and test datasets) based cell type information using the createDataPartition() function from the caret package (already loaded with scPred).

The train partition will be used to train a prediction models for each cell type and finally, the models will be tested using the test partition.

set.seed(1234)
i <- createDataPartition(pollen_metadata$cell_type2, p = 0.70, list = FALSE)
train_data <- t(pollen_cpm[, i])
test_data <- t(pollen_cpm[, -i])

train_info <- pollen_metadata[i, , drop = FALSE]
test_info <- pollen_metadata[-i, , drop = FALSE]

Training step

Eigendecomposition

The first part of the scPred algorithm consists on decomposing the gene expresion matrix of the training dataset to obtained a low dimensional space that can describe most of the variance of the dataset. The eigenDecompose function calculates the first n principal components and log-transforms the input gene expression values to stabilize the variance. It returns an scPred object.

set.seed(1234)
scp <- eigenDecompose(train_data, n = 10)

Then, we assign the metadata containing the cell type information. Row names in the metadata dataframe must match the row names from the eigendecompsed gene expression matrix.

scPred::metadata(scp) <- train_info

Feature selection

Next, we select the principal components that explain the class identity of each cell type using the getFeatureSpace function. This function applies a Wilcoxcon rank sum test to determine the informative principal components according to a categorical variable variable. In this case, we want to predict the cell types in the cell_type2 columns from the metadata. Run ?getFeatureSpace for more details.

scp <- getFeatureSpace(scp, pVar = "cell_type2")

The features slot contains the principal components that explain the class identity.

All prrincipal components for each cell type are ranked by p-value.

scp@features

We can plot the principal components grouped by the prediction variable using the plotEigen() function

plotEigen(scp, group = "cell_type2")

Model training

We can now train prediction models for blood, dermal, neural, and pluripotent cell types.

scp <- trainModel(scp, seed = 66)

If we print the scPred object we can look at a summary of the slots contained in it.

The four models showed a specificity of 1 and a sensitivity of 0.99 to 1.

scp

We can plot the distribution of probabilities to see the performance of the predictions for each cell class using

The getTrainResults() function extracts the predictions results obtained from the resampling step for training the prediction model.

res <- getTrainResults(scp)

We can plot the calculated probabilities for each cell type versus our cell labels:

mapply(function(x,y){dplyr::rename(x, probability = !! enquo(y))}, res, names(res), SIMPLIFY = FALSE) %>% 
  Reduce(rbind, .) -> train_probabilities

train_probabilities %>% 
    select(object, obs, probability, "other") %>% 
    ggplot() +
    aes(probability, fill = obs) +
    geom_histogram(bins = 30, color = "black") +
    geom_vline(xintercept = 0.9, color = "red") +
  facet_wrap(~object) +
    theme_bw()

In the previous figure we can observe that a threshold of 0.9 classifies all dermal, neural and pluripotent cells correctly and almost all blood cells too. Each panel represents a prediction model and the colors the known true classes. All other cells are cells except the positive class (for example, for the blood prediction model all other cells are either dermal, neural, or pluripotent)

Prediction step

Once the models have been trained they can be applied to predict cell types in other dataset, for this demonstration we'll use the test partition/ scPredict() projects the training principal axes onto the test dataset and predicts the cell identity using the trained models. By default, scPredict() uses a threshold of 0.9 to classify the cells into categories.

predictions <- scPredict(scp, newData = test_data, threshold = 0.9)

scPredict() returns a dataframe with the probabilities of each cell to belong to any of the cell classes. The predClass columns is set using the provided threshold.

predictions

We can plot the probabilities and compare our predictions to the true cell types

predictions %>% 
  mutate(true = test_info$cell_type2) %>% 
  gather(key = "model", value = "probability", 1:4) %>% 
  ggplot() +
  aes(probability, fill = true) +
  geom_histogram(color = "black") +
  geom_vline(xintercept = 0.9, color = "red") +
  facet_wrap(~model) +
  theme_bw() +
  scale_fill_viridis_d()

In the previous plot, each panel represents the prediction model for each cell type with the output distribution probabilities. The colors in all plot represents the "true". We can observe that all blood, neural and pluripotent cells were correctly classified using a threshold of 0.9. Only one dermal cell was labeled as unassigned as it was below the threshold. This cell has a probability of 0.75 of being dermal.

Reproducibility

options(width = 70)
devtools::session_info(include_base = TRUE)


IMB-Computational-Genomics-Lab/scPred documentation built on Jan. 11, 2020, 7:37 a.m.