classifySingleR: Classify cells with SingleR
In SingleR: Reference-Based Single-Cell RNA-Seq Annotation


Assign labels to each cell in a test dataset, using a pre-trained classifier combined with an iterative fine-tuning approach.

classifySingleR(
  test,
  trained,
  quantile = 0.8,
  fine.tune = TRUE,
  tune.thresh = 0.05,
  sd.thresh = NULL,
  prune = TRUE,
  assay.type = "logcounts",
  check.missing = TRUE,
  BPPARAM = SerialParam()
)

`test`	A numeric matrix of single-cell expression values where rows are genes and columns are cells. Alternatively, a SummarizedExperiment object containing such a matrix.
`trained`	A List containing the output of the `trainSingleR` function. Alternatively, a List of Lists produced by `trainSingleR` for multiple references.
`quantile`	A numeric scalar specifying the quantile of the correlation distribution to use to compute the score for each label.
`fine.tune`	A logical scalar indicating whether fine-tuning should be performed.
`tune.thresh`	A numeric scalar specifying the maximum difference from the maximum correlation to use in fine-tuning.
`sd.thresh`	A numeric scalar specifying the threshold on the standard deviation, for use in gene selection during fine-tuning. This is only used if `genes="sd"` when constructing `trained` and defaults to the value used in `trainSingleR`.
`prune`	A logical scalar indicating whether label pruning should be performed.
`assay.type`	Integer scalar or string specifying the matrix of expression values to use if `test` is a SummarizedExperiment.
`check.missing`	Logical scalar indicating whether rows should be checked for missing values (and if found, removed).
`BPPARAM`	A BiocParallelParam object specifying the parallelization scheme to use.

Consider each cell in the test set `test` and each label in the training set. We compute Spearman's rank correlations between the test cell and all cells in the training set with the given label, based on the expression profiles of the genes selected by `trained`. The score is defined as the quantile of the distribution of correlations, as specified by `quantile`. (Technically, we avoid explicitly computing all correlations by using a nearest neighbor search, but the resulting score is the same.) After repeating this across all labels, the label with the highest score is used as the prediction for that cell.
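The scoring scheme described above can be sketched in base R. This is only an illustration on simulated data, not the package's implementation (which uses a nearest neighbor search rather than computing every correlation); the names `ref`, `ref.labels`, `profile.A`, `profile.B` and `test.cell` are hypothetical.

```r
# Minimal base-R sketch of the per-label scoring; all data here are simulated.
set.seed(1)
n.genes <- 50

# Hypothetical reference: 10 cells per label, genes in rows, cells in columns.
ref.labels <- rep(c("A", "B"), each = 10)
profile.A <- rnorm(n.genes)
profile.B <- rnorm(n.genes)
ref <- cbind(
  matrix(profile.A + rnorm(n.genes * 10, sd = 0.5), nrow = n.genes),
  matrix(profile.B + rnorm(n.genes * 10, sd = 0.5), nrow = n.genes)
)

# One test cell simulated from the profile of label "A".
test.cell <- profile.A + rnorm(n.genes, sd = 0.5)

# Spearman correlation of the test cell against every reference cell.
rho <- as.vector(cor(test.cell, ref, method = "spearman"))

# Score per label: the chosen quantile of that label's correlations.
scores <- tapply(rho, ref.labels, quantile, probs = 0.8)

# The label with the highest score is taken as the prediction.
predicted <- names(scores)[which.max(scores)]
```

Using a high quantile rather than the maximum or the mean makes the score less sensitive to a single unusually similar reference cell while still rewarding labels that contain a well-matching subpopulation.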

If `fine.tune=TRUE`, an additional fine-tuning step is performed for each cell to improve resolution. We identify all labels with scores that are no more than `tune.thresh` below the maximum score. These labels are used to identify a fresh set of marker genes, and the calculation of the score is repeated using only these genes. The aim is to refine the choice of markers and reduce noise when distinguishing between closely related labels. The best and next-best scores are reported in the output for use in diagnostics, e.g., `pruneScores`.
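As a rough illustration of how `tune.thresh` selects the labels that enter fine-tuning, the base-R sketch below filters a hypothetical score vector for a single cell. The label names and score values are invented for the example, and the subsequent marker re-selection and rescoring are not shown.

```r
# Hypothetical scores for one cell across four labels.
scores <- c(Tcell.CD4 = 0.62, Tcell.CD8 = 0.60, Bcell = 0.41, Monocyte = 0.30)
tune.thresh <- 0.05

# Keep the labels whose score is no more than tune.thresh below the maximum.
candidates <- names(scores)[scores >= max(scores) - tune.thresh]

# Only these candidates would be carried into the next round of scoring,
# using marker genes chosen to distinguish between them.
candidates
```

A larger `tune.thresh` admits more labels into fine-tuning, while a smaller value restricts it to the closest competitors.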

The default `assay.type` is set to "logcounts" simply for consistency with `trainSingleR`. In practice, the raw counts (for UMI data) or the transcript counts (for read count data) can also be used without normalization and log-transformation. Any monotonic transformation will have no effect on the calculation of the correlation values, other than some minor differences due to numerical precision.
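The invariance of rank correlations to monotonic transformations can be checked directly in base R. The sketch below uses simulated counts for two hypothetical cells; the size factor of 1.7 is an arbitrary value chosen for illustration.

```r
set.seed(42)

# Simulated counts for one test cell and one reference cell over 200 genes.
counts.test <- rpois(200, lambda = 5)
counts.ref <- counts.test + rpois(200, lambda = 2)

# Spearman correlation on the raw counts.
rho.raw <- cor(counts.test, counts.ref, method = "spearman")

# Same correlation after scaling by a size factor and log-transforming,
# which is a strictly increasing transformation of the test cell's values.
rho.log <- cor(log2(counts.test / 1.7 + 1), counts.ref, method = "spearman")

# The two values agree because the ranks within the cell are unchanged.
all.equal(rho.raw, rho.log)
```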

If `prune=TRUE`, label pruning is performed as described in `pruneScores` with default arguments. This aims to remove low-quality labels that are ambiguous or correspond to misassigned cells. However, the default settings can be somewhat aggressive and may discard otherwise useful labels in some cases; see `?pruneScores` for details.

If `trained` was generated from multiple references, the per-reference statistics are combined into a single DataFrame of results. This is done using `combineRecomputedResults` if `recompute=TRUE` in `trainSingleR`; otherwise, `combineCommonResults` is used.

A DataFrame where each row corresponds to a cell in `test`. In the case of a single reference, this contains:

`scores`, a numeric matrix of correlations at the specified quantile for each label (column) in each cell (row). This will contain NAs if multiple references were supplied to `trainSingleR` with `recompute=TRUE`.
`first.labels`, a character vector containing the predicted label before fine-tuning. Only added if `fine.tune=TRUE`.
`tuned.scores`, a DataFrame containing `first` and `second`. These are numeric vectors containing the best and next-best scores at the final round of fine-tuning for each cell. Only added if `fine.tune=TRUE`.
`labels`, a character vector containing the predicted label based on the maximum entry in `scores`.
`pruned.labels`, a character vector containing the pruned labels where “low-quality” labels are replaced with NAs. Only added if `prune=TRUE`.

The metadata of the DataFrame contains:

`common.genes`, a character vector of genes used to compute the correlations prior to fine-tuning.
`de.genes`, a list of lists of genes used to distinguish between each pair of labels. Only returned if `genes="de"` when constructing `trained`; see `?trainSingleR` for more details.
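The fields and metadata described above might be inspected along the following lines. This is a sketch that assumes the SingleR, scuttle and S4Vectors packages are installed and that the output has the format documented on this page; it reuses the mock data helpers from the Examples.

```r
library(SingleR)

# Mock data as used in the Examples of this page.
ref <- scuttle::logNormCounts(.mockRefData())
test <- scuttle::logNormCounts(.mockTestData(ref))

trained <- trainSingleR(ref, labels = ref$label)
pred <- classifySingleR(test, trained)

# Per-label scores: one row per cell, one column per label.
dim(pred$scores)

# Cells whose labels were discarded by pruning appear as NA.
table(pruned = is.na(pred$pruned.labels))

# Genes used to compute the correlations before fine-tuning.
head(S4Vectors::metadata(pred)$common.genes)
```

Comparing `labels` against `pruned.labels` in this way gives a quick sense of how many assignments the default pruning settings removed.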

In the case of multiple references, the output of `combineCommonResults` or `combineRecomputedResults` is returned, depending on whether `recompute=TRUE` when constructing `trained`. This is a DataFrame containing:

`scores`, a numeric matrix of scores for each cell (row) across all labels in all references (columns). This will contain NAs if recomputation is performed.
`labels`, `first.labels` (if `fine.tune=TRUE`) and `pruned.labels` (if `prune=TRUE`), containing the consolidated labels of varying flavors as described above.
`orig.results`, a DataFrame of DataFrames containing the results of running `classifySingleR` against each individual reference. Each nested DataFrame has the same format as described above.

See ?combineCommonResults and ?combineRecomputedResults for more details.

Aaron Lun, based on the original SingleR code by Dvir Aran.

trainSingleR, to prepare the training set for classification.

pruneScores, to remove low-quality labels based on the scores.

combineCommonResults, to combine results from multiple references.

library(SingleR)

# Mocking up data with log-normalized expression values:
ref <- .mockRefData()
test <- .mockTestData(ref)

ref <- scuttle::logNormCounts(ref)
test <- scuttle::logNormCounts(test)

# Setting up the training:
trained <- trainSingleR(ref, labels=ref$label)

# Performing the classification:
pred <- classifySingleR(test, trained)
table(predicted=pred$labels, truth=test$label)