classifySingleR: Classify cells with SingleR



Description

Assign labels to each cell in a test dataset, using a pre-trained classifier combined with an iterative fine-tuning approach.

Usage

classifySingleR(
  test,
  trained,
  quantile = 0.8,
  fine.tune = TRUE,
  tune.thresh = 0.05,
  sd.thresh = NULL,
  prune = TRUE,
  assay.type = "logcounts",
  check.missing = TRUE,
  BPPARAM = SerialParam()
)

Arguments

test

A numeric matrix of single-cell expression values where rows are genes and columns are cells.

Alternatively, a SummarizedExperiment object containing such a matrix.

trained

A List containing the output of the trainSingleR function. Alternatively, a List of Lists produced by trainSingleR for multiple references.

quantile

A numeric scalar specifying the quantile of the correlation distribution to use to compute the score for each label.

fine.tune

A logical scalar indicating whether fine-tuning should be performed.

tune.thresh

A numeric scalar specifying the maximum difference from the maximum correlation to use in fine-tuning.

sd.thresh

A numeric scalar specifying the threshold on the standard deviation, for use in gene selection during fine-tuning. This is only used if genes="sd" when constructing trained and defaults to the value used in trainSingleR.

prune

A logical scalar indicating whether label pruning should be performed.

assay.type

Integer scalar or string specifying the matrix of expression values to use if test is a SummarizedExperiment.

check.missing

Logical scalar indicating whether rows should be checked for missing values (and if found, removed).

BPPARAM

A BiocParallelParam object specifying the parallelization scheme to use.

Details

Consider each cell in the test set test and each label in the training set. We compute Spearman's rank correlations between the test cell and all cells in the training set with the given label, based on the expression profiles of the genes selected by trained. The score is defined as the quantile of the distribution of correlations, as specified by quantile. (Technically, we avoid explicitly computing all correlations by using a nearest neighbor search, but the resulting score is the same.) After repeating this across all labels, the label with the highest score is used as the prediction for that cell.
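
The scoring step can be illustrated with a brute-force sketch in base R. This is not the package's implementation (which uses a nearest neighbor search, as noted above); the simulated data and the names profiles, ref, ref.labels, test.cell and score_labels are all hypothetical and exist only for this illustration.

```r
# Illustrative sketch only: brute-force scoring of ONE test cell.
# None of these objects or helpers are part of the SingleR API.
set.seed(100)
ngenes <- 200

# One mean expression profile per label, plus noisy reference cells.
profiles <- matrix(rnorm(ngenes * 3), nrow = ngenes,
                   dimnames = list(NULL, c("A", "B", "C")))
ref.labels <- rep(colnames(profiles), each = 4)
ref <- profiles[, ref.labels] +
    matrix(rnorm(ngenes * 12, sd = 0.5), nrow = ngenes)

# A test cell simulated from the profile of label "B".
test.cell <- profiles[, "B"] + rnorm(ngenes, sd = 0.5)

score_labels <- function(test.cell, ref, ref.labels, quantile = 0.8) {
    by.label <- split(seq_along(ref.labels), ref.labels)
    vapply(by.label, function(idx) {
        # Spearman correlation against every reference cell with this label.
        rho <- cor(test.cell, ref[, idx, drop = FALSE], method = "spearman")
        # The score is the requested quantile of those correlations.
        stats::quantile(as.numeric(rho), probs = quantile, names = FALSE)
    }, numeric(1))
}

scores <- score_labels(test.cell, ref, ref.labels)
best <- names(scores)[which.max(scores)]
```

Here best holds the label with the highest score, which is the prediction for the cell before any fine-tuning.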

If fine.tune=TRUE, an additional fine-tuning step is performed for each cell to improve resolution. We identify all labels with scores that are no more than tune.thresh below the maximum score. These labels are used to identify a fresh set of marker genes, and the calculation of the score is repeated using only these genes. The aim is to refine the choice of markers and reduce noise when distinguishing between closely related labels. The best and next-best scores are reported in the output for use in diagnostics, e.g., pruneScores.
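
One round of the candidate selection described above might look like the following base R sketch. The score values and the nested markers list are made up for illustration, and the pairwise layout of markers is an assumption rather than the actual structure stored in trained.

```r
# Illustrative sketch only; 'scores' and 'markers' are made-up inputs and
# are not taken from the SingleR API.
scores <- c(A = 0.62, B = 0.60, C = 0.41, D = 0.58)
tune.thresh <- 0.05

# Keep labels whose score is no more than 'tune.thresh' below the maximum.
candidates <- names(scores)[scores >= max(scores) - tune.thresh]

# Hypothetical pairwise markers: markers[[a]][[b]] holds genes that
# distinguish label 'a' from label 'b'.
labels <- names(scores)
markers <- lapply(labels, function(a) {
    others <- setdiff(labels, a)
    setNames(lapply(others, function(b) paste0("gene_", a, "_vs_", b)), others)
})
names(markers) <- labels

# Fresh gene set: union of markers over all pairs of candidate labels.
tuning.genes <- unique(unlist(lapply(candidates, function(a) {
    markers[[a]][setdiff(candidates, a)]
})))
```

In this sketch, the scores would then be recomputed for the candidate labels using only tuning.genes, so that markers involving the clearly unrelated label "C" no longer contribute noise.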

The default assay.type is set to "logcounts" simply for consistency with trainSingleR. In practice, the raw counts (for UMI data) or the transcript counts (for read count data) can also be used without normalization and log-transformation. Any monotonic transformation will have no effect on the calculation of the correlation values, other than some minor differences due to numerical precision.
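
The invariance to monotonic transformations can be checked directly in base R. The sketch below uses simulated Poisson counts, and a simple library-size scaling followed by log1p as a stand-in for a log-normalization step; it does not reproduce the exact calculation performed by any particular normalization function.

```r
# Illustrative sketch only: Spearman correlations depend on ranks, so a
# strictly increasing per-cell transformation leaves them unchanged.
set.seed(42)
counts <- matrix(rpois(2000, lambda = 5), ncol = 2)  # 1000 genes x 2 cells

raw <- cor(counts[, 1], counts[, 2], method = "spearman")

# Stand-in for log-normalization: scale each cell, then log-transform.
size.factors <- colSums(counts) / mean(colSums(counts))
logged <- log1p(sweep(counts, 2, size.factors, "/"))
transformed <- cor(logged[, 1], logged[, 2], method = "spearman")

same <- isTRUE(all.equal(raw, transformed))
```

The two correlations should agree, because the within-cell ranks of the genes are the same before and after the transformation.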

If prune=TRUE, label pruning is performed as described in pruneScores with default arguments. This aims to remove low-quality labels that are ambiguous or correspond to misassigned cells. However, the default settings can be somewhat aggressive and discard otherwise useful labels in some cases - see ?pruneScores for details.
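
If the default pruning is too aggressive for a given dataset, one option is to disable it and call pruneScores separately. The sketch below assumes that pruneScores accepts an nmads argument and returns a logical vector marking the cells to prune; the value of 5 is an arbitrary choice for illustration, so check ?pruneScores for the arguments available in the installed version.

```r
library(SingleR)

# Mock data, set up as in the Examples section.
ref <- .mockRefData()
test <- .mockTestData(ref)
ref <- scuttle::logNormCounts(ref)
test <- scuttle::logNormCounts(test)
trained <- trainSingleR(ref, labels = ref$label)

# Skip the built-in pruning, then prune by hand with a more lenient
# threshold (nmads = 5 is an arbitrary value chosen for illustration).
pred <- classifySingleR(test, trained, prune = FALSE)
to.prune <- pruneScores(pred, nmads = 5)

custom.pruned <- pred$labels
custom.pruned[to.prune] <- NA_character_
table(pruned = to.prune)
```

Keeping the unpruned pred$labels alongside custom.pruned makes it easy to compare different thresholds without rerunning the classification.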

If trained was generated from multiple references, the per-reference statistics are combined into a single DataFrame of results. This is done using combineRecomputedResults if recompute=TRUE in trainSingleR, otherwise it is done using combineCommonResults.

Value

A DataFrame where each row corresponds to a cell in test. In the case of a single reference, this contains:

The metadata of the DataFrame contains:

In the case of multiple references, the output of combineCommonResults or combineRecomputedResults is returned, depending on whether recompute=TRUE when constructing trained. This is a DataFrame containing:

See ?combineCommonResults and ?combineRecomputedResults for more details.

Author(s)

Aaron Lun, based on the original SingleR code by Dvir Aran.

See Also

trainSingleR, to prepare the training set for classification.

pruneScores, to remove low-quality labels based on the scores.

combineCommonResults, to combine results from multiple references.

Examples

library(SingleR)

# Mocking up data with log-normalized expression values:
ref <- .mockRefData()
test <- .mockTestData(ref)

ref <- scuttle::logNormCounts(ref)
test <- scuttle::logNormCounts(test)

# Setting up the training:
trained <- trainSingleR(ref, labels=ref$label)

# Performing the classification:
pred <- classifySingleR(test, trained)
table(predicted=pred$labels, truth=test$label)

SingleR documentation built on Feb. 4, 2021, 2:01 a.m.