pruneScores: Prune out low-quality assignments
In SingleR: Reference-Based Single-Cell RNA-Seq Annotation

Description Usage Arguments Details Value Applying a hard filter on the deltas Filtering on fine-tuning scores Author(s) See Also Examples

View source: R/pruneScores.R

Remove low-quality assignments based on the cell-label score matrix returned by classifySingleR.

pruneScores(
  results,
  nmads = 3,
  min.diff.med = -Inf,
  min.diff.next = 0,
  get.thresholds = FALSE
)

`results`	A DataFrame containing the output generated by `SingleR` or `classifySingleR`.
`nmads`	Numeric scalar specifying the number of MADs to use for defining low outliers in the per-label distribution of delta values (i.e., difference from median).
`min.diff.med`	Numeric scalar specifying the minimum acceptable delta for each cell.
`min.diff.next`	Numeric scalar specifying the minimum difference between the best score and the next best score in fine-tuning.
`get.thresholds`	Logical scalar indicating whether the per-label thresholds on the deltas should be returned.

By itself, the SingleR algorithm will always assign a label to every cell. This occurs even if the cell's true label is not represented in the reference set of labels, resulting in assignment of an incorrect label to that cell. The pruneScores function aims to mitigate this effect by removing poor-quality assignments with “low” scores.

We compute a “delta” value for each cell, defined as the difference between the score for the assigned label and the and median score across all labels. If the delta is small, this indicates that the cell matches all labels with the same confidence such that the assigned label is not particularly meaningful. The aim is to discard low delta values caused by (i) ambiguous assignments with closely related reference labels and (ii) incorrect assignments that match poorly to all reference labels.

We use an outlier-based approach to obtain a minimum threshold for filtering “low” delta values. For each (pre-fine-tuning) label, we obtain a distribution of deltas across all assigned cells. Cells that are more than nmads below the median score for each label are ignored. This assumes that most cells are correctly assigned to their true label and that cells of the same label have a unimodal distribution of delta values.

Filtering on outliers is useful as it adapts to the spread and scale of delta values. For example, references with many closely related cell types will naturally yield lower deltas. By comparison, references with more distinct cell types would yield large deltas, even for cells that have no representative type in the reference and are incorrectly assigned to the next-most-related label. The outlier definition procedure adjusts naturally to these situations.

The default nmads is motivated by the fact that, for a normal distribution, 99% of observations lie within 3 standard deviations from the mean. Smaller values for nmads will increase the stringency of the pruning.

A logical vector is returned by default, specifying which assignments in results should be ignored.

If get.thresholds=TRUE, a numeric vector is returned containing the per-label thresholds on the deltas, as defined using the outlier-based approach with nmads.

If min.diff.med is specified, cells with deltas below this threshold are discarded. This is provided as an alternative filtering approach if the assumptions of outlier detection are violated. For example, if one label is consistently missassigned, the incorrect assignments would not be pruned. In such cases, one could set a threshold with min.diff.med to forcibly remove low-scoring cells.

It is possible for the per-label delta distribution to be multimodal yet still correct, e.g., due to cells belonging to subtypes nested within a main type label. This violates the unimodal assumption mentioned above for the outlier detection. In such cases, it may be better to set nmads=Inf and rely on min.diff.med for filtering instead.

Note that the deltas do not consider the effects of fine-tuning as scores are not comparable across different fine-tuning steps. In situations involving a majority of labels with only subtle distinctions, it is possible for the scores to be relatively similar but for the labels to be correctly assigned after fine-tuning. While outlier detection will automatically adapt to smaller scores, this effect should be considered if a threshold needs to be manually chosen for use in min.diff.med.

If fine-tuning was performed to generate results, we ignore any cell for which the fine-tuning score is not more than min.diff.next greater than the next best score. This aims to only retain labels for which there is no ambiguity in assignment, especially when some labels have similar scores because they are closely related (and thus easily confused).

Typical values of min.diff.next woud lie between [0, 0.1]. That said, the min.diff.next cutoff can be harmful in some applications involving highly related labels. From a user perspective, any confusion between these labels may not be a problem as the assignment is broadly correct; however, the best and next best scores will be very close and cause the labels to be unnecessarily discarded.

Aaron Lun and Daniel Bunis

classifySingleR, to generate results.

getDeltaFromMedian, to compute the per-cell deltas.

# Running the SingleR() example.
example(SingleR, echo=FALSE)

summary(pruneScores(pred))
pruneScores(pred, get.thresholds=TRUE)

# Less stringent:
summary(pruneScores(pred, min.diff.med=0))
summary(pruneScores(pred, nmads=5))

# More stringent:
summary(pruneScores(pred, min.diff.med=0.1))
summary(pruneScores(pred, nmads=2))
summary(pruneScores(pred, min.diff.next=0.1))