R/pruneScores.R
In SingleR: Reference-Based Single-Cell RNA-Seq Annotation

Documented in pruneScores

#' Prune out low-quality assignments
#'
#' Remove low-quality assignments based on the cell-label score matrix returned by \code{\link{classifySingleR}}.
#'
#' @param results A \linkS4class{DataFrame} containing the output generated by \code{\link{SingleR}} or \code{\link{classifySingleR}}.
#' @param nmads Numeric scalar specifying the number of MADs to use for defining low outliers in the per-label distribution of delta values (i.e., difference from median).
#' @param min.diff.med Numeric scalar specifying the minimum acceptable delta for each cell.
#' @param min.diff.next Numeric scalar specifying the minimum difference between the best score and the next best score in fine-tuning.
#' @param get.thresholds Logical scalar indicating whether the per-label thresholds on the deltas should be returned.
#' 
#' @return
#' A logical vector is returned by default, specifying which assignments in \code{results} should be ignored.
#'
#' If \code{get.thresholds=TRUE}, a numeric vector is returned containing the per-label thresholds on the deltas, as defined using the outlier-based approach with \code{nmads}.
#'
#' @details
#' By itself, the SingleR algorithm will always assign a label to every cell.
#' This occurs even if the cell's true label is not represented in the reference set of labels,
#' resulting in assignment of an incorrect label to that cell.
#' The \code{pruneScores} function aims to mitigate this effect by removing poor-quality assignments with \dQuote{low} scores.
#'
#' We compute a \dQuote{delta} value for each cell, defined as the difference between the score for the assigned label and the and median score across all labels.
#' If the delta is small, this indicates that the cell matches all labels with the same confidence such that the assigned label is not particularly meaningful.
#' The aim is to discard low delta values caused by (i) ambiguous assignments with closely related reference labels and (ii) incorrect assignments that match poorly to all reference labels.
#' 
#' We use an outlier-based approach to obtain a minimum threshold for filtering \dQuote{low} delta values.
#' For each (pre-fine-tuning) label, we obtain a distribution of deltas across all assigned cells.
#' Cells that are more than \code{nmads} below the median score for each label are ignored.
#' This assumes that most cells are correctly assigned to their true label and that cells of the same label have a unimodal distribution of delta values.
#' 
#' Filtering on outliers is useful as it adapts to the spread and scale of delta values.
#' For example, references with many closely related cell types will naturally yield lower deltas.
#' By comparison, references with more distinct cell types would yield large deltas, even for cells that have no representative type in the reference and are incorrectly assigned to the next-most-related label.
#' The outlier definition procedure adjusts naturally to these situations.
#'
#' The default \code{nmads} is motivated by the fact that, for a normal distribution, 99\% of observations lie within 3 standard deviations from the mean.
#' Smaller values for \code{nmads} will increase the stringency of the pruning.
#'
#' @section Applying a hard filter on the deltas: 
#' If \code{min.diff.med} is specified, cells with deltas below this threshold are discarded.
#' This is provided as an alternative filtering approach if the assumptions of outlier detection are violated.
#' For example, if one label is consistently missassigned, the incorrect assignments would not be pruned.
#' In such cases, one could set a threshold with \code{min.diff.med} to forcibly remove low-scoring cells.
#'
#' It is possible for the per-label delta distribution to be multimodal yet still correct,
#' e.g., due to cells belonging to subtypes nested within a main type label.
#' This violates the unimodal assumption mentioned above for the outlier detection.
#' In such cases, it may be better to set \code{nmads=Inf} and rely on \code{min.diff.med} for filtering instead. 
#'
#' Note that the deltas do not consider the effects of fine-tuning as scores are not comparable across different fine-tuning steps.
#' In situations involving a majority of labels with only subtle distinctions, it is possible for the scores to be relatively similar but for the labels to be correctly assigned after fine-tuning.
#' While outlier detection will automatically adapt to smaller scores, this effect should be considered if a threshold needs to be manually chosen for use in \code{min.diff.med}. 
#'
#' @section Filtering on fine-tuning scores:
#' If fine-tuning was performed to generate \code{results},
#' we ignore any cell for which the fine-tuning score is not more than \code{min.diff.next} greater than the next best score.
#' This aims to only retain labels for which there is no ambiguity in assignment,
#' especially when some labels have similar scores because they are closely related (and thus easily confused).
#' 
#' Typical values of \code{min.diff.next} woud lie between [0, 0.1].
#' That said, the \code{min.diff.next} cutoff can be harmful in some applications involving highly related labels.
#' From a user perspective, any confusion between these labels may not be a problem as the assignment is broadly correct;
#' however, the best and next best scores will be very close and cause the labels to be unnecessarily discarded.
#' 
#' @author Aaron Lun and Daniel Bunis
#'
#' @seealso
#' \code{\link{classifySingleR}}, to generate \code{results}.
#'
#' \code{\link{getDeltaFromMedian}}, to compute the per-cell deltas.
#' 
#' @examples
#' # Running the SingleR() example.
#' example(SingleR, echo=FALSE)
#'
#' summary(pruneScores(pred))
#' pruneScores(pred, get.thresholds=TRUE)
#'
#' # Less stringent:
#' summary(pruneScores(pred, min.diff.med=0))
#' summary(pruneScores(pred, nmads=5))
#' 
#' # More stringent:
#' summary(pruneScores(pred, min.diff.med=0.1))
#' summary(pruneScores(pred, nmads=2))
#' summary(pruneScores(pred, min.diff.next=0.1))
#' 
#' @export
#' @importFrom stats median mad
pruneScores <- function(results, nmads=3, min.diff.med=-Inf, min.diff.next=0, get.thresholds=FALSE) {
    delta <- getDeltaFromMedian(results)
    keep <- delta >= min.diff.med

    tune.scores <- results$tuning.scores
    if (!is.null(tune.scores)) {
        keep <- keep & (tune.scores$first - tune.scores$second) >= min.diff.next
    }

    # Ignoring the fine-tuning when allocating cells to labels.
    labels <- results$labels
    by.label <- split(which(keep), labels[keep])

    thresholds <- list()
    for (l in names(by.label)) {
        idx <- by.label[[l]]
        current <- delta[idx]

        med <- median(current)
        MAD <- mad(current, center=med)
        thresholds[[l]] <- curthresh <- med - nmads * MAD

        keep[idx] <- keep[idx] & current >= curthresh
    }

    if (get.thresholds) {
        unlist(thresholds) 
    } else {
        !keep
    }
}