Defines functions .checkSCE .scDblAddCD .xgbtrain .scDblscore .knnSmooth .evaluateKNN scDblFinder

Documented in scDblFinder

#' scDblFinder
#' Identification of heterotypic (or neotypic) doublets in single-cell RNAseq 
#' using cluster-based generation of artifical doublets.
#' @param sce A \code{\link[SummarizedExperiment]{SummarizedExperiment-class}},
#' \code{\link[SingleCellExperiment]{SingleCellExperiment-class}}, or array of
#' counts.
#' @param artificialDoublets The approximate number of artificial doublets to 
#' create. If \code{NULL}, will be the maximum of the number of cells or 
#' \code{5*nbClusters^2}.
#' @param clusters The optional cluster assignments (if omitted, will run 
#' clustering). This is used to make doublets more efficiently. \code{clusters} 
#' should either be a vector of labels for each cell, or the name of a colData 
#' column of \code{sce}. Alternatively, if it is a single integer, will 
#' determine how many clusters to create (using k-means clustering). This 
#' options should be used when distinct subpopulations are not expected in the 
#' data (e.g. trajectories).
#' @param samples A vector of the same length as cells (or the name of a column 
#' of \code{colData(x)}), indicating to which sample each cell belongs. Here, a 
#' sample is understood as being processed independently. If omitted, doublets 
#' will be searched for with all cells together. If given, doublets will be 
#' searched for independently for each sample, which is preferable if they 
#' represent different captures. If your samples were multiplexed using cell
#' hashes, want you want to give here are the different batches/wells (i.e. 
#' independent captures, since doublets cannot arise across them) rather
#' than biological samples.
#' @param trajectoryMode Logical; whether to generate doublets in trajectory 
#' mode (i.e. for datasets with gradients rather than separated subpopulations).
#' See \code{vignette("scDblFinder")} for more details.
#' @param knownDoublets An optional logical vector of known doublets (e.g. 
#' through cell barcodes), or the name of a colData column of `sce` containing
#' that information.
#' @param use.cxds Logical; whether to use the `cxds` scores in addition to 
#' information from artificial/known doublets as part of the predictors.
#' @param nfeatures The number of top features to use (default 1000)
#' @param dims The number of dimensions used.
#' @param dbr The expected doublet rate. By default this is assumed to be 1\% 
#' per thousand cells captured (so 4\% among 4000 thousand cells), which is 
#' appropriate for 10x datasets. Corrections for homeotypic doublets will be
#' performed on the given rate.
#' @param dbr.sd The uncertainty range in the doublet rate, interpreted as
#' a +/- around `dbr`. During thresholding, deviation from the expected doublet
#' rate will be calculated from these boundaries, and will be considered null 
#' within these boundaries.
#' @param k Number of nearest neighbors (for KNN graph). If more than one value
#' is given, the doublet density will be calculated at each k (and other values
#' at the highest k), and all the information will be used by the classifier.
#' If omitted, a reasonable set of values is used.
#' @param includePCs The index of principal components to include in the 
#' predictors (e.g. `includePCs=1:2`).
#' @param propRandom The proportion of the artificial doublets which 
#' should be made of random cells (as opposed to inter-cluster combinations).
#' @param propMarkers The proportion of features to select based on marker 
#' identification.
#' @param returnType Either "sce" (default), "table" (to return the table of 
#' cell attributes including artificial doublets), or "full" (returns an SCE
#' object containing both the real and artificial cells.
#' @param score Score to use for final classification.
#' @param metric Error metric to optimize during training (e.g. 'merror', 
#' 'logloss', 'auc', 'aucpr').
#' @param nrounds Maximum rounds of boosting. If NULL, will be determined
#' through cross-validation. When the training is based only on simulated 
#' doublets, we generally find lower limits to outperform cross-validation.
#' @param max_depth Maximum depth of decision trees
#' @param iter A positive integer indicating the number of scoring iterations
#' (ignored if `score` isn't based on classifiers). At each iteration, real 
#' cells that would be called as doublets are excluding from the training, and
#' new scores are calculated. Recommended values are 1 or 2.
#' @param threshold Logical; whether to threshold scores into binary doublet 
#' calls
#' @param verbose Logical; whether to print messages and the thresholding plot.
#' @param BPPARAM Used for multithreading when splitting by samples (i.e. when 
#' `samples!=NULL`); otherwise passed to eventual PCA and K/SNN calculations.
#' @param ... further arguments passed to \code{\link{getArtificialDoublets}}.
#' @return The \code{sce} object with several additional colData columns, in 
#' particular `scDblFinder.score` (the final score used) and `scDblFinder.class` 
#' (whether the cell is called as 'doublet' or 'singlet'). See 
#' \code{vignette("scDblFinder")} for more details; for alternative return 
#' values, see the `returnType` argument.
#' @details
#' This function generates artificial doublets from clusters of real cells, 
#' evaluates their prevalence in the neighborhood of each cells, and uses this 
#' along with additional features to classify doublets. The approach is 
#' complementary to doublets identified via cell hashes and SNPs in multiplexed
#' samples: the latter can identify doublets formed by cells of the same type
#' from two samples, which are nearly undistinguishable from real cells 
#' transcriptionally, but cannot identify doublets made by cells of the 
#' same sample. See \code{vignette("scDblFinder")} for more details on the 
#' method.
#' When multiple samples/captures are present, they should be specified using 
#' the \code{samples} argument. Although the classifier will be trained 
#' globally, thresholding and the more computationally-intensive steps will be 
#' performed separately for each sample (in parallel if \code{BPPARAM} is 
#' given).
#' When inter-sample doublets are available, they can be provided to 
#' `scDblFinder` through the \code{knownDoublets} argument to improve the 
#' identification of further doublets.
#' @import SingleCellExperiment BiocParallel xgboost
#' @importFrom SummarizedExperiment colData<- assayNames
#' @importFrom scuttle normalizeCounts
#' @importFrom scater runPCA
#' @importFrom methods is
#' @importFrom DelayedArray as.matrix
#' @importFrom BiocNeighbors findKNN
#' @importFrom BiocSingular IrlbaParam
#' @examples
#' library(SingleCellExperiment)
#' sce <- mockDoubletSCE()
#' sce <- scDblFinder(sce, dbr=0.1)
#' table(truth=sce$type, call=sce$scDblFinder.class)
#' @export
#' @rdname scDblFinder
#' @import SingleCellExperiment
#' @importFrom SummarizedExperiment rowData<-
#' @importFrom BiocParallel SerialParam bpnworkers
scDblFinder <- function( sce, clusters=NULL, samples=NULL, trajectoryMode=FALSE,
                         artificialDoublets=NULL, knownDoublets=NULL, 
                         use.cxds=TRUE, nfeatures=1000, dims=20, dbr=NULL, 
                         dbr.sd=0.015, k=NULL, includePCs=1:5, propRandom=0.1,
                         propMarkers=0, returnType=c("sce","table","full"),
                         metric="aucpr", nrounds=50, max_depth=5, iter=1, 
                         threshold=TRUE, verbose=is.null(samples), 
                         BPPARAM=SerialParam(), ...
  ## check arguments
  sce <- .checkSCE(sce)
  score <- match.arg(score)
  returnType <- match.arg(returnType)
    if(length(clusters)>1 || !is.numeric(clusters))
      clusters <- .checkColArg(sce, clusters)
  knownDoublets <- .checkColArg(sce, knownDoublets)
  samples <- .checkColArg(sce, samples)

  ## if clusters are given, it's more efficient to do feature selection before
  ## eventually splitting the dataset
  if(!is.null(clusters) && length(clusters)>1){
    sel_features <- selFeatures(sce, clusters, nfeatures=nfeatures, 
    sel_features <- row.names(sce)
    ## splitting by samples
        warning("`returnType='full'` ignored when splitting by samples")
    cs <- split(seq_along(samples), samples, drop=TRUE)
    names(nn) <- nn <- names(cs)
    ## run scDblFinder individually, without classification
    d <- bplapply(nn, BPPARAM=BPPARAM, FUN=function(n){ 
      x <- cs[[n]]
      if(!is.null(clusters) && length(clusters)>1) clusters <- clusters[x]
      tryCatch(scDblFinder(sce[sel_features,x], clusters=clusters, 
                  knownDoublets=knownDoublets, dims=dims, dbr=dbr, 
                  dbr.sd=dbr.sd, artificialDoublets=artificialDoublets, k=k, 
                  nfeatures=nfeatures, propRandom=propRandom, includePCs=c(),
                  propMarkers=propMarkers, trajectoryMode=trajectoryMode, 
                  returnType="table", threshold=FALSE, score="weighted", 
                  verbose=FALSE, ...),
                 stop("An error occured while processing sample '",n,"':\n", e)
    ## aggregate the property tables
    cn <- table(unlist(lapply(d, colnames)))
    cn <- names(cn)[cn==length(d)]
    ss <- factor(rep(seq_along(names(d)),vapply(d,nrow,integer(1))), 
                 levels=seq_along(names(d)), labels=names(d))
    d <- do.call(rbind, lapply(d, FUN=function(x){
      x$total.prop.real <- sum(x$type=="real",na.rm=TRUE)/nrow(x)
    d$sample <- ss
    ## score and thresholding
    d <- .scDblscore(d, scoreType=score, threshold=threshold, dbr=dbr, 
                     dbr.sd=dbr.sd, max_depth=max_depth, nrounds=nrounds, 
                     iter=iter, BPPARAM=BPPARAM, verbose=verbose, metric=metric)
    if(returnType=="table") return(d)
    return(.scDblAddCD(sce, d))
  ## Handling a single sample 
    warning("scDblFinder might not work well with very low numbers of cells.")
  if(verbose && ncol(sce)>25000)
    warning("You are trying to run scDblFinder on a very large number of ",
            "cells. If these are from different captures, please specify this",
            " using the `samples` argument.", immediate=TRUE)

  if(is.null(k)){ ## reasonble sets of ks (for KNN)
      k <- c(3,10,20)
      if((kmax <- max(ceiling(sqrt(ncol(sce)/6)),20))>=30) k <- c(k,kmax)
  orig <- sce
  wDbl <- c()
  ## if known doublets are given, we need to treat them separately
  if(!is.null(knownDoublets) && length(wDbl <- which(knownDoublets))>0){
    sce$knownDoublet <- knownDoublets
    sce.dbl <- sce[,wDbl,drop=FALSE]
    sce <- sce[,-wDbl,drop=FALSE]
    if(!is.null(clusters) && length(clusters)>1){
      clusters.dbl <- clusters[wDbl]
      clusters <- clusters[-wDbl]
      if(is.factor(clusters)) clusters <- droplevels(clusters)
  ## clustering (if not already provided)
  if(is.null(clusters) || length(clusters)==1){
    if(verbose) message("Clustering cells...")
      clusters <- fastcluster(sce, ndims=dims, k=clusters, nfeatures=nfeatures,
      clusters <- fastcluster(sce, ndims=dims, nfeatures=nfeatures,
  }else if(trajectoryMode && length(unique(clusters))>1){
    clusters <- list( k=clusters,
                          clusters, BPPARAM=BPPARAM) )
    cl <- clusters$k
    cl <- clusters
  nc <- length(unique(cl))
  if(nc==1) stop("Only one cluster generated. Consider specifying `cluster` ",
                 "(e.g. `cluster=10`)")
  if(verbose) message(nc, " clusters")
  ## feature selection
    sel_features <- selFeatures(sce[sel_features,], cl, nfeatures=nfeatures, 
  sce <- sce[sel_features,]
  if(length(wDbl)>0) sce.dbl <- sce.dbl[sel_features,]
  ## get the artificial doublets
    artificialDoublets <- min( 25000, max(5000,
                                          10*length(unique(cl))^2 ) )
      artificialDoublets <- min(ceiling(artificialDoublets*ncol(sce)),25000)
    message("Creating ~", artificialDoublets, " artifical doublets...")
  ad <- getArtificialDoublets(as.matrix(counts(sce)), n=artificialDoublets, 
                              clusters=clusters, propRandom=propRandom, ...)

  ado <- ad$origins
  ad <- ad$counts
  no <- ncol(sce) + length(wDbl)
  ado2 <- as.factor(c(rep(NA, no), as.character(ado)))
  src <- factor( rep(1:2, c(no,ncol(ad))), labels = c("real","artificial"))
  ctype <- factor( rep(c(1,2,2), c(ncol(sce),length(wDbl),ncol(ad))), 
                   labels=c("real","doublet") )
  if(verbose) message("Dimensional reduction")
  e <- counts(sce)
  if(!is.null(wDbl)) e <- cbind(e, counts(sce.dbl))
  e <- cbind(e, ad[row.names(sce),])
  # evaluate by library size and non-zero features
  lsizes <- Matrix::colSums(e)
  cxds_score <- NULL
  if(use.cxds) cxds_score <- cxds2(e, whichDbls=which(ctype==2))
  nfeatures <- Matrix::colSums(e>0)
  # skip normalization if data is too large
      e <- normalizeCounts(e)
    }, error=function(er){
      warning("Error in calculating norm factors:", er)
  if(is.null(dims)) dims <- 20
  pca <- tryCatch({
            scater::calculatePCA(e, dims, subset_row=seq_len(nrow(e)),
        }, error=function(msg){
            reducedDim( scater::runPCA( SingleCellExperiment(list(logcounts=e)), 
                                        ncomponents=dims, ntop=nrow(e),
                                        BSPARAM=BiocSingular::IrlbaParam()) )
  if(is.list(pca)) pca <- pca$x
  row.names(pca) <- colnames(e)
  ex <- getExpectedDoublets(clusters, dbr)
  d <- .evaluateKNN(pca, ctype, ado2, expected=ex, k=k, BPPARAM=BPPARAM, 
  if(is.list(clusters)) clusters <- clusters$k
  d[colnames(sce),"cluster"] <- clusters
  d$lsizes <- lsizes
  d$nfeatures <- nfeatures
  d$src <- src
  if(use.cxds) d$cxds_score <- cxds_score
  ## classify
  d <- .scDblscore(cbind(d, pca[,includePCs,drop=FALSE]), scoreType=score, 
                   threshold=threshold, dbr=dbr, dbr.sd=dbr.sd, nrounds=nrounds,
                   max_depth=max_depth, iter=iter, BPPARAM=BPPARAM,
                   verbose=verbose, metric=metric)

  if(returnType=="table") return(d)
      sce_out <- SingleCellExperiment(list(
          counts=cbind(counts(sce), ad[row.names(sce),])), colData=d)
      reducedDim(sce_out, "PCA") <- pca
      if(is(d,"DataFrame") && !is.null(metadata(d)$scDblFinder.stats)) 
        metadata(sce_out)$scDblFinder.stats <- metadata(d)$scDblFinder.stats
  rowData(orig)$scDblFinder.selected <- row.names(orig) %in% sel_features
  .scDblAddCD(orig, d)

#' @importFrom BiocNeighbors AnnoyParam
.evaluateKNN <- function(pca, ctype, origins, expected, k, 
                         BPPARAM=SerialParam(), verbose=TRUE){
  if(verbose) message("Finding KNN...")
  knn <- suppressWarnings(findKNN(pca, max(k), BPPARAM=BPPARAM,
  if(verbose) message("Evaluating cell neighborhoods...")
  knn$type <- matrix(as.numeric(ctype)[knn$index]-1, nrow=nrow(knn$index))
  knn$orig <- matrix(origins[knn$index], nrow=nrow(knn[[1]]))
  if(any(w <- knn$distance==0))
    knn$distance[w] <- min(knn$distance[knn$distance[,1]>0,1])
  md <- max(knn$distance[,1])
  dr <- t(vapply(seq_len(nrow(knn$distance)), FUN.VALUE=numeric(2L), 
                 w <- knn$type[x,]==1
                 dA <- ifelse(length(wA <- which(w))==0, 2*md,
                 dB <- ifelse(length(wB <- which(!w))==0, 2*md,
  dw <- sqrt(max(k)-seq_len(max(k))) * 1/knn$distance
  dw <- dw/rowSums(dw)
  d <- data.frame( row.names=row.names(pca), type=ctype, cluster=NA, 
                   .getMostLikelyOrigins(knn, origins) )
      for(ki in rev(k)[-1])
          d[[paste0("ratio.k",ki)]] <- rowSums(knn$type[,seq_len(ki)])/ki
  w <- which(d$type=="doublet")
  class.weighted <- vapply( split(d$weighted[w], d$mostLikelyOrigin[w]), 
                            FUN.VALUE=numeric(1L), FUN=mean )
  d$difficulty <- 1
  w <- which(!is.na(d$mostLikelyOrigin))
  d$difficulty[w] <- 1-class.weighted[d$mostLikelyOrigin[w]]
  #d$difficulty <- .knnSmooth(knn, d$difficulty, use.distance=FALSE)
  d$expected <- expected[d$mostLikelyOrigin]
  ob <- table(d$mostLikelyOrigin)
  d$observed <- ob[d$mostLikelyOrigin]
  w <- which(is.na(d$mostLikelyOrigin))
  d$observed[w] <- d$expected[w] <- 0
  list(knn=knn, d=d)

#' @importFrom stats quantile weighted.mean
.knnSmooth <- function(knn, score, use.distance=TRUE, type=NULL){
  w <- seq_len(ncol(knn$index))
    mind <- quantile(knn$distance[,1], probs=0.1)
    if(mind==0) mind <- 0.5
  vapply(seq_len(nrow(knn$index)), FUN.VALUE=numeric(1L), FUN=function(i){
    x <- knn$index[i,]
      w <- knn$type[i,]==type
    if(sum(w)==0) return(score[i])
    x <- x[w]
      weights <- mind+c(0,knn$distance[i,][w])
      weights <- 1/sqrt(weights)
      weights <- 1/seq_len(1+length(x))

#' @importFrom S4Vectors DataFrame metadata
#' @importFrom stats predict quantile
.scDblscore <- function(d, scoreType="xgb", nrounds=NULL, max_depth=6, iter=2,
                        threshold=TRUE, verbose=TRUE, dbr=NULL, features=NULL,
                        metric="logloss", BPPARAM=SerialParam(), ...){
  gdbr <- dbr
      sl <- sum(d$src=="real")
      ## estimate a global doublet rate
      sl <- as.numeric(table(d$sample, d$src=="real")[,2])
    gdbr <- (0.01*sl/1000)
    gdbr <- sum(gdbr*sl)/sum(sl)
  if(scoreType %in% c("xgb.local.optim","xgb")){
    if(verbose) message("Training model...")
    d$score <- NULL
      prds <- setdiff(colnames(d), c("mostLikelyOrigin","originAmbiguous",
                                     "distanceToNearestDoublet", "type",
      if(length(mis <- setdiff(features, colnames(d)))>0)
        warning("The following features were not found: ", 
                paste(mis,collapse=", "))
      prds <- setdiff(intersect(features, colnames(d)),c("type","src","class"))
    w <- which(d$type=="real")
    d$score <- 0
    d$score[w] <- ecdf(d$ratio[w])(d$ratio[w])
    if(!is.null(d$cxds_score)) d$score[w] <- d$score[w]+
    w <- which(d$type=="real" & d$score >= quantile(d$score[w], 1-gdbr))
    d$include.in.training <- TRUE
    d$include.in.training[w] <- FALSE
      # remove cells with a high chance of being doublets from the training
      w <- which(d$type=="real" & 
                   d$score >= quantile(d$score[which(d$type=="real")], 1-gdbr))
      d$score <- tryCatch({
        fit <- .xgbtrain(d[-w,prds], d$type[-w], nrounds, metric=metric,
        predict(fit, as.matrix(d[,prds]))
      }, error=function(e) d$score)
      iter <- iter-1

          d$score <- d$ratio
          d$score <- d$weighted
  d <- DataFrame(d)
      if(verbose) message("Finding threshold...")
      if(!is.null(d$sample) && is.null(dbr) && scoreType!="xgb.local.optim"){
          # per-sample thresholding
          th <- lapply(split(seq_len(nrow(d)), d$sample),
                           x <- d[x,c("cluster","src","type","mostLikelyOrigin",
                           dbr <- 0.01*sum(x$src=="real",na.rm=TRUE)/1000
                           doubletThresholding(x, local=FALSE, dbr=dbr, ...)
          th.stats <- lapply(th, FUN=function(x) x$stats)
          th <- vapply(th, FUN=function(x) x$th, FUN.VALUE=numeric(1))
          d$class <- ifelse(d$score >= th[d$sample], "doublet", "singlet")
          if(verbose) message("Thresholds found:\n", 
                                    collapse=", "))
          th <- doubletThresholding( d, local=scoreType=="xgb.local.optim", 
                                     dbr=gdbr, ... )
              d$score.global <- d$score
              d$score <- th$finalScores
          th.stats <- th$stats
          th <- th$th
          d$class <- ifelse(d$score >= th, "doublet", "singlet")
          if(verbose) message("Threshold found:", round(th,3))
      ## set class of known (i.e. inputted) doublets:
      d$class[d$src=="real" & d$type=="doublet"] <- "doublet"
      metadata(d)$scDblFinder.stats <- th.stats
      metadata(d)$scDblFinder.threshold <- th
      d$nearestClass <- factor(d$nearestClass, levels = 0:1, 
      dbr <- sum(d$class=="doublet" & d$src=="real")/sum(d$src=="real")
      if(verbose) message(sum(d$class=="doublet" & d$src=="real"), " (", 
                          round(100*dbr,1),"%) doublets called")

#' @import xgboost
.xgbtrain <- function(d2, ctype, nrounds=NULL, max_depth=6, nfold=5, 
                      subsample=0.6, nthreads=1, metric="logloss", ...){
  if(!is.integer(ctype)) ctype <- as.integer(ctype)-1
  d2 <- as.matrix(d2)
      # use cross-validation
      res <- xgb.cv(data=d2, label=ctype, nrounds=500, max_depth=max_depth, 
                    objective="binary:logistic", eval_metric=metric,
                    early_stopping_rounds=3, tree_method="hist", nfold=nfold,
                    metrics=list("aucpr","error"), subsample=subsample, 
                    nthread=nthreads, verbose=FALSE, ...)
      ni = res$best_iteration
      ac = res$evaluation_log$test_error_mean[ni] + 
          1 * res$evaluation_log$test_error_std[ni]
      nrounds = min(which(res$evaluation_log$test_error_mean <= ac))
  xgboost( d2, ctype, nrounds=nrounds, eval_metric=metric, 
           objective="binary:logistic", max_depth=max_depth,
           early_stopping_rounds=3, verbose=FALSE, nthread=nthreads,
           ... )

# add the relevant fields of the scDblFinder results table to the SCE
#' @importFrom stats relevel
.scDblAddCD <- function(sce, d){
  d <- d[colnames(sce),]
  for(f in c("sample","cluster","class","score","score.global","ratio",
    if(!is.null(d[[f]])) sce[[paste0("scDblFinder.",f)]] <- d[[f]]
  if(!is.null(sce$scDblFinder.class)) sce$scDblFinder.class <- 
  if(is(d,"DataFrame") && !is.null(metadata(d)$scDblFinder.stats))
      metadata(sce)$scDblFinder.stats <- metadata(d)$scDblFinder.stats

.checkSCE <- function(sce){
  if(is(sce, "SummarizedExperiment")){
    sce <- as(sce, "SingleCellExperiment")
  }else if(!is(sce, "SingleCellExperiment")){
    if(is.null(dim(sce)) || any(sce<0))
      stop("`sce` should be a SingleCellExperiment, a SummarizedExperiment, ",
           "or an array (i.e. matrix, sparse matric, etc.) of counts.")
    message("Assuming the input to be a matrix of counts or expected counts.")
    sce <- SingleCellExperiment(list(counts=sce))
  if( !("counts" %in% assayNames(sce)) ) 
      stop("`sce` should have an assay named 'counts'")
  counts(sce) <- as(counts(sce),"dgCMatrix")
      colnames(sce) <- paste0("cell",seq_len(ncol(sce)))
      row.names(sce) <- paste0("f",seq_len(nrow(sce)))

Try the scDblFinder package in your browser

Any scripts or data that you put into this service are public.

scDblFinder documentation built on Nov. 8, 2020, 5:48 p.m.