cleanTagCounts: Clean a tag-based dataset
In MarioniLab/DropletUtils: Utilities for Handling Single-Cell Droplet Data

Clean a tag-based dataset

Description

Remove low-quality libraries from a count matrix where each row is a tag and each column corresponds to a cell-containing barcode.

Usage

cleanTagCounts(x, ...)

## S4 method for signature 'ANY'
cleanTagCounts(
  x,
  controls,
  ...,
  ambient = NULL,
  exclusive = NULL,
  sparse.prop = 0.5
)

## S4 method for signature 'SummarizedExperiment'
cleanTagCounts(x, ..., assay.type = "counts")

Arguments

`x`	A numeric matrix-like object containing counts for each tag (row) in each cell (column). Alternatively, a SummarizedExperiment containing such a matrix.
`...`	For the generic, further arguments to pass to individual methods. For the SummarizedExperiment method, further arguments to pass to the ANY method. For the ANY method, further arguments to pass to `isOutlier`. These include `batch` to account for multi-batch experiments, and `nmads` to specify the stringency of the outlier-based filter.
`controls`	A vector specifying the rows of `x` corresponding to control tags. These are expected to be isotype controls that should not exhibit any real binding.
`ambient`	A numeric vector of length equal to `nrow(x)`, containing the relative concentration of each tag in the ambient solution. Defaults to `ambientProfileBimodal(x)` if not explicitly provided.
`exclusive`	A character vector of names of mutually exclusive tags that should never be expressed on the same cell. Alternatively, a list of vectors of mutually exclusive sets of tags - see `ambientContribNegative` for details.
`sparse.prop`	Numeric scalar specifying the minimum proportion of tags that should have non-zero counts in each cell.
`assay.type`	Integer or string specifying the assay containing the count matrix.

Details

We remove cells for which there is no detectable ambient contamination. Specifically, we expect non-zero counts for most tags in every cell, because tag-based data is deeply sequenced. If a proportion of sparse.prop or more of the tags have zero counts in a cell, this indicates a failure in library preparation for that cell.
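The sparsity check described above can be sketched in base R. This is a minimal illustration and not the package's internal code: the toy matrix is made up, and treating the sparse.prop boundary as inclusive is an assumption taken from the "or more" wording.

```r
# Toy count matrix: 4 tags (rows) by 3 cells (columns). Values are made up.
x <- cbind(
    cell1 = c(12, 30, 8, 15),  # every tag detected
    cell2 = c(0, 0, 0, 5),     # 3 of 4 tags are zero
    cell3 = c(0, 0, 9, 11)     # exactly half of the tags are zero
)

sparse.prop <- 0.5

# Proportion of tags with zero counts in each cell.
prop.zero <- colMeans(x == 0)

# Flag cells where that proportion reaches sparse.prop.
# ASSUMPTION: the boundary is inclusive (">="); the package may differ.
zero.ambient <- prop.zero >= sparse.prop
zero.ambient
```

With only a handful of tags, a single dropout moves a cell a long way toward the threshold, so the default of 0.5 may deserve a second look for very small panels.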

We also remove cells for which the total control count is unusually high. The control coverage is used as a proxy for non-specific binding, most notably from contamination of droplets by protein aggregates. High levels of non-specific activity are undesirable as this masks the actual marker profile of affected cells. The upper threshold is defined with isOutlier on the log-total control count.
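As a rough sketch of the control-based filter, the snippet below flags cells whose log-total control count lies more than a few median absolute deviations above the median. It uses base R only and stands in for isOutlier rather than reproducing it: the toy counts, the control row names, the use of log2, and the default of nmads = 3 are all assumptions made for illustration.

```r
# Toy counts: 4 tags by 8 cells. The last cell has inflated control counts.
x <- rbind(
    tagA  = c(120, 95, 110, 130, 105, 90, 115, 100),
    tagB  = c(40, 55, 35, 60, 45, 50, 38, 52),
    ctrl1 = c(4, 6, 5, 3, 5, 4, 6, 150),
    ctrl2 = c(5, 4, 6, 5, 4, 6, 5, 170)
)
controls <- c("ctrl1", "ctrl2")  # hypothetical isotype control rows
nmads <- 3                       # assumed stringency

# Total control count per cell.
sum.controls <- colSums(x[controls, , drop = FALSE])

# Work on the log scale; all totals are positive in this toy example.
log.total <- log2(sum.controls)

# Upper threshold: median plus nmads scaled MADs.
threshold <- median(log.total) + nmads * mad(log.total)
high.controls <- log.total > threshold
high.controls
```

Only an upper threshold is applied here, since a low control total is not a problem for this filter.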

If controls is missing, we instead compute the ambient scaling factor for each cell. This represents the amount of ambient contamination (see ?ambientContribSparse for more details), and cells with unusually high values are assumed to be affected by protein aggregates. High outliers are again identified with isOutlier, this time on the log of the ambient scaling factor, and removed.

If controls is missing and exclusive is specified, the ambient scaling factor is computed by ambientContribNegative instead. This can be helpful for explicitly removing cells with impossible marker combinations, though it is only as comprehensive as the knowledge of mutually exclusive marker sets.

Value

A DataFrame with one row per column of x, containing the following fields:

zero.ambient, a logical field indicating whether each cell has zero ambient contamination.
sum.controls, a numeric field containing the sum of counts for all control features. Only present if controls is supplied.
high.controls, a logical field indicating whether each cell has an unusually high control total. Only present if controls is supplied.
ambient.scale, a numeric field specifying the relative amount of ambient contamination. Only present if controls is not supplied.
high.ambient, a logical field indicating whether each cell has unusually high ambient contamination. Only present if controls is not supplied.
discard, a logical field indicating whether a column in x should be discarded.
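Because the result has one row per column of x, the discard field can be used directly to subset the count matrix. The sketch below shows that step in base R. The data.frame is a hypothetical stand-in for the returned DataFrame, and its values are made up for illustration.

```r
# Toy counts: 2 tags by 4 cells.
x <- matrix(1:8, nrow = 2,
    dimnames = list(c("tagA", "tagB"), paste0("cell", 1:4)))

# HYPOTHETICAL stand-in for the output of cleanTagCounts(x);
# a real result would also carry the diagnostic fields listed above.
df <- data.frame(
    discard = c(FALSE, TRUE, FALSE, TRUE),
    row.names = colnames(x)
)

# Rows of df line up with columns of x, so a logical index is enough.
stopifnot(nrow(df) == ncol(x))
x.clean <- x[, !df$discard, drop = FALSE]
colnames(x.clean)
```

Keeping drop = FALSE preserves the matrix shape even if only one cell survives the filter.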

Author(s)

Aaron Lun

Examples

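library(DropletUtils)

# Simulate three tags across 1000 cells; each tag is abundant in a subset of cells.
x <- rbind(
    rpois(1000, rep(c(100, 10), c(100, 900))),
    rpois(1000, rep(c(20, 100, 20), c(100, 100, 800))),
    rpois(1000, rep(c(30, 100, 30), c(200, 700, 100)))
)

# Adding a zero-ambient column plus a high-ambient column.
x <- cbind(0, x, 1000)

# No controls are supplied, so cells are assessed on their ambient scaling factor.
df <- cleanTagCounts(x)
df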

MarioniLab/DropletUtils documentation built on July 16, 2025, 1:57 p.m.