Description Usage Arguments Details Value Adjusting abundances for ambient contamination Computing the logfold changes Use only on nonempty droplets Handling 2 or fewer samples Resolving combinatorial hashes Author(s) References See Also Examples
Demultiplex cell barcodes into their samples of origin based on the most abundant hash tag oligo (HTO). Also identify potential doublets based on the presence of multiple significant HTOs.
1 2 3 4 5 6 7 8 9 10 11 12  hashedDrops(
x,
ambient = NULL,
min.prop = 0.05,
pseudo.count = 5,
doublet.nmads = 3,
doublet.min = 2,
doublet.mixture = FALSE,
confident.nmads = 3,
confident.min = 2,
combinations = NULL
)

x 
A numeric/integer matrixlike object containing UMI counts. Rows correspond to HTOs and columns correspond to cell barcodes. Each barcode is assumed to correspond to a cell, i.e., cell calling is assumed to have already been performed. 
ambient 
A numeric vector of length equal to 
min.prop 
Numeric scalar to be used to infer the ambient profile when 
pseudo.count 
A numeric scalar specifying the minimum pseudocount when computing logfold changes. 
doublet.nmads 
A numeric scalar specifying the number of median absolute deviations (MADs) to use to identify doublets. 
doublet.min 
A numeric scalar specifying the minimum threshold on the logfold change to use to identify doublets. 
doublet.mixture 
Logical scalar indicating whether to use a 2component mixture model to identify doublets. 
confident.nmads 
A numeric scalar specifying the number of MADs to use to identify confidently assigned singlets. 
confident.min 
A numeric scalar specifying the minimum threshold on the logfold change to use to identify singlets. 
combinations 
An integer matrix specifying valid combinations of HTOs.
Each row corresponds to a single sample and specifies the indices of rows in 
Note that this function is still experimental; feedback is welcome.
The idea behind cell hashing is that cells from the same sample are stained with reagent conjugated with a single HTO. Cells are mixed across multiple samples and subjected to dropletbased singlecell sequencing. Cell barcode libraries can then be demultiplexed into individual samples based on whether their unique HTO is detected.
We identify the sample of origin for each cell barcode as that corresponding to the most abundant HTO. (See below for more details on exactly how “most abundant” is defined.) The logfold change between the largest and secondlargest abundances is also reported for each barcode, with large logfold changes representing confident assignment to a single sample. We also report the logfold change of the secondmost abundant HTO over the estimated level of ambient contamination. Large logfold changes indicate that the second HTO has greater abundance than expected, consistent with a doublet.
To facilitate quality control, we explicitly identify problematic barcodes as outliers on the relevant metrics.
By default, we identify putative doublets as those with LogFC2
values that are
(i) doublet.nmads
MADs above the median and (ii) greater than doublet.min
.
The hard threshold is moreorless arbitrary and aims to avoid overly aggressive detection
of large outliers in a naturally rightskewed distribution
(given that the logfold changes are positive by definition, and most of the distribution is located near zero).
Alternatively, if doublet.mixture=TRUE
, we fit a twocomponent mixture model to the LogFC2
distribution.
Doublets are identified as all members of the component with the larger mean.
This avoids the need for the arbitrary parameters mentioned above but only works when there are many doublets,
otherwise both components will be fitted to the nondoublet values.
(Initialization of the model assumes at least 5% doublets.)
Of the nondoublet libraries, we consider them to be confidently assigned to a single sample if their LogFC
values are (i) not less than confident.nmads
MADs below the median and (ii) greater than confident.min
.
The hard threshold is again arbitrary, but this time it aims to avoid insufficiently aggressive outlier detection 
typically from an inflation of the MAD when the LogFC
values are large, positive and highly variable.
A DataFrame with one row per column of x
, containing the following fields:
Total
, integer specifying the total count across all HTOs for each barcode.
Best
, integer specifying the HTO with the highest abundance for each barcode.
Second
, integer specifying the HTO with the secondhighest abundance.
LogFC
, numeric containing the logfold change between the abundances of the best and secondbest HTO.
LogFC2
, numeric containing the logfold change in the secondbest HTO over the ambient contamination.
Doublet
, logical specifying whether a barcode is a doublet.
Confident
, logical specifying whether a barcode is a confidently assigned singlet.
In addition, the metadata contains ambient
, a numeric vector containing the (estimate of the) ambient profile;
doublet.threshold
, the threshold applied to LogFC2
to identify doublets;
and confident.threshold
, the threshold applied to nondoublet LogFC
values to identify confident singlets.
If combinations
is specified, Best
instead specifies the sample (i.e., row index of combinations
).
The interpretation of LogFC
and LogFC2
are slightly different, and Second
is not reported  see details.
HTO abundances require some care to compute due to the presence of ambient contamination in each library. Ideally, the experiment would be performed in such a manner that the concentration of each HTO is the same. However, if one HTO is present at higher concentration in the ambient solution, this might incorrectly cause us to assign all barcodes to the corresponding sample.
To adjust for ambient contamination, we assume that the ambient contamination in each library follows the same profile as ambient
.
We further assume that a minority of HTOs in a library are actually driven by the presence of cell(s), the rest coming from the ambient solution.
We estimate the level of ambient contamination in each barcode by scaling ambient
, using a DESeqlike normalization algorithm to compute the scaling factor.
(The requisite assumption of a nonDE majority follows from the two assumptions above.)
We then subtract the scaled ambient proportions from the HTO count profile to remove the effect of contamination.
Abundances that would otherwise be negative are set to zero.
The scaling factor for each cell is defined by computing ratios between the HTO counts and ambient
, and taking the median across all HTOs.
However, this strict definition is only used when there are at least 5 HTOs being considered.
For experiments with 34 HTOs, we assume that higherorder multiplets are negligible and define the scaling factor as the thirdlargest ratio.
For experiments with only 2 HTOs, the secondmost abundant HTO is always used to estimate the ambient contamination.
Ideally, ambient
would be obtained from libraries that do not correspond to cellcontaining droplets.
For example, we could get this information from the metadata
of the emptyDrops
output,
had we run emptyDrops
on the HTO count matrix (see below).
Unfortunately, in some cases (e.g., public data), counts are provided for only the cellcontaining barcodes.
If ambient=NULL
, the profile is inferred from x
using inferAmbience
.
After subtraction of the ambient noise but before calculation of the logfold changes,
we need to add a pseudocount to ensure that the logfold changes are welldefined.
We set the pseudocount to the average ambient HTO count (i.e., the average of the scaled ambient
), effectively exploiting the ambient contamination as a natural pseudocount that scales with barcodespecific capture efficiency and sequencing depth.
(In libraries with low sequencing depth, we still enforce a minimum pseudocount of pseudo.count
.)
This scaling behavior is useful as it ensures that shrinkage of the logfold changes is not more severe for libraries that have not been sequenced as deeply. We thus avoid excessive variability in the logfold change distribution that would otherwise reduce the precision of outlier detection. The implicit assumption is that the number of contaminating transcript molecules is roughly the same in each droplet, meaning that any differences in ambient coverage between libraries reflect technical biases that would also affect cellderived HTO counts.
Another nice aspect of this entire procedure (subtraction and readdition) is that it collapses to a noop if the experiment is wellexecuted with identical concentrations of all HTOs in the ambient solution.
This function assumes that cell calling has already been performed, e.g., with emptyDrops
.
Specifically, x
should only contain columns corresponding to nonempty droplets.
If empty droplets are included, their logfold changes will simply reflect stochastic sampling in the ambient solution
and violate the assumptions involved in outlier detection.
If x
contains columns for both empty and nonempty droplets,
it is straightforward to simply run emptyDrops
on the HTO count matrix to identify the latter.
Note that some fiddling with the lower=
argument may be required,
depending on the sequencing depth of the HTO libraries.
If x
has no more than two rows, Doublet
, LogFC2
and doublet.threshold
are set to NA
.
Strictly speaking, doublet detection is not possible as the second HTO is always used to estimate the ambient scaling and thus LogFC2
is always zero.
However, Confident
calls are still available in the output of this function so assignment to the individual samples can still be performed.
In this scenario, the nonconfident assignments are probably also doublets, though this cannot be said with much certainty.
If x
has no more than one row, Confident
, LogFC
and confident.threshold
are set to NA
.
Obviously, if there is only one HTO, the identity of the assigned sample is a foregone conclusion.
In some applications, samples are labelled with a combination of HTOs to enable achieve greater multiplexing throughput.
This is accommodated by passing combinations
to specify the valid HTO combinations that were used for sample labelling.
Each row of combinations
corresponds to a sample and should contain nonduplicated row indices of x
corresponding to the HTOs used in that sample.
The calculation for the singleHTO case is then generalized for HTO combinations. The most important differences are that:
The reported LogFC
is now the logfold change between the nth most abundant HTO and the n+1th HTO,
where n is the number of HTOs in a valid combination.
This captures the dropoff in abundance beyond the expected number of HTOs.
The reported LogFC2
is now the logfold change of the n+1th HTO over the ambient solution.
This captures the high abundance of the morethanexpected number of HTOs when doublets are present.
Best
no longer refers to the row index of x
, but instead to the row index of combinations
.
This may contain NA
values if a particular combination of HTOs is observed but not present in the expected set.
Second
is no longer reported as we cannot conveniently determine the identity of the second sample.
We also generalize the edgecase behavior when there are not enough HTOs to support doublet detection.
Consider that an intersample doublet may result in either n + 1 to 2n abundant HTOs.
Estimation of the scaling factor will attempt to avoid using the top 2n ratios.
If nrow(x)
is equal to or less than n + 1, doublet statistics will not be reported at all,
i.e., Doublet
and LogFC2
are set to NA
.
Aaron Lun
Stoeckius M, Zheng S, HouckLoomis B et al. (2018) Cell Hashing with barcoded antibodies enables multiplexing and doublet detection for single cell genomics. Genome Biol. 19, 1:224
emptyDrops
, to identify which barcodes are likely to contain cells.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28  # Mocking up an example dataset with 10 HTOs and 10% doublets.
ncells < 1000
nhto < 10
y < matrix(rpois(ncells*nhto, 50), nrow=nhto)
true.sample < sample(nhto, ncells, replace=TRUE)
y[cbind(true.sample, seq_len(ncells))] < 1000
ndoub < ncells/10
next.sample < (true.sample[1:ndoub] + 1) %% nrow(y)
next.sample[next.sample==0] < nrow(y)
y[cbind(next.sample, seq_len(ndoub))] < 500
# Computing hashing statistics.
stats < hashedDrops(y)
# Doublets show up in the topleft, singlets in the bottom right.
plot(stats$LogFC, stats$LogFC2)
# Most cells should be singlets with low second logfold changes.
hist(stats$LogFC2, breaks=50)
# Identify confident singlets or doublets at the given threshold.
summary(stats$Confident)
summary(stats$Doublet)
# Checking against the known truth, in this case
# 'Best' contains the putative sample of origin.
table(stats$Best, true.sample)

Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.