swappedDrops: Clean barcode-swapped droplet data

View source: R/swappedDrops.R

swappedDropsR Documentation

Clean barcode-swapped droplet data

Description

Remove the effects of barcode swapping on droplet-based single-cell RNA-seq data, specifically 10X Genomics datasets.

Usage

swappedDrops(samples, barcode.length = NULL, use.library = NULL, ...)

removeSwappedDrops(
  cells,
  umis,
  genes,
  nreads,
  ref.genes,
  min.frac = 0.8,
  get.swapped = FALSE,
  get.diagnostics = FALSE,
  hdf5.out = FALSE
)

Arguments

samples

A character vector containing paths to the molecule information HDF5 files, produced by CellRanger for 10X Genomics data. Each file corresponds to one sample in a multiplexed pool.

barcode.length

An integer scalar specifying the length of the cell barcode, see read10xMolInfo.

use.library

An integer scalar specifying the library index for which to extract molecules from sample. Alternatively, a string specifying the library type, e.g., "Gene expression".

...

Further arguments to be passed to removeSwappedDrops.

cells

A list of character vectors containing cell barcodes. Each vector corresponds to one sample in a multiplexed pool, and each entry of the vector corresponds to one molecule.

umis

A list of integer vectors containing encoded UMI sequences, organized as described for cells. See ?encodeSequences to convert sequences to integers.

genes

A list of integer vectors specifying the gene indices, organized as described for cells. Each index should refer to an element of ref.genes.

nreads

A list of integer vectors containing the number of reads per molecule, organized as described for cells.

ref.genes

A character vector containing the names or symbols of all genes.

min.frac

A numeric scalar specifying the minimum fraction of reads required for a swapped molecule to be assigned to a sample.

get.swapped

A logical scalar indicating whether the UMI counts corresponding to swapped molecules should be returned.

get.diagnostics

A logical scalar indicating whether to return the number of reads for each molecule in each sample.

hdf5.out

Deprecated and ignored.

Details

Barcode swapping on the Illumina sequencer occurs when multiplexed samples undergo PCR re-amplification on the flow cell by excess primer with different barcodes. This results in sequencing of the wrong sample barcode and molecules being assigned to incorrect samples after debarcoding. With droplet data, there is the opportunity to remove such effects based on the combination of gene, UMI and cell barcode for each observed transcript molecule. It is very unlikely that the same combination will arise from different molecules in multiple samples. Thus, observation of the same combination across multiple samples is indicative of barcode swapping.

We can remove swapped molecules based on the number of reads assigned to each gene-UMI-barcode combination. From the total number of reads assigned to that combination, the fraction of reads in each sample is calculated. The sample with the largest fraction that is greater than min.frac is defined as the putative sample of origin to which the molecule is assigned. This assumes that the swapping rate is low, so the sample of origin for a molecule should contain the majority of the reads. In other all samples, reads for the combination are assumed to derive from swapping and do not contribute to the count matrix. Setting min.frac=1 will effectively remove all molecules that appear in multiple samples. We do not recommend setting min.frac lower than 0.5.

If diagnostics=TRUE, a diagnostics matrix is returned containing the number of reads per gene-UMI-barcode combination in each sample. Each row corresponds to a combination and each column corresponds to a sample. This can be useful for examining the level of swapping across samples on a molecule-by-molecule basis, though for the sake of memory, the actual identity of the molecules is not returned. By default, the matrix is returned as a HDF5Matrix, which reduces memory usage and avoids potential issues with integer overflow. If hdf5.out=FALSE, a sparse matrix is returned instead, which is faster but uses more memory.

swappedDrops is a wrapper around removeSwappedDrops that extracts the relevant data from the 10X Genomics molecule information file. For other types of droplet-based data, it may be more convenient to call removeSwappedDrops directly.

Value

A list is returned with the cleaned entry, itself a list of sparse matrices. Each matrix corresponds to a sample and contains the UMI count for each gene (row) and cell barcode (column) after removing swapped molecules. All cell barcodes that were originally observed are reported as columns, though note that it is possible for some barcodes to contain no counts.

If get.swapped=TRUE, a swapped entry is returned in the top-level list. This is a list containing sample-specific sparse matrices of UMI counts corresponding to the swapped molecules. Adding the cleaned and swapped matrices for each sample should yield the total UMI count prior to removal of swapped molecules.

If get.diagnostics=TRUE, the top-level list will also contain an additional diagnostics matrix.

Format of the molecule information file

swappedDrops makes a few assumptions about the nature of the data in each molecule information file. These are necessary to simplify downstream processing and are generally acceptable in most cases.

Each molecule information file should contain data from only a single 10X run. Users should not combine multiple samples into a single molecule information file. The function will emit a warning upon detecting multiple GEM groups from any molecule information file. Molecules with different GEM groups will not be recognised as coming from a different sample, though they will be recognised as being derived from different cell-level libraries.

In files produced by CellRanger version 3.0, an additional per-molecule field is present indicating the (c)DNA library from which the molecule was derived. Library preparation can be performed separately for different features (e.g., antibodies, CRISPR tags) such that one 10X run can contain data from multiple libraries. This allows for arbitrarily complicated multiplexing schemes - for example, gene expression libraries might be multiplexed together across one set of samples, while the antibody-derived libraries might be multiplexed across another different set of samples. For simplicity, we assume that multiplexing was performed across the same set of samples for all libraries therein.

If a different multiplexing scheme was applied for each library type, users can set use.library to only check for swapping within a given library type(s). For example, if the multiplexed set of samples for the gene expression libraries is different from the multiplexed set for the CRISPR libraries, one could run swappedDrops separately on each set of samples with use.library set to the corresponding type. This avoids having to take the union of both sets of samples for a single swappedDrops run, which could detect spurious swaps between samples that were never multiplexed together for the same library type.

Author(s)

Jonathan Griffiths, with modifications by Aaron Lun

References

Griffiths JA, Lun ATL, Richard AC, Bach K, Marioni JC (2018). Detection and removal of barcode swapping in single-cell RNA-seq data. Nat. Commun. 9, 1:2667.

See Also

read10xMolInfo, encodeSequences

Examples

# Mocking up some 10x HDF5-formatted data, with swapping.
curfiles <- DropletUtils:::simSwappedMolInfo(tempfile(), nsamples=3)

# Obtaining count matrices with swapping removed.
out <- swappedDrops(curfiles)
lapply(out$cleaned, dim)

out <- swappedDrops(curfiles, get.swapped=TRUE, get.diagnostics=TRUE)
names(out)


MarioniLab/DropletUtils documentation built on Nov. 12, 2023, 6:13 a.m.