chimericDrops: Remove chimeric molecules

Description Usage Arguments Details Value Author(s) References Examples

View source: R/chimericDrops.R

Description

Remove chimeric molecules within each cell barcode's library in a droplet experiment.

Usage

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
chimericDrops(sample, barcode.length = NULL, use.library = NULL, ...)

removeChimericDrops(
  cells,
  umis,
  genes,
  nreads,
  ref.genes,
  min.frac = 0.8,
  get.chimeric = FALSE,
  get.diagnostics = FALSE
)

Arguments

sample

String containing paths to the molecule information HDF5 files, produced by CellRanger for 10X Genomics data.

barcode.length

An integer scalar specifying the length of the cell barcode, see read10xMolInfo.

use.library

An integer vector specifying the library indices for which to extract molecules from sample. Alternatively, a character string specifying one or more library types, e.g., "Gene expression".

...

Further arguments to be passed to removeChimericDrops.

cells

Character vector containing cell barcodes, where each entry corresponds to one molecule.

umis

Integer vector containing encoded UMI sequences, see ?encodeSequences for details.

genes

Integer vector specifying the gene indices. Each index should refer to an element of ref.genes.

nreads

Integer vector containing the number of reads per molecule.

ref.genes

A character vector containing the names or symbols of all genes.

min.frac

A numeric scalar specifying the minimum fraction of reads required for a chimeric molecule to be retained.

get.chimeric

A logical scalar indicating whether the UMI counts corresponding to chimeric molecules should be returned.

get.diagnostics

A logical scalar indicating whether to return statistics for each molecule grouping.

Details

Chimeric molecules are occasionally generated during library preparation for highly multiplexed droplet experiments. Here, incomplete PCR products from one molecule hybridise to another molecule for extension using shared sequences like the poly-A tail for 3' protocols. This produces an amplicon where the UMI and cell barcode originate from one transcript molecule but the gene sequence is from another. If the second template is from another cell, this effect results in contamination of one cell's profile by another, similar to the contamination between samples discussed in ?swappedDrops.

Chimerism manifests as molecules that have the same UMI sequence and cell barcode but are assigned to different genes. To remove them, this function will simply discard all molecules within the same cell that share UMI sequences. Of course, this may also remove non-chimeric molecules that have the same UMI by chance, but for typical UMI lengths (10-12 bp for 10X protocols) we expect UMI collisions to be very rare between molecules from the same cell.

Nonetheless, to mitigate losses due to collisions, we retain any molecule that has a much greater number of reads compared to all other molecules with the same UMI in the same cell. This is based on the expectation that the original non-chimeric molecule will have undergone more rounds of PCR amplification compared to its chimeric offspring, and thus will have higher read coverage. For all molecules with the same UMI within a given cell, we compute the proportion of reads assigned to each molecule and we keep the molecule with a proportion above min.frac. If no molecule passes this threshold, the entire set is discarded.

The use.library argument can be used to only check for chimeras within a given feature type, e.g., CRISPR tags. This is most relevant in situations where sample contains multiple libraries that involve different sets of shared sequences, such that chimeras are unlikely to form between molecules from different libraries. Analysis of just one library can be achieved by setting use.library to the name or index of the desired feature set.

Value

A list is returned with the cleaned entry, a sparse matrix containing the UMI count for each gene (row) and cell barcode (column) after removing chimeric molecules. All cell barcodes that were originally observed are reported as columns, though note that it is theoretically possible for some barcodes to contain no counts.

If get.chimeric=TRUE, a chimeric entry is returned in the list. This is a sparse matrix of UMI counts corresponding to the chimeric molecules. Adding the cleaned and chimeric matrices should yield the total UMI count prior to removal of swapped molecules.

If get.diagnostics=TRUE, the top-level list will also contain an additional diagnostics DataFrame. Each row corresponds to a group of molecules in the same cell with the same UMI. The DataFrame holds the number of molecules in the group, the sum of reads across all molecules in the group, and the proportion of reads assigned to the most sequenced molecule.

Author(s)

Aaron Lun

References

Dixit A. (2016). Correcting chimeric crosstalk in single cell RNA-seq experiments. biorXiv, https://doi.org/10.1101/093237

Examples

1
2
3
4
5
6
7
8
# Mocking up some 10x HDF5-formatted data.
curfile <- DropletUtils:::simBasicMolInfo(tempfile())

out <- chimericDrops(curfile)
dim(out$cleaned)

out2 <- chimericDrops(curfile, get.diagnostics=TRUE)
out2$diagnostics

DropletUtils documentation built on Feb. 4, 2021, 2:01 a.m.