importDemux: Extracts Demuxlet information into a pre-made...
In dittoSeq: User Friendly Single-Cell and Bulk RNA Sequencing Visualization


View source: R/Demuxlet_Tools.R

Extracts Demuxlet information into a pre-made SingleCellExperiment or Seurat object

importDemux(
  object,
  raw.cell.names = NULL,
  lane.meta = NULL,
  lane.names = NA,
  demuxlet.best,
  trim.before_ = TRUE,
  bypass.check = FALSE,
  verbose = TRUE
)

`object`	A pre-made Seurat(v3+) or SingleCellExperiment object to add demuxlet information to.
`raw.cell.names`	A string vector consisting of the raw cell barcodes of the object as they would have been output by cellranger aggr. Format per cell.name = NNN...NNN-# where NNN...NNN are the cell barcode nucleotides, and # is the lane number. This input should be used when additional information has been added directly into the cell names outside of Seurat's standard merge prefix: "user-text_".
`lane.meta`	A string which names a metadata slot that contains which cells came from which droplet-generation wells.
`lane.names`	String vector which sets how the lanes should be named (if you want to give them something different from the default = Lane1, Lane2, Lane3...)
`demuxlet.best`	String or String vector pointing to the location(s) of the .best output file(s) from running demuxlet. Alternatively, a data.frame representing an already imported .best matrix.
`trim.before_`	Logical which sets whether any characters in front of an "_" should be deleted from the `raw.cell.names` before matching with demuxlet barcodes.
`bypass.check`	Logical which sets whether the function should run even when meta.data slots would be over-written.
`verbose`	Logical which sets whether to print messages about the stage of the process that is currently being run, as well as the summary at the end.

The function takes in a previously generated Seurat or SingleCellExperiment (SCE) object.

It also takes in demuxlet information in one of three forms: (1) the location of a single demuxlet.best output file, (2) the locations of multiple demuxlet.best output files, or (3) a user-constructed data.frame created by reading in a demuxlet.best file.
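Form (3) can be pictured with a minimal sketch in base R. It assumes the .best file is tab-delimited with a header row. The temporary file, its two rows, and the reduced set of columns are hypothetical stand-ins for a real demuxlet output.

```r
# Hypothetical two-row .best table written to a temporary file for illustration;
# a real file has more columns and one row per barcode.
best_path <- tempfile(fileext = ".best")
writeLines(c(
    "BARCODE\tRD.TOTL\tN.SNP\tBEST\tPRB.DBL",
    "AAACCTGAGAAACCAT-1\t120\t45\tSNG-sampleA\t0.01",
    "AAACCTGAGAAACCGC-1\t300\t80\tDBL-sampleA-sampleB-0.500\t0.99"),
    best_path)

# Read the tab-delimited file, keeping text columns as character vectors
best <- read.table(
    best_path, header = TRUE, sep = "\t", stringsAsFactors = FALSE)

str(best)
```

A data.frame built this way could then be supplied as `demuxlet.best = best`.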

Then it matches barcodes and adds demuxlet-information to the Seurat or SCE as metadata.

For a note on how best to utilize this function with multi-lane droplet-based data, see the devoted section below.

Specifically:

1. If a metadata slot name is provided to lane.meta, information in that metadata slot is copied into a metadata slot called "Lane". Alternatively, if lane.meta is left as NULL, separate lanes are assumed to be marked by distinct values of "-#" at the end of cell names, as is the typical output of the 10X cellranger count & aggr pipeline.

(1a. If demuxlet.best was provided as a set of separate file locations (recommended usage in conjunction with 'cellranger aggr'), the "-#" at the ends of the BARCODE columns from these files are incremented on read-in so that they match the incrementation applied by cellranger aggr. See the section on multi-lane scRNAseq for more.)

2. Barcodes in the demuxlet .best data are then matched to barcodes in the object. The cell names, colnames(object), are used by default for this matching, but if these have been modified from what would have been given to demuxlet – outside of -# at the end or ***_'s at the beginning, as can be added in common merge functions – raw.cell.names can be provided and these cell names used instead.

3. Singlet/doublet/ambiguous calls and sample identities (1st only for doublets) are parsed and carried into metadata.

4. Finally, a summary of the results including mean number of SNPs and percentages of singlets and doublets is output unless verbose is set to FALSE.
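Steps 2 and 3 can be sketched in base R as follows. This is an illustration of the idea, not the function's internal code. It assumes that BEST values look like "SNG-<sample>", "DBL-<sample1>-<sample2>-<alpha>", or "AMB-<sample1>-...", and that trimming removes everything up to and including the last "_". All cell names, barcodes, and sample names here are hypothetical.

```r
# Hypothetical cell names, as a merge prefix ("lane1_") might leave them
cell_names <- c("lane1_AAAC-1", "lane1_CCCG-1", "lane1_GGGT-1")

# Hypothetical demuxlet output (only the two columns needed here)
best <- data.frame(
    BARCODE = c("GGGT-1", "AAAC-1", "TTTA-1"),
    BEST = c("DBL-sampleA-sampleB-0.500", "SNG-sampleB",
             "AMB-sampleA-sampleB-sampleC"),
    stringsAsFactors = FALSE)

# Step 2: drop everything up to and including the last "_", then match
trimmed <- sub("^.*_", "", cell_names)
idx <- match(trimmed, best$BARCODE)  # NA where a cell has no demuxlet row

# Step 3: split BEST on "-"; the first field is the call type and the
# second field is the first sample identity
parts <- strsplit(best$BEST[idx], "-", fixed = TRUE)
dbl_call <- vapply(parts, function(p) p[1], character(1))
sample_id <- vapply(parts, function(p) p[2], character(1))

data.frame(cell = cell_names, call = dbl_call, Sample = sample_id)
```

Note that a plain split on "-" would mis-parse sample names that themselves contain "-", so a real parser needs more care than this sketch takes.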

The Seurat or SingleCellExperiment object with metadata added for "Sample" calls and other relevant statistics.

Lane information and demuxlet calls and statistics are imported into the object as these metadata:

Lane = guided by lane.meta import input or "-#"s in barcodes, represents the separate droplet-generation lanes.
Sample = The sample call, parsed from the BEST column
demux.doublet.call = whether the sample was a singlet (SNG), doublet (DBL), or ambiguous (AMB), parsed from the BEST column
demux.RD.TOTL = RD.TOTL column
demux.RD.PASS = RD.PASS column
demux.RD.UNIQ = RD.UNIQ column
demux.N.SNP = N.SNP column
demux.PRB.DBL = PRB.DBL column
demux.barcode.dup = (Only generated when TRUEs will exist) whether a cell's barcode in the demuxlet.best referred to more than 1 cell in the object. (When TRUE, indicates that cells from distinct lanes were interpreted together by demuxlet. These will often be mistakenly called as doublets.)
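Once these metadata exist, the kind of summary described in step 4 can be approximated with base R. The sketch below works on a hypothetical data.frame that reuses the metadata names listed above. It is not the function's own summary code, and the values are invented for illustration.

```r
# Hypothetical per-cell metadata, using the column names listed above
meta <- data.frame(
    Lane = c("Lane1", "Lane1", "Lane2", "Lane2"),
    demux.doublet.call = c("SNG", "DBL", "SNG", "SNG"),
    demux.N.SNP = c(40, 90, 55, 35),
    stringsAsFactors = FALSE)

# Percentage of cells per call type
call_pct <- 100 * prop.table(table(meta$demux.doublet.call))

# Mean number of SNPs per cell
mean_snps <- mean(meta$demux.N.SNP)

# Calls tabulated by lane
calls_by_lane <- table(meta$Lane, meta$demux.doublet.call)

call_pct
mean_snps
calls_by_lane
```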

There are many different ways such data might initially be processed which will affect its accessibility to importDemux().

Initial Processing: 10X recommends running cellranger count individually for each well/lane. Non-10X droplet-based data from separate lanes should also be processed separately, at least for the steps of collecting reads for individual cells. NOT processing such droplet lanes separately will create artificial doublets from cells that ended up with similar barcodes, but in separate droplet-gen lanes. Thus, proper initial processing leads to creation of separate counts matrices for each droplet-generation lane.

Combining data from each lane: These per-lane counts matrices can be combined in various ways. All options will alter the cell barcode names in a way that makes them unique across lanes, but how this uniquification is achieved varies.

Counts table combination methods generally do not adjust BAM files, specifically the cell names embedded within the BAM files, which are what demuxlet uses for its BARCODE column. Thus, cell names may need to be modified in order to make the object's cell names and demuxlet.best's BARCODEs match.

Running Demuxlet: Demuxlet should also be run separately on the BAM files of each individual lane. Improperly running demuxlet on a combined BAM file can lead to loss of lane information and then to generation of artificial doublet calls for cells of distinct wells that received similar barcodes. The BAM file associated with each demuxlet run is what is used for generating the BARCODE column of the demuxlet output.

How importDemux() handles barcode matching: importDemux is built to work with the 'cellranger aggr' pipeline by default, but can be used for demuxlet datasets processed differently as well (Option 2).

Option 1: When you merge matrices of all lanes with cellranger aggr before R import, aggr's barcode uniquification method is to increment a "-1", "-2", "-3", ... "-#" that is appended to the end of all barcode names. The number is incremented for each successive lane. Note that lane numbers depend on the order in which lanes were supplied to cellranger aggr.
- to use: Simply supply to demuxlet.best a vector containing the locations of the separate '.best' outputs for each lane, in the same order that lanes were provided to aggr.
  
  importDemux will adjust the "-#" in the demuxlet.best BARCODEs automatically before performing the matching step.
Option 2: When you instead import your counts data into a Seurat or SingleCellExperiment, and then merge the separate objects into one, the uniquification method is dependent on your particular method.
- to use: For these methods, it is easiest to 1) import your counts data, 2) transfer in your demuxlet info with importDemux() to each lane's object individually (you can supply unique lane identifiers to the lane.names input), and then 3) merge the separate objects.
Extra notes for any alternative cases:
- For Seurat's merge(), user-defined strings can be appended to the start of the barcodes, followed by an "_". By default, importDemux() will ignore these prefixes, and this behavior can be controlled with the trim.before_ input.
- Alternatively, cell names that are consistent with the demuxlet.best BARCODEs can be supplied to the raw.cell.names input.
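The suffix adjustment described under Option 1 can be pictured with a short base R sketch. It assumes each per-lane .best file carries barcodes ending in "-1", as per-lane cellranger count output typically does. The barcodes are hypothetical, and this is an illustration of the idea rather than the function's internal code.

```r
# Hypothetical BARCODE vectors from two per-lane demuxlet runs,
# listed in the same order the lanes were given to cellranger aggr
lane_barcodes <- list(
    c("AAAC-1", "CCCG-1"),
    c("AAAC-1", "GGGT-1"))

# Replace the trailing "-1" with "-<lane index>"
adjusted <- lapply(seq_along(lane_barcodes), function(i) {
    sub("-1$", paste0("-", i), lane_barcodes[[i]])
})

all_barcodes <- unlist(adjusted)
all_barcodes
```

After this adjustment, the shared barcode "AAAC" is distinguishable between lanes ("AAAC-1" versus "AAAC-2"), which is why the order of the files matters.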

Daniel Bunis

Included QC visualizations:

demux.calls.summary for plotting the number of sample annotations assigned within each lane.

demux.SNP.summary for plotting the number of SNPs measured per cell.

Or, see Kang et al. Nature Biotechnology, 2018 https://www.nature.com/articles/nbt.4042 for more information about the demuxlet cell-sample deconvolution method.

#Prep: loading in an example dataset and sample demuxlet data
example("importDittoBulk", echo = FALSE)
demux <- demuxlet.example
colnames(myRNA) <- demux$BARCODE[seq_len(ncol(myRNA))]

###
### Method 1: Lanes info stored in a metadata
###

# Notice there is a groups metadata in this Seurat object.
getMetas(myRNA)
# We will treat this as if it holds Lane information

# Now, running importDemux:
myRNA <- importDemux(
    myRNA,
    lane.meta = "groups",
    demuxlet.best = demux)

# Note, importDemux can also take in the location of the .best file.
#   myRNA <- importDemux(
#       object = myRNA,
#       lane.meta = "groups",
#       demuxlet.best = "Location/filename.best")

# demux.SNP.summary() and demux.calls.summary() can now be used.
demux.SNP.summary(myRNA)
demux.calls.summary(myRNA)

###
### Method 2: cellranger aggr combined data (denoted with "-#" in barcodes)
###

# If cellranger aggr was used, lanes will be denoted by "-1", "-2", ... "-#"
#   at the ends of Seurat cellnames.
# Demuxlet should be run on each lane individually.
# Provided locations of each demuxlet.best output file, *in the same order
#   that lanes were provided to cellranger aggr* this function will then
#   adjust the "-#" within the .best BARCODEs automatically before matching
#
# myRNA <- importDemux(
#     object = myRNA,
#     demuxlet.best = c(
#         "Location/filename1.best",
#         "Location/filename2.best"),
#     lane.names = c("g1","g2"))