CB2FindCell: Main function to distinguish real cells from empty droplets...

View source: R/CB2FindCell.R

Description

The main function of the scCB2 package. It distinguishes real cells from empty droplets using a clustering-based Monte-Carlo test.

Usage

CB2FindCell(
  RawDat,
  FDR_threshold = 0.01,
  lower = 100,
  upper = NULL,
  GeneExpressionOnly = TRUE,
  Ncores = 2,
  TopNGene = 30000,
  verbose = TRUE
)

Arguments

RawDat

Matrix. Supports standard matrix or sparse matrix. This is the raw feature-by-barcode count matrix.

FDR_threshold

Numeric between 0 and 1. Default: 0.01. The False Discovery Rate (FDR) to be controlled for multiple testing.

lower

Positive integer. Default: 100. All barcodes whose total count is below or equal to this threshold are defined as background empty droplets and are used to estimate the background distribution. The remaining barcodes are tested against the background distribution. If sequencing depth is deliberately made higher (lower) than usual, this threshold can be raised (lowered) correspondingly to obtain a reasonable number of cells. Recommended sequencing depth for the default threshold: 40,000 to 80,000 reads per cell.
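
As a rough illustration of how lower splits the barcodes, the sketch below uses base R on a small simulated matrix. The names toy, is_background, bg_prob and candidates are hypothetical, and the plain normalisation used for bg_prob is an assumption made for illustration; it is not the package's internal estimator.

```r
# Simulate a toy feature-by-barcode matrix: 50 genes, 150 shallow
# barcodes (20 counts each) and 50 deep barcodes (500 counts each).
set.seed(1)
n_genes <- 50
p <- rep(1 / n_genes, n_genes)
toy <- cbind(rmultinom(150, size = 20, prob = p),
             rmultinom(50, size = 500, prob = p))

lower <- 100
total <- colSums(toy)              # total count per barcode

# Barcodes at or below `lower` are pooled as background.
is_background <- total <= lower
bg_counts <- rowSums(toy[, is_background, drop = FALSE])
bg_prob <- bg_counts / sum(bg_counts)  # simple proportion estimate (assumption)

# Everything else remains a candidate for testing.
candidates <- toy[, !is_background, drop = FALSE]
table(is_background)
```

Raising lower moves more barcodes into the background pool, which is why the threshold should track sequencing depth.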

upper

Positive integer. Default: NULL. The upper threshold for large barcodes. All barcodes whose total counts are greater than or equal to the upper threshold are classified directly as real cells prior to testing. If upper = NULL, the knee point of the log-rank curve of barcode total counts serves as the upper threshold; it is calculated using the method from the DropletUtils package. If upper = Inf, no barcodes are classified as real cells prior to testing. If manually specified, it should be greater than the pooling threshold (lower).
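
The interplay of the two thresholds can be sketched as a three-way partition of barcodes by total count. The helper classify_barcodes below is hypothetical and written in base R only; it assumes upper has been given explicitly and does not reproduce the knee-point calculation used when upper = NULL.

```r
# Hypothetical helper: label each barcode by its total count.
#   total <= lower          -> background (used to estimate the background)
#   total >= upper          -> real_cell  (accepted without testing)
#   lower < total < upper   -> to_test    (goes through the tests)
classify_barcodes <- function(total, lower = 100, upper = 1000) {
  stopifnot(upper > lower)
  ifelse(total <= lower, "background",
         ifelse(total >= upper, "real_cell", "to_test"))
}

total <- c(3, 100, 101, 999, 1000, 25000)
classify_barcodes(total, lower = 100, upper = 1000)
# "background" "background" "to_test" "to_test" "real_cell" "real_cell"

# With upper = Inf nothing is accepted without testing.
classify_barcodes(total, lower = 100, upper = Inf)
```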

GeneExpressionOnly

Logical. Default: TRUE. For 10x Cell Ranger version >=3, extra features (surface proteins, cell multiplexing oligos, etc.) are measured simultaneously alongside genes. If GeneExpressionOnly = TRUE, only genes are used for testing. Removing extra features is recommended because the default pooling threshold (100) is chosen only for gene expression, and the expression level of extra features differs greatly from that of genes. If the default pooling threshold is used while extra features are kept, the estimated background distribution will be heavily biased and will not reflect the real background distribution of empty droplets.

Ncores

Positive integer. Default: 2. Number of cores for parallel computation.

TopNGene

Positive integer. Default: 30000. Number of top highly expressed genes to use. This threshold avoids a high number of false positives in ultra-high-dimensional datasets, e.g. 10x barnyard data.
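
A minimal base R sketch of this kind of filter is shown below. The helper keep_top_genes is hypothetical, and ranking genes by their total count across all barcodes is an assumption about what "highly expressed" means here; the package may rank genes differently.

```r
# Hypothetical helper: keep the top_n genes with the largest total counts,
# preserving the original row order of the retained genes.
keep_top_genes <- function(counts, top_n) {
  if (nrow(counts) <= top_n) return(counts)  # nothing to drop
  gene_totals <- rowSums(counts)
  keep <- order(gene_totals, decreasing = TRUE)[seq_len(top_n)]
  counts[sort(keep), , drop = FALSE]
}

toy <- matrix(c(5, 0, 1,
                0, 0, 2,
                9, 9, 9,
                1, 0, 0),
              nrow = 4, byrow = TRUE,
              dimnames = list(paste0("gene", 1:4), paste0("bc", 1:3)))

rownames(keep_top_genes(toy, top_n = 2))
# "gene1" "gene3"
```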

verbose

Logical. Default: TRUE. If verbose = TRUE, progress messages will be printed.

Details

Input data is a feature-by-barcode matrix. Background barcodes are defined based on lower. Large barcodes are automatically treated as real cells based on upper. The remaining barcodes are first clustered into subgroups, then tested against the background using Monte-Carlo p-values simulated from a multinomial distribution. Barcodes that are still unclassified after this step are further tested using EmptyDrops (Aaron T. L. Lun et al., 2019). FDR is controlled based on FDR_threshold.
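
The idea of a Monte-Carlo p-value simulated from a multinomial distribution can be sketched in base R as follows. This is a simplified illustration, not the package's implementation: the helper mc_pvalue, the toy background bg_prob, the choice to treat low Pearson correlation with the background as evidence of a real cell, and the use of Benjamini-Hochberg adjustment for FDR control are all assumptions.

```r
# Hypothetical helper: Monte-Carlo p-value for one count vector `obs`
# against background proportions `bg_prob`.
mc_pvalue <- function(obs, bg_prob, n_sim = 1000) {
  stat_obs <- cor(obs, bg_prob)  # Pearson correlation to the background
  # Simulate count vectors of the same size from the background.
  sims <- rmultinom(n_sim, size = sum(obs), prob = bg_prob)
  stat_sim <- apply(sims, 2, cor, y = bg_prob)
  # Share of simulations at least as dissimilar from the background.
  (1 + sum(stat_sim <= stat_obs, na.rm = TRUE)) / (n_sim + 1)
}

set.seed(42)
bg_prob <- c(0.4, 0.3, 0.2, 0.05, 0.05)   # toy background distribution
bg_like <- as.vector(rmultinom(1, size = 200, prob = bg_prob))
cell_like <- c(10, 10, 20, 80, 80)        # profile unlike the background

pvals <- c(bg_like = mc_pvalue(bg_like, bg_prob),
           cell_like = mc_pvalue(cell_like, bg_prob))

# Adjust for multiple testing and compare with the FDR threshold.
p.adjust(pvals, method = "BH") <= 0.01
```

The added 1 in the numerator and denominator keeps the p-value strictly positive, so its smallest possible value is 1 / (n_sim + 1).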

This function supports parallel computation. Ncores specifies the number of cores.

Under Cell Ranger version >=3, extra features other than genes are measured simultaneously (e.g. surface proteins, cell multiplexing oligos). We recommend filtering them out using GeneExpressionOnly = TRUE because the expression of extra features is not on the same scale as gene expression counts. If the default pooling threshold is used while extra features are kept, the estimated background distribution will be heavily biased and will not reflect the real background distribution of empty droplets. The resulting matrix will contain many barcodes that have almost zero gene expression and relatively high extra-feature expression, which are usually not useful for RNA-Seq studies.

Value

An object of class SummarizedExperiment. The slot assays contains the count matrix of real cell barcodes identified by the cluster-level test and the single-barcode-level test, plus the large barcodes that exceed the upper threshold. The slot metadata contains (1) test statistics (Pearson correlation to the background) for all candidate barcode clusters, (2) barcode IDs for all candidate barcode clusters, where the name of each cluster is its median barcode size, (3) test statistics (log likelihood under the background distribution) for the remaining single barcodes that were not clustered, and (4) the background distribution count vector without Good-Turing correction.

Examples

# load the package
library(scCB2)

# raw data, all barcodes
data(mbrainSub)
str(mbrainSub)

# run CB2 on the first 10000 barcodes
CBOut <- CB2FindCell(mbrainSub[,1:10000], FDR_threshold = 0.01, 
    lower = 100, Ncores = 2)
RealCell <- GetCellMat(CBOut, MTfilter = 0.05)

# real cells
str(RealCell)

scCB2 documentation built on Nov. 8, 2020, 5:48 p.m.