CB2FindCell | R Documentation |
The main function of the scCB2 package. Distinguishes real cells
from empty droplets using a clustering-based Monte-Carlo test.
CB2FindCell(
  RawDat,
  FDR_threshold = 0.01,
  lower = 100,
  upper = NULL,
  GeneExpressionOnly = TRUE,
  Ncores = 2,
  TopNGene = 30000,
  verbose = TRUE
)
RawDat |
Matrix. Supports standard matrix or sparse matrix. This is the raw feature-by-barcode count matrix. |
FDR_threshold |
Numeric between 0 and 1. Default: 0.01. The False Discovery Rate (FDR) to be controlled for multiple testing. |
lower |
Positive integer. Default: 100. All barcodes whose total count is below or equal to this threshold are defined as background empty droplets. They are used to estimate the background distribution. The remaining barcodes are tested against the background distribution. If sequencing depth is deliberately made higher (lower) than usual, this threshold can be raised (lowered) accordingly to obtain a reasonable number of cells. Recommended sequencing depth for the default threshold: 40,000 to 80,000 reads per cell. |
upper |
Positive integer. Default: |
GeneExpressionOnly |
Logical. Default: |
Ncores |
Positive integer. Default: 2. Number of cores for parallel computation. |
TopNGene |
Positive integer. Default: 30000. Number of top highly expressed genes to use. This threshold avoids a high number of false positives in ultra-high-dimensional datasets, e.g. 10x barnyard data. |
verbose |
Logical. Default: |
The input data is a feature-by-barcode matrix. Background barcodes are
defined based on lower. Large barcodes are automatically treated as real
cells based on upper. The remaining barcodes are first clustered into
subgroups, then tested against the background using Monte-Carlo p-values
simulated from a multinomial distribution. Barcodes that are still
unassigned are then tested individually using EmptyDrops
(Aaron T. L. Lun et al. 2019). FDR is controlled based on FDR_threshold.
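The single-barcode Monte-Carlo idea described above can be illustrated with a minimal base R sketch. It is not the scCB2 implementation: the toy count matrix is simulated, the helper mc_pvalue is hypothetical, add-one smoothing stands in for any correction the package applies, the clustering step is skipped, and Benjamini-Hochberg adjustment is assumed here as the FDR procedure.

```r
set.seed(1)

# Toy feature-by-barcode matrix (simulated): 50 genes x 300 barcodes.
n_genes   <- 50
bg_prob   <- rev(seq_len(n_genes)) / sum(seq_len(n_genes))  # background profile
cell_prob <- seq_len(n_genes) / sum(seq_len(n_genes))       # a distinct "cell" profile

empty   <- rmultinom(280, size = 20,  prob = bg_prob)    # small barcodes
ambient <- rmultinom(10,  size = 150, prob = bg_prob)    # larger, but background-like
cells   <- rmultinom(10,  size = 150, prob = cell_prob)  # larger, cell-like
counts  <- cbind(empty, ambient, cells)

# Barcodes with total count <= lower define the background.
lower  <- 100
totals <- colSums(counts)
is_bg  <- totals <= lower

# Estimated background distribution (add-one smoothing is an assumption).
bg_est <- rowSums(counts[, is_bg, drop = FALSE]) + 1
bg_est <- bg_est / sum(bg_est)

# Hypothetical helper: Monte-Carlo p-value for one barcode, using the
# multinomial log-likelihood under the background as the test statistic.
mc_pvalue <- function(x, prob, n_sim = 1000) {
  obs_ll <- dmultinom(x, prob = prob, log = TRUE)
  sims   <- rmultinom(n_sim, size = sum(x), prob = prob)
  sim_ll <- apply(sims, 2, dmultinom, prob = prob, log = TRUE)
  (1 + sum(sim_ll <= obs_ll)) / (n_sim + 1)
}

# Test every non-background barcode, then control FDR (BH assumed).
candidates <- which(!is_bg)
pvals  <- sapply(candidates, function(j) mc_pvalue(counts[, j], bg_est))
padj   <- p.adjust(pvals, method = "BH")
called <- candidates[padj <= 0.01]
```

In this toy setup, called holds the column indices of barcodes whose expression profile is unlikely under the estimated background. The smallest attainable p-value is 1 / (n_sim + 1), so n_sim bounds how strict an FDR threshold can be met.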
This function supports parallel computation. Ncores specifies the
number of cores.
Under CellRanger version >=3, extra features other than genes are
simultaneously measured (e.g. surface protein, cell multiplexing oligo).
We recommend filtering them out using GeneExpressionOnly = TRUE because
the expression of extra features is not on the same scale as gene
expression counts. If the default pooling threshold is used while keeping
extra features, the estimated background distribution will be heavily
biased and will not reflect the real background distribution of empty
droplets. The resulting matrix will contain many barcodes that have almost
zero gene expression and relatively high extra-feature expression, which
are usually not useful for RNA-Seq studies.
An object of class SummarizedExperiment. The slot assays contains the
real cell barcode matrix identified during the cluster-level test and the
single-barcode-level test, plus large cells that exceed the upper
threshold. The slot metadata contains
(1) test statistics (Pearson correlation to the background) for all
candidate barcode clusters, (2) barcode IDs for all candidate barcode
clusters, where the name of each cluster is its median barcode size,
(3) test statistics (log likelihood under the background distribution)
for the remaining single barcodes not clustered, (4) the background
distribution count vector without Good-Turing correction.
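As a rough illustration of how the two slots of a SummarizedExperiment can be read, the sketch below builds a small toy object by hand instead of calling CB2FindCell. The assay name "counts" and the metadata element name "background" are placeholders chosen for this example, not the names scCB2 actually uses; inspect names(metadata(CBOut)) on a real result to see those.

```r
library(SummarizedExperiment)

# Toy count matrix: 3 genes x 2 barcodes (made-up values).
m <- matrix(c(5L, 0L, 3L, 2L, 7L, 1L), nrow = 3,
            dimnames = list(c("geneA", "geneB", "geneC"), c("BC1", "BC2")))

# Hand-built object; "counts" and "background" are placeholder names.
toy <- SummarizedExperiment(
  assays   = list(counts = m),
  metadata = list(background = c(geneA = 10, geneB = 4, geneC = 6))
)

cell_mat  <- assay(toy, "counts")  # feature-by-barcode matrix of retained barcodes
barcodes  <- colnames(cell_mat)    # barcode IDs
meta_keys <- names(metadata(toy))  # names of the stored metadata elements
```

The same accessors, assay() and metadata(), apply to any SummarizedExperiment object.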
library(scCB2)

# raw data, all barcodes
data(mbrainSub)
str(mbrainSub)

# run CB2 on the first 10000 barcodes
CBOut <- CB2FindCell(mbrainSub[, 1:10000], FDR_threshold = 0.01,
                     lower = 100, Ncores = 2)

# extract the real cell matrix from the CB2 output
RealCell <- GetCellMat(CBOut, MTfilter = 0.05)

# real cells
str(RealCell)