estimateNonExpressingCells: Calculate which cells genuinely do not express a particular...


estimateNonExpressingCells    R Documentation

Calculate which cells genuinely do not express a particular gene or set of genes

Description

Given a list of sets of correlated genes (e.g. haemoglobin genes, immunoglobulin genes, etc.), attempt to estimate which cells genuinely do not express each of these gene sets in turn. The central idea is that in cells that are not genuinely producing a class of mRNAs (such as haemoglobin genes), any observed expression of these genes must be due to ambient RNA contamination. As such, if we can identify these cells, we can use the observed level of expression of these genes to estimate the level of contamination.
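As a rough illustration of this idea, the sketch below uses invented numbers to show how gene-set counts in non-expressing cells translate into a contamination estimate: the observed counts divided by the counts expected if the cells consisted entirely of background. The names soupFrac, nUMIs, obs and rho are hypothetical and are not part of the SoupX API.

```r
# Toy numbers, invented for illustration only.
soupFrac = 0.02                 # fraction of background (soup) counts from the gene set
nUMIs = c(1000, 2000, 1500)     # total counts in three cells assumed not to express the set
obs = c(2, 4, 3)                # observed gene-set counts in those cells

# Counts expected if each cell were 100% background
expFromSoup = nUMIs * soupFrac
# Pooled estimate of the contamination fraction
rho = sum(obs) / sum(expFromSoup)
print(rho)
```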

Usage

estimateNonExpressingCells(
  sc,
  nonExpressedGeneList,
  clusters = NULL,
  maximumContamination = 1,
  FDR = 0.05
)

Arguments

sc

A SoupChannel object.

nonExpressedGeneList

A list containing sets of genes which will be used to estimate the contamination fraction.

clusters

A named vector indicating how to cluster cells. Names should be cell IDs, values cluster IDs. If NULL, we will attempt to load it from sc$metaData$clusters. If set to FALSE, each cell will be considered individually.

maximumContamination

The maximum contamination fraction that you would reasonably expect. The lower this value is set, the more aggressively cells are excluded from use in estimation.

FDR

The false discovery rate used by the Poisson test that identifies cells to exclude. A higher FDR means more aggressive exclusion.

Details

The ideal way to do this would be to have a prior annotation of your data indicating which cells are (for instance) red blood cells and genuinely express haemoglobin genes, and which are not and so only show haemoglobin expression due to contamination. If this is your circumstance, there is no need to run this function. You can instead pass a matrix encoding which cells are haemoglobin expressing and which are not to calculateContaminationFraction via the useToEst parameter.
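A minimal sketch of building such a matrix by hand from an existing annotation follows. The cell IDs and labels are invented, and the layout (one row per cell, one column per gene set, TRUE where the cell may be used for estimation) is an assumption that should be checked against the calculateContaminationFraction documentation.

```r
# Hypothetical annotation: cell IDs and labels are invented for illustration.
cellType = c(cell1 = 'RBC', cell2 = 'T', cell3 = 'B', cell4 = 'RBC')
geneList = list(HB = c('HBB', 'HBA2'))

# Assumed layout: one row per cell, one column per gene set,
# TRUE where the cell may be used to estimate contamination.
useToEst = matrix(cellType != 'RBC', ncol = length(geneList),
                  dimnames = list(names(cellType), names(geneList)))
print(useToEst)
```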

This function will use a conservative approach to excluding cells that it thinks may express one of your gene sets. This is because falsely including a cell in the set of non-expressing cells may erroneously inflate your estimated contamination, whereas failing to include a genuine non-expressing cell in this set has no significant effect.

To this end, this function will exclude any cluster of cells in which any cell is deemed to have genuine expression of a gene set. Clustering of data is beyond the scope of this package, but can be performed by the user. In the case of 10X data mapped using cellranger and loaded using load10X, the cellranger graph-based clustering is automatically loaded and used.
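The cluster-level rule can be sketched in base R as follows. This is an illustration with invented per-cell calls, not the package's implementation, and the names expressing, clusterExpressing and useCell are hypothetical.

```r
# Hypothetical per-cell calls for one gene set: TRUE = cell deemed to express it.
expressing = c(c1 = FALSE, c2 = TRUE, c3 = FALSE, c4 = FALSE, c5 = FALSE)
clusters = c(c1 = 'A', c2 = 'A', c3 = 'B', c4 = 'B', c5 = 'C')

# A cluster is excluded if any one of its cells is deemed to express the set.
clusterExpressing = tapply(expressing, clusters[names(expressing)], any)
# Map the cluster-level decision back to individual cells.
useCell = !as.vector(clusterExpressing[clusters])
names(useCell) = names(clusters)
print(useCell)
```

Here cell c1 is dropped even though it shows no expression itself, because it shares cluster A with c2.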

To decide if a cell is genuinely expressing a set of genes, a Poisson test is used. This tests whether the observed expression is greater than maximumContamination times the number of counts expected for the gene set if the cell were assumed to be derived wholly from the background. This process can be made less conservative (i.e., excluding fewer cells/clusters) by either increasing the value of the maximum contamination the user believes is plausible (maximumContamination) or making the significance threshold for the test more strict (by reducing FDR).
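A minimal sketch of such a test, using only base R and invented inputs, is given below. It is not the package's code: soupFrac, nUMIs and obs are hypothetical names, and the Benjamini-Hochberg adjustment is an assumed choice of multiple-testing correction.

```r
# Toy inputs, invented for illustration only.
soupFrac = 0.02                        # fraction of background counts from the gene set
nUMIs = c(c1 = 1000, c2 = 1000, c3 = 2000)
obs = c(c1 = 3, c2 = 60, c3 = 8)       # observed gene-set counts per cell
maximumContamination = 0.2
FDR = 0.05

# Expected counts if the cell were wholly background, scaled by the
# largest contamination fraction considered plausible.
expected = nUMIs * soupFrac * maximumContamination
# One-sided Poisson test: P(X >= obs) when X ~ Poisson(expected)
pVal = ppois(obs - 1, expected, lower.tail = FALSE)
# Benjamini-Hochberg adjustment (an assumed choice of correction)
qVal = p.adjust(pVal, method = 'BH')
# TRUE = cell deemed to genuinely express the gene set
expressing = qVal < FDR
print(expressing)
```

With these numbers only c2, whose 60 observed counts far exceed the 4 expected, is flagged as genuinely expressing the gene set.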

Value

A matrix indicating which cells are to be used to estimate contamination for each set of genes. Typically passed to the useToEst parameter of calculateContaminationFraction or plotMarkerMap.

See Also

calculateContaminationFraction, plotMarkerMap

Examples

#Common gene list in real world data
geneList = list(HB=c('HBB','HBA2'))
#Gene list appropriate to toy data
geneList = list(CD7 = 'CD7')
ute = estimateNonExpressingCells(scToy,geneList)

constantAmateur/SoupX documentation built on Nov. 2, 2022, 10:16 a.m.