calculateContaminationFraction: Calculate the contamination fraction


calculateContaminationFraction R Documentation

Calculate the contamination fraction

Description

This function computes the contamination fraction using two pieces of user-provided information. The first is a list of gene sets that can be biologically assumed to be absent in at least some cells in your data set. For example, these might be haemoglobin genes or immunoglobulin genes, which should not be expressed outside of erythrocytes and antibody-producing cells respectively. The second is described in Details.

Usage

calculateContaminationFraction(
  sc,
  nonExpressedGeneList,
  useToEst,
  verbose = TRUE,
  forceAccept = FALSE
)

Arguments

sc

A SoupChannel object.

nonExpressedGeneList

A list containing sets of genes which can be assumed to be non-expressed in a subset of cells (see details).

useToEst

A boolean matrix of dimensions ncol(toc) x length(nonExpressedGeneList) indicating which gene sets can be assumed to be non-expressed in each cell, and hence which cells are used to estimate the contamination for each gene set. Column names must correspond to the names of nonExpressedGeneList. Usually produced by estimateNonExpressingCells.

verbose

If TRUE, print the best estimate of the contamination fraction.

forceAccept

Passed to setContaminationFraction.

Details

Secondly, this function needs to know which cells definitely do not express the gene sets described above. Continuing with the haemoglobin example, it needs to know which cells are erythrocytes producing haemoglobin mRNAs and which are non-erythrocytes that we can safely assume produce no such mRNAs. The assumption made is that any expression from a gene set in a cell marked as a "non-expressor" for that gene set must be derived from the soup. Therefore, the level of contamination present can be estimated from the amount of expression of these genes seen in these cells.

Most often, the gene sets are supplied by the user based on knowledge of the experiment, and the cells in which they are genuinely expressed are estimated using estimateNonExpressingCells. However, these cells can also be specified directly through useToEst if other information is available.
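
When the non-expressing cells are known from other information, the matrix passed as useToEst can be built by hand. The following is a minimal sketch in base R, assuming that rows are cells, columns are gene sets, and TRUE marks a cell that is assumed not to express the gene set and so can be used for estimation. The cell names and the isErythrocyte annotation are hypothetical placeholders, not part of SoupX.

```r
# Hypothetical cell barcodes and cell-type annotation; replace with your own.
cells = c('cell1', 'cell2', 'cell3', 'cell4')
isErythrocyte = c(TRUE, FALSE, FALSE, TRUE)
geneList = list(HB = c('HBB', 'HBA2'))

# One row per cell, one column per gene set.
# Assumption: TRUE means the cell does NOT express the set,
# so any counts seen there are attributed to the soup.
useToEst = matrix(!isErythrocyte,
                  nrow = length(cells),
                  ncol = length(geneList),
                  dimnames = list(cells, names(geneList)))
print(useToEst)
```

In real use the row names would need to match the cell names of the SoupChannel object being corrected.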

Usually, there is very little variation in the contamination fraction within a channel and very little power to detect the contamination accurately at the single-cell level. As such, the default mode of operation simply estimates one value of the contamination fraction that is applied to all cells in a channel.

The global model fits a simple Poisson glm to the aggregated count data across all cells.
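
The idea behind such a global fit can be sketched with base R alone. This is an illustration on simulated data, not the package's own code: the variable names (nUMIs, soupFrac, expSoupCnts) and all numeric values are assumptions made for the example. Each cell's expected soup-derived count for the gene set is taken to be its total UMI count times the gene set's share of the soup, and an intercept-only Poisson GLM with that expectation as an offset estimates a single contamination fraction.

```r
# Simulated toy data: every value here is hypothetical.
set.seed(1)
nCells = 200
nUMIs = round(runif(nCells, 1000, 5000))  # total UMIs per cell
soupFrac = 0.02                           # share of the soup made up by the gene set
trueRho = 0.1                             # contamination fraction used to simulate
expSoupCnts = nUMIs * soupFrac            # counts expected if a cell were pure soup
counts = rpois(nCells, trueRho * expSoupCnts)

# Intercept-only Poisson GLM with log link; the offset fixes the expected scale,
# so exp(intercept) is the estimated contamination fraction.
fit = glm(counts ~ 1, family = poisson(), offset = log(expSoupCnts))
rhoEst = exp(coef(fit))[[1]]

# For an intercept-only model the estimate equals the ratio of totals.
rhoRatio = sum(counts) / sum(expSoupCnts)
print(c(glm = rhoEst, ratio = rhoRatio))
```

Because the model has only an intercept, the fitted value coincides with observed counts divided by expected soup counts, summed over the cells used for estimation, which is why aggregating across cells gives a stable estimate even when individual cells carry very few counts.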

Finally, note that if you are not able to find a reliable set of genes to use for contamination estimation, or you do not trust the values produced, the contamination fraction can be manually set by the user using setContaminationFraction.
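
A minimal sketch of setting the value manually, assuming SoupX is installed and using the scToy object from the Examples section. The value 0.1 is an arbitrary illustration, not a recommendation.

```r
library(SoupX)
# Assign the same contamination fraction (10%, chosen arbitrarily) to every cell
sc = setContaminationFraction(scToy, 0.1)
# The value is stored per cell as rho in the metaData table
head(sc$metaData$rho)
```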

Value

A modified version of sc with estimates of the contamination (rho) added to the metaData table.

Examples

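library(SoupX)
# Common gene list in real-world data (for illustration; overwritten below)
geneList = list(HB = c('HBB', 'HBA2'))
# Gene list appropriate to the toy data
geneList = list(CD7 = 'CD7')
# Identify the cells that can be assumed not to express each gene set
ute = estimateNonExpressingCells(scToy, geneList)
# Estimate the contamination fraction; rho is added to sc$metaData
sc = calculateContaminationFraction(scToy, geneList, ute)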

constantAmateur/SoupX documentation built on Nov. 2, 2022, 10:16 a.m.