autoEstCont: Automatically calculate the contamination fraction
In constantAmateur/SoupX: Single Cell mRNA Soup eXterminator

autoEstCont

R Documentation

Automatically calculate the contamination fraction

Description

This method rests on the idea that genes which are both highly expressed in the soup and marker genes for some population can be used to estimate the background contamination. Marker genes are identified using the tf-idf method (see quickMarkers). The contamination fraction is then calculated at the cluster level for each of these genes, and clusters are aggressively pruned to remove those that give implausible estimates.

Usage

autoEstCont(
  sc,
  topMarkers = NULL,
  tfidfMin = 1,
  soupQuantile = 0.9,
  maxMarkers = 100,
  contaminationRange = c(0.01, 0.8),
  rhoMaxFDR = 0.2,
  priorRho = 0.05,
  priorRhoStdDev = 0.1,
  doPlot = TRUE,
  forceAccept = FALSE,
  verbose = TRUE
)

Arguments

`sc`	The SoupChannel object.
`topMarkers`	A data.frame giving marker genes. Must be sorted by decreasing specificity of marker and include a column 'gene' that contains the gene name. If set to NULL, markers are estimated using `quickMarkers`.
`tfidfMin`	Minimum value of tfidf to accept for a marker gene.
`soupQuantile`	Only use genes that are at or above this expression quantile in the soup. This prevents inaccurate estimates due to using genes with poorly constrained contribution to the background.
`maxMarkers`	If we have heaps of good markers, keep only the best `maxMarkers` of them.
`contaminationRange`	Vector of length 2 that constrains the contamination fraction to lie within this range. Must be between 0 and 1. The high end of this range is passed to `estimateNonExpressingCells` as `maximumContamination`.
`rhoMaxFDR`	False discovery rate passed to `estimateNonExpressingCells`, to test if rho is less than `maximumContamination`.
`priorRho`	Mode of gamma distribution prior on contamination fraction.
`priorRhoStdDev`	Standard deviation of gamma distribution prior on contamination fraction.
`doPlot`	Create a plot showing the density of estimates?
`forceAccept`	Passed to `setContaminationFraction`. Should we allow very high contamination fractions to be used?
`verbose`	Be verbose?

Details

The set of marker genes is first filtered to include only those with a tf-idf value greater than tfidfMin. A higher tf-idf value implies a more specific marker. Specifically, a cut-off t implies that a marker gene has the property that geneFreqGlobal < exp(-t/geneFreqInClust). See quickMarkers. It may be necessary to decrease this value for data sets with few good markers.
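The equivalence between the tf-idf cut-off and the bound on geneFreqGlobal can be checked numerically. The following is a minimal sketch in base R, assuming tf-idf is defined as geneFreqInClust * log(1/geneFreqGlobal), which is the definition the bound above implies. The gene names and frequencies are made up for illustration.

```r
# Hypothetical gene frequencies (fraction of cells expressing each gene).
geneFreqInClust <- c(geneA = 0.90, geneB = 0.60, geneC = 0.30)  # within the cluster
geneFreqGlobal  <- c(geneA = 0.10, geneB = 0.40, geneC = 0.02)  # across all cells

tfidfMin <- 1

# Assumed tf-idf definition: term frequency times log inverse document frequency.
tfidf <- geneFreqInClust * log(1 / geneFreqGlobal)

# Filter on the tf-idf value directly ...
keepByTfidf <- tfidf > tfidfMin
# ... or on the equivalent bound for the global frequency.
keepByBound <- geneFreqGlobal < exp(-tfidfMin / geneFreqInClust)

# Both filters select the same genes.
stopifnot(identical(keepByTfidf, keepByBound))
print(round(tfidf, 3))
print(keepByTfidf)
```

Under this definition, a gene found in most cells of a cluster passes the cut-off even at a moderate global frequency, while a gene found in few cells of the cluster must be very rare globally.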

The set of marker genes is then filtered down to include only the genes that are highly expressed in the soup, as controlled by the soupQuantile parameter. Genes highly expressed in the soup provide a more precise estimate of the contamination fraction.
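The soup-expression filter can be pictured as a quantile threshold over the soup profile. This is a minimal base R sketch, not the package's internal code. The soup fractions, gene names and candidate marker list are hypothetical.

```r
# Hypothetical soup profile: fraction of background counts assigned to each gene.
soupFrac <- setNames(
  c(1, 2, 3, 5, 8, 12, 20, 35, 60, 100, 250, 504) / 1000,
  paste0("g", 1:12)
)

# Hypothetical candidate markers that already passed the tf-idf filter.
markers <- c("g4", "g9", "g11", "g12")

soupQuantile <- 0.9

# Expression level at the requested quantile of the soup profile.
cutoff <- unname(quantile(soupFrac, soupQuantile))

# Keep only markers at or above that level in the soup.
keptMarkers <- markers[soupFrac[markers] >= cutoff]
print(cutoff)
print(keptMarkers)
```

Lowering soupQuantile in this sketch admits more markers, at the cost of including genes whose background contribution is less well constrained.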

The pruning of implausible clusters is based on a call to estimateNonExpressingCells. The parameters maximumContamination=max(contaminationRange) and rhoMaxFDR are passed to this function. The defaults set here are calibrated to aggressively prune anything that has even the weakest of evidence that it is genuinely expressed.
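To give a feel for this kind of pruning, here is a rough base R sketch of a Poisson test with false discovery rate control. It is an illustration under stated assumptions, not the actual implementation of estimateNonExpressingCells. The data frame candidates, its columns obsCnt and soupCnt, and all counts are hypothetical. soupCnt stands for the counts expected if the cluster consisted entirely of soup.

```r
maximumContamination <- 0.8  # max(contaminationRange)
rhoMaxFDR <- 0.2

# Hypothetical cluster/gene pairs with observed and soup-expected counts.
candidates <- data.frame(
  pair    = c("clust1/geneA", "clust2/geneA", "clust1/geneB", "clust3/geneB"),
  obsCnt  = c(6, 120, 15, 70),
  soupCnt = c(100, 50, 200, 80)
)

# One-sided Poisson test: probability of seeing at least obsCnt counts if the
# gene came purely from contamination at the maximum allowed fraction.
candidates$pValue <- ppois(
  candidates$obsCnt - 1,
  lambda = maximumContamination * candidates$soupCnt,
  lower.tail = FALSE
)

# Benjamini-Hochberg adjustment across all pairs.
candidates$qValue <- p.adjust(candidates$pValue, method = "BH")

# Assumed rule: retain only pairs with no significant evidence of genuine expression.
candidates$keep <- candidates$qValue >= rhoMaxFDR
print(candidates)
```

In this toy data the second pair has far more counts than contamination alone could explain, so it is the only one pruned.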

For each cluster/gene pair the posterior distribution of the contamination fraction is calculated (based on a gamma prior, controlled by priorRho and priorRhoStdDev). These posterior distributions are aggregated to produce a final estimate of the contamination fraction. The logic behind this is that estimates from clusters that truly estimate the contamination fraction will cluster around the true value, while erroneous estimates will be spread out across the range (0,1) without a 'preferred value'. The most probable value of the contamination fraction is then taken as the final global contamination fraction.
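The aggregation step can be sketched in base R as follows. This is a simplified illustration, assuming counts are Poisson distributed with mean rho times the soup-expected count, so that a gamma prior gives a gamma posterior. The vectors obsCnt and expCnt and the probe grid rhoGrid are hypothetical. The first three pairs are constructed to agree on a value near 0.1, and the last two play the role of erroneous estimates.

```r
priorRho <- 0.05
priorRhoStdDev <- 0.1

# Gamma(shape k, scale theta) with mode (k - 1) * theta = priorRho
# and standard deviation sqrt(k) * theta = priorRhoStdDev.
v <- priorRho / priorRhoStdDev
k <- ((v + sqrt(v^2 + 4)) / 2)^2
theta <- priorRhoStdDev / sqrt(k)

# Hypothetical cluster/gene pairs: observed counts and counts expected at rho = 1.
obsCnt <- c(10, 12, 9, 40, 60)
expCnt <- c(100, 110, 95, 90, 80)

# Probe values of the contamination fraction within contaminationRange.
rhoGrid <- seq(0.01, 0.8, by = 0.001)

# Conjugate update: one gamma posterior density per cluster/gene pair.
posteriors <- sapply(seq_along(obsCnt), function(i) {
  dgamma(rhoGrid, shape = k + obsCnt[i], rate = 1 / theta + expCnt[i])
})

# Aggregate by averaging the densities, then take the most probable value.
aggregate <- rowMeans(posteriors)
rhoEst <- rhoGrid[which.max(aggregate)]
print(rhoEst)
```

Because the consistent pairs pile their density onto the same region while the outliers each peak elsewhere, the maximum of the averaged density lands near the value the consistent pairs agree on.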

Value

A modified SoupChannel object where the global contamination rate has been set. Information about the estimation is also stored in the slot fit.

Note

This function assumes that the channel contains multiple distinct cell types with different marker genes. If you try to run it on a channel with very homogeneous cells (e.g. a cell line, flow-sorted cells), you will likely get a warning, an error, and/or an extremely high contamination estimate. In such circumstances your best option is usually to manually set the contamination to something reasonable.

Examples

#scToy is assumed to be the toy SoupChannel object shipped with SoupX
library(SoupX)
#Use less specific markers
scToy = autoEstCont(scToy, tfidfMin = 0.8)
#Allow large contamination fractions to be allocated
scToy = autoEstCont(scToy, forceAccept = TRUE)
#Be quiet
scToy = autoEstCont(scToy, verbose = FALSE, doPlot = FALSE)

constantAmateur/SoupX documentation built on Nov. 2, 2022, 10:16 a.m.