# Model the technical coefficient of variation

### Description

Model the technical coefficient of variation as a function of the mean, and determine the significance of highly variable genes.

### Usage

1 2 3 4 5 6 | ```
## S4 method for signature 'matrix'
technicalCV2(x, is.spike, sf.cell=NULL, sf.spike=NULL,
cv2.limit=0.3, cv2.tol=0.8, min.bio.disp=0.25)
## S4 method for signature 'SCESet'
technicalCV2(x, spike.type=NULL, ..., assay="counts")
``` |

### Arguments

`x` |
A numeric matrix of counts, where each column corresponds to a cell and each row corresponds to a spike-in transcript. Alternatively, a SCESet object that contains such values. |

`is.spike` |
A vector indicating which rows of |

`sf.cell` |
A numeric vector containing size factors for endogenous genes. |

`sf.spike` |
A numeric vector containing size factors for spike-in transcripts. |

`cv2.limit, cv2.tol` |
Numeric scalars that determine the minimum mean abundance for the spike-in transcripts to be used for trend fitting. |

`min.bio.disp` |
A numeric scalar specifying the minimum biological dispersion. |

`spike.type` |
A character vector containing the names of the spike-in sets to use. |

`...` |
Additional arguments to pass to |

`assay` |
A string specifying which assay values to use. |

### Details

This function will estimate the squared coefficient of variation (CV2) and mean for each spike-in transcript.
A mean-dependent trend is fitted to the CV2 values for the transcripts using a Gamma GLM with `glmgam.fit`

.
Only high-abundance transcripts are used for stable trend fitting.
(Specifically, a mean threshold is selected by taking all transcripts with CV2 above `cv2.limit`

, and taking the quantile of this subset at `cv2.tol`

.
A warning will be thrown and all spike-ins will be used if the subset is empty.)

The trend is used to determine the technical CV2 for each endogenous gene based on its mean.
To identify highly variable genes, the null hypothesis is that the total CV2 for each gene is less than or equal to the technical CV2 plus `min.bio.disp`

.
Deviations from the null are identified using a chi-squared test.
The additional `min.bio.disp`

is necessary for a ratio-based test, as otherwise genes with large relative (but small absolute) CV2 would be favoured.

For `technicalCV2,matrix-method`

, the rows corresponding to spike-in transcripts are specified with `is.spike`

.
These rows will be used for trend fitting, while all other rows are treated as endogenous genes.
If either `sf.cell`

or `sf.spike`

are not specified, the `estimateSizeFactorsForMatrix`

function is applied to compute size factors.

For `technicalCV2,SCESet-method`

, only the named spike-in sets in `spike.type`

will be used for trend fitting.
If `spike.type=NULL`

, all spike-in sets listed in `x`

will be used.
Size factors for the endogenous genes are automatically extracted via `sizeFactors`

.
Spike-in-specific size factors for `spike.type`

are extracted from `x`

, if available; otherwise they are set to the size factors for the endogenous genes.
Note that the spike-in-specific factors must be the same for each set in `spike.type`

.

If `is.spike`

or `spike.type`

are `NA`

, all rows will be used for trend fitting, and (adjusted) p-values will be reported for all rows.
This should be used in cases where there are no spike-ins.
Here, the assumption is that most endogenous genes do not exhibit high biological variability and thus can be used to model technical variation.

### Value

A data frame is returned containing one row per row of `x`

(including both endogenous genes and spike-in transcripts).
Each row contains the following information:

`mean`

:A numeric field, containing mean (scaled) counts for all genes and transcripts.

`var`

:A numeric field, containing the variances for all genes and transcripts.

`cv2`

:A numeric field, containing CV2 values for all genes and transcripts.

`trend`

:A numeric field, containing the fitted value of the trend in the CV2 values. Note that the fitted value is reported for all genes and transcripts, but the trend is only fitted using the transcripts.

`p.value`

:A numeric field, containing p-values for all endogenous genes (

`NA`

for rows corresponding to spike-in transcripts).`FDR`

:A numeric field, containing adjusted p-values for all genes.

### Author(s)

Aaron Lun, based on code from Brennecke et al. (2013)

### References

Brennecke P, Anders S, Kim JK et al. (2013).
Accounting for technical noise in single-cell RNA-seq experiments.
*Nat. Methods* 10:1093-95

### See Also

`glmgam.fit`

,
`estimateSizeFactorsForMatrix`

### Examples

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 | ```
ngenes <- 10000
means <- 2^runif(ngenes, 6, 10)
dispersions <- 10/means + 0.2
nsamples <- 50
counts <- matrix(rnbinom(ngenes*nsamples, mu=means, size=1/dispersions), ncol=nsamples)
is.spike <- logical(ngenes)
is.spike[seq_len(500)] <- TRUE
out <- technicalCV2(counts, is.spike)
head(out)
plot(out$mean, out$cv2, log="xy")
points(out$mean, out$trend, col="red", pch=16, cex=0.5)
# Same again with an SCESet
rownames(counts) <- paste0("X", seq_len(ngenes))
colnames(counts) <- paste0("Y", seq_len(nsamples))
X <- newSCESet(countData=counts)
X <- calculateQCMetrics(X, list(Spikes=is.spike))
isSpike(X) <- "Spikes"
# Dummying up some size factors.
sizeFactors(X) <- 1
sizeFactors(X, type="Spikes") <- 1
out <- technicalCV2(X, spike.type="Spikes")
head(out)
``` |