Model the technical coefficient of variation

Description

Model the technical coefficient of variation as a function of the mean, and determine the significance of highly variable genes.

Usage

1
2
3
4
5
6
## S4 method for signature 'matrix'
technicalCV2(x, is.spike, sf.cell=NULL, sf.spike=NULL, 
    cv2.limit=0.3, cv2.tol=0.8, min.bio.disp=0.25) 

## S4 method for signature 'SCESet'
technicalCV2(x, spike.type=NULL, ..., assay="counts")

Arguments

x

A numeric matrix of counts, where each column corresponds to a cell and each row corresponds to a spike-in transcript. Alternatively, a SCESet object that contains such values.

is.spike

A vector indicating which rows of x correspond to spike-in transcripts.

sf.cell

A numeric vector containing size factors for endogenous genes.

sf.spike

A numeric vector containing size factors for spike-in transcripts.

cv2.limit, cv2.tol

Numeric scalars that determine the minimum mean abundance for the spike-in transcripts to be used for trend fitting.

min.bio.disp

A numeric scalar specifying the minimum biological dispersion.

spike.type

A character vector containing the names of the spike-in sets to use.

...

Additional arguments to pass to technicalCV2,matrix-method.

assay

A string specifying which assay values to use.

Details

This function will estimate the squared coefficient of variation (CV2) and mean for each spike-in transcript. A mean-dependent trend is fitted to the CV2 values for the transcripts using a Gamma GLM with glmgam.fit. Only high-abundance transcripts are used for stable trend fitting. (Specifically, a mean threshold is selected by taking all transcripts with CV2 above cv2.limit, and taking the quantile of this subset at cv2.tol. A warning will be thrown and all spike-ins will be used if the subset is empty.)

The trend is used to determine the technical CV2 for each endogenous gene based on its mean. To identify highly variable genes, the null hypothesis is that the total CV2 for each gene is less than or equal to the technical CV2 plus min.bio.disp. Deviations from the null are identified using a chi-squared test. The additional min.bio.disp is necessary for a ratio-based test, as otherwise genes with large relative (but small absolute) CV2 would be favoured.

For technicalCV2,matrix-method, the rows corresponding to spike-in transcripts are specified with is.spike. These rows will be used for trend fitting, while all other rows are treated as endogenous genes. If either sf.cell or sf.spike are not specified, the estimateSizeFactorsForMatrix function is applied to compute size factors.

For technicalCV2,SCESet-method, only the named spike-in sets in spike.type will be used for trend fitting. If spike.type=NULL, all spike-in sets listed in x will be used. Size factors for the endogenous genes are automatically extracted via sizeFactors. Spike-in-specific size factors for spike.type are extracted from x, if available; otherwise they are set to the size factors for the endogenous genes. Note that the spike-in-specific factors must be the same for each set in spike.type.

If is.spike or spike.type are NA, all rows will be used for trend fitting, and (adjusted) p-values will be reported for all rows. This should be used in cases where there are no spike-ins. Here, the assumption is that most endogenous genes do not exhibit high biological variability and thus can be used to model technical variation.

Value

A data frame is returned containing one row per row of x (including both endogenous genes and spike-in transcripts). Each row contains the following information:

mean:

A numeric field, containing mean (scaled) counts for all genes and transcripts.

var:

A numeric field, containing the variances for all genes and transcripts.

cv2:

A numeric field, containing CV2 values for all genes and transcripts.

trend:

A numeric field, containing the fitted value of the trend in the CV2 values. Note that the fitted value is reported for all genes and transcripts, but the trend is only fitted using the transcripts.

p.value:

A numeric field, containing p-values for all endogenous genes (NA for rows corresponding to spike-in transcripts).

FDR:

A numeric field, containing adjusted p-values for all genes.

Author(s)

Aaron Lun, based on code from Brennecke et al. (2013)

References

Brennecke P, Anders S, Kim JK et al. (2013). Accounting for technical noise in single-cell RNA-seq experiments. Nat. Methods 10:1093-95

See Also

glmgam.fit, estimateSizeFactorsForMatrix

Examples

 1
 2
 3
 4
 5
 6
 7
 8
 9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
ngenes <- 10000
means <- 2^runif(ngenes, 6, 10)
dispersions <- 10/means + 0.2
nsamples <- 50
counts <- matrix(rnbinom(ngenes*nsamples, mu=means, size=1/dispersions), ncol=nsamples)
is.spike <- logical(ngenes)
is.spike[seq_len(500)] <- TRUE

out <- technicalCV2(counts, is.spike)
head(out)
plot(out$mean, out$cv2, log="xy")
points(out$mean, out$trend, col="red", pch=16, cex=0.5)

# Same again with an SCESet
rownames(counts) <- paste0("X", seq_len(ngenes))
colnames(counts) <- paste0("Y", seq_len(nsamples))
X <- newSCESet(countData=counts)
X <- calculateQCMetrics(X, list(Spikes=is.spike))
isSpike(X) <- "Spikes"

# Dummying up some size factors.
sizeFactors(X) <- 1
sizeFactors(X, type="Spikes") <- 1

out <- technicalCV2(X, spike.type="Spikes")
head(out)