Fit a mean-dependent trend to the gene-specific variances in single-cell RNA-seq data.

1 2 3 4 5 6 7 8 |

`x` |
A numeric matrix of normalized expression values, where each column corresponds to a cell and each row corresponds to a spike-in transcript. Alternatively, a SCESet object that contains such values. |

`trend` |
A string indicating whether the trend should be polynomial or loess-based. |

`span, family, degree` |
Arguments to pass to |

`start` |
A list containing starting values for |

`design` |
A numeric matrix describing the systematic factors contributing to expression in each cell. |

`subset.row` |
A logical, integer or character scalar indicating the rows of |

`...` |
Additional arguments to pass to |

`assay` |
A string specifying which assay values to use, e.g., |

`use.spikes` |
A logical scalar specifying whether the trend should be fitted to variances for spike-in transcripts or endogenous genes. |

The strategy is to fit an abundance-dependent trend to the variance of the log-normalized expression for the spike-in transcripts, using `trendVar`

.
For SCESet objects, these expression values can be computed by `normalize`

after setting the size factors, e.g., with `computeSpikeFactors`

.
Log-transformed values are used as these are more robust to genes/transcripts with strong expression in only one or two outlier cells.

The mean and variance of the log-CPMs is calculated for each spike-in transcript, and a trend is fitted to the variance against the mean for all transcripts.
The fitted value of this trend represents technical variability due to sequencing, drop-outs during capture, etc.
Variance decomposition to biological and technical components for endogenous genes can then be performed later with `decomposeVar`

.

The design matrix can be set if there are factors that should be blocked, e.g., batch effects, known (and uninteresting) clusters. Otherwise, it will default to an all-ones matrix, effectively treating all cells as part of the same group.

A named list is returned, containing:

`mean`

:A numeric vector of mean log-CPMs for all spike-in transcripts.

`var`

:A numeric vector of the variances of log-CPMs for all spike-in transcripts.

`trend`

:A function that returns the fitted value of the trend at any mean log-CPM.

`design`

:A numeric matrix, containing the design matrix that was used.

By default, a robust loess curve is used for trend fitting via `loess`

.
This protects against genes with very large or very small variances.
Some experimentation with `span`

, `degree`

or `family`

may be required to obtain satisfactory results.
The fit is also dependent on the quality of the spike-ins – the fit will obviously be poor if the coverage of all spike-ins is low.

Alternatively, when `trend="semiloess"`

, a non-linear curve of the form

*y = ax/(x^n + b)*

is fitted to the variances against the means using `nls`

, and a loess curve is then fitted to the log-ratios of the observed to fitted values.
The parametric curve reduces the sharpness of the trend for easier loess fitting.
Conversely, the parametric form is not exact, so the loess curve models any remaining trends in the residuals.

In general, the semi-loess setting tends to give smoother curves than loess alone.
It is more robust to the uneven distribution of spike-in transcripts across the covariate range.
However, it tends to be susceptible to convergence issues, and may require some fiddling with the start values to converge properly.
By default, the start values are `a=5`

, `n=5`

and `b=1`

, which can be altered as named arguments in `start`

.

Spike-in transcripts can be selected in `trendVar,SCESet-method`

using the `use.spikes`

method.
By default, `use.spikes=TRUE`

which means that only rows labelled as spike-ins with `isSpike(x)`

will be used.

When spike-ins are not available, `trendVar`

can also be applied directly to the counts for endogenous genes by setting `use.spikes=FALSE`

(or by manually supplying a matrix of normalized expression for endogenous genes, for `trendVar,matrix-method`

).
This assumes that most genes exhibit technical variation and little biological variation, e.g., in a homogeneous population.

If `use.spikes=NA`

, every row will be used for trend fitting, regardless of whether it corresponds to a spike-in transcript or endogenous gene.
Users can also directly specify which rows to use with `subset.row`

.
This will override any setting of `use.spikes`

.

Aaron Lun

`nls`

,
`loess`

,
`decomposeVar`

,
`computeSpikeFactors`

,
`computeSumFactors`

,
`normalize`

1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35 | ```
set.seed(100)
nspikes <- ncells <- 200
spike.means <- 2^runif(nspikes, 3, 8)
spike.disp <- 100/spike.means + 0.5
spike.data <- matrix(rnbinom(nspikes*ncells, mu=spike.means, size=1/spike.disp), ncol=ncells)
ngenes <- 10000
cell.means <- 2^runif(ngenes, 2, 10)
cell.disp <- 100/cell.means + 0.5
cell.data <- matrix(rnbinom(ngenes*ncells, mu=cell.means, size=1/cell.disp), ncol=ncells)
combined <- rbind(cell.data, spike.data)
colnames(combined) <- seq_len(ncells)
rownames(combined) <- seq_len(nrow(combined))
y <- newSCESet(countData=combined)
y <- calculateQCMetrics(y, list(Spike=rep(c(FALSE, TRUE), c(ngenes, nspikes))))
isSpike(y) <- "Spike"
# Normalizing.
y <- computeSumFactors(y)
y <- computeSpikeFactors(y, general.use=FALSE)
y <- normalize(y)
# Fitting a trend to the spike-ins.
fit <- trendVar(y)
plot(fit$mean, fit$var)
x <- sort(fit$mean)
lines(x, fit$trend(x), col="red", lwd=2)
# Fitting a trend to the endogenous genes.
fit <- trendVar(y, use.spikes=FALSE)
plot(fit$mean, fit$var)
x <- sort(fit$mean)
lines(x, fit$trend(x), col="red", lwd=2)
``` |

Questions? Problems? Suggestions? Tweet to @rdrrHQ or email at ian@mutexlabs.com.

Please suggest features or report bugs with the GitHub issue tracker.

All documentation is copyright its authors; we didn't write any of that.