pairwiseTTests | R Documentation |
Perform pairwise Welch t-tests between groups of cells, possibly after blocking on uninteresting factors of variation.
pairwiseTTests(x, ...)
## S4 method for signature 'ANY'
pairwiseTTests(
x,
groups,
block = NULL,
design = NULL,
restrict = NULL,
exclude = NULL,
direction = c("any", "up", "down"),
lfc = 0,
std.lfc = FALSE,
log.p = FALSE,
gene.names = NULL,
subset.row = NULL,
BPPARAM = SerialParam()
)
## S4 method for signature 'SummarizedExperiment'
pairwiseTTests(x, ..., assay.type = "logcounts")
## S4 method for signature 'SingleCellExperiment'
pairwiseTTests(x, groups = colLabels(x, onAbsence = "error"), ...)
x |
A numeric matrix-like object of normalized log-expression values, where each column corresponds to a cell and each row corresponds to an endogenous gene. Alternatively, a SummarizedExperiment or SingleCellExperiment object containing such a matrix. |
... |
For the generic, further arguments to pass to specific methods. For the SummarizedExperiment method, further arguments to pass to the ANY method. For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method. |
groups |
A vector of length equal to |
block |
A factor specifying the blocking level for each cell. |
design |
A numeric matrix containing blocking terms for uninteresting factors.
Note that these factors should not be confounded with |
restrict |
A vector specifying the levels of |
exclude |
A vector specifying the levels of |
direction |
A string specifying the direction of log-fold changes to be considered in the alternative hypothesis. |
lfc |
A positive numeric scalar specifying the log-fold change threshold to be tested against. |
std.lfc |
A logical scalar indicating whether log-fold changes should be standardized. |
log.p |
A logical scalar indicating if log-transformed p-values/FDRs should be returned. |
gene.names |
Deprecated, use |
subset.row |
See |
BPPARAM |
A BiocParallelParam object indicating whether and how parallelization should be performed across genes. |
assay.type |
A string specifying which assay values to use, usually |
This function performs t-tests to identify differentially expressed genes (DEGs) between pairs of groups of cells.
A typical aim is to use the DEGs to determine cluster identity based on expression of marker genes with known biological activity.
A list of tables is returned where each table contains per-gene statistics for a comparison between one pair of groups.
This can be examined directly or used as input to combineMarkers
for marker gene detection.
We use t-tests as they are simple, fast and perform reasonably well for single-cell data (Soneson and Robinson, 2018). However, if one of the groups contains fewer than two cells, no p-value will be reported for comparisons involving that group. A warning will also be raised about insufficient degrees of freedom (d.f.) in such cases.
When log.p=TRUE
, the log-transformed p-values and FDRs are reported using the natural base.
This is useful in cases with many cells such that reporting the p-values directly would lead to double-precision underflow.
If restrict
is specified, comparisons are only performed between pairs of groups in restrict
.
This can be used to focus on DEGs distinguishing between a subset of the groups (e.g., closely related cell subtypes).
If exclude
is specified, comparisons are not performed between groups in exclude
.
Similarly, if any entries of groups
are NA
, the corresponding cells are are ignored.
A list is returned containing statistics
and pairs
.
The statistics
element is itself a list of DataFrames.
Each DataFrame contains the statistics for a comparison between a pair of groups,
including the log-fold changes, p-values and false discovery rates.
The pairs
element is a DataFrame where each row corresponds to an entry of statistics
.
This contains the fields first
and second
,
specifying the two groups under comparison in the corresponding DataFrame in statistics
.
In each DataFrame in statistics
, the log-fold change represents the change in the first
group compared to the second
group.
Log-fold changes are reported as differences in the values of x
.
Thus, all log-fold changes have the same base as whatever was used to perform the log-transformation in x
.
If logNormCounts
was used, this would be base 2.
If direction="any"
, two-sided tests will be performed for each pairwise comparisons between groups.
Otherwise, one-sided tests in the specified direction will be used instead.
This can be used to focus on genes that are upregulated in each group of interest, which is often easier to interpret when assigning cell type to a cluster.
To interpret the setting of direction
, consider the DataFrame for group X, in which we are comparing to another group Y.
If direction="up"
, genes will only be significant in this DataFrame if they are upregulated in group X compared to Y.
If direction="down"
, genes will only be significant if they are downregulated in group X compared to Y.
The magnitude of the log-fold changes can also be tested by setting lfc
.
By default, lfc=0
meaning that we will reject the null upon detecting any differential expression.
If this is set to some other positive value, the null hypothesis will change depending on direction
:
If direction="any"
, the null hypothesis is that the true log-fold change lies inside [-lfc, lfc]
.
To be conservative, we perform one-sided tests against the boundaries of this interval, and combine them to obtain a two-sided p-value.
If direction="up"
, the null hypothesis is that the true log-fold change is less than or equal to lfc
.
A one-sided p-value is computed against the boundary of this interval.
If direction="down"
, the null hypothesis is that the true log-fold change is greater than or equal to -lfc
.
A one-sided p-value is computed against the boundary of this interval.
This is similar to the approach used in treat
and allows users to focus on genes with strong log-fold changes.
If std.lfc=TRUE
, the log-fold change for each gene is standardized by the variance.
When the Welch t-test is being used, this is equivalent to Cohen's d.
Standardized log-fold changes may be more appealing for visualization as it avoids large fold changes due to large variance.
The choice of std.lfc
does not affect the calculation of the p-values.
If block
is specified, Welch t-tests are performed between groups within each level of block
.
For each pair of groups, the p-values for each gene across all levels of block
are combined using Stouffer's weighted Z-score method.
The reported log-fold change for each gene is also a weighted average of log-fold changes across levels.
The weight for a particular level is defined as (1/N_x + 1/N_y)^{-1}
,
where Nx
and Ny
are the number of cells in groups X and Y, respectively, for that level.
This is inversely proportional to the expected variance of the log-fold change, provided that all groups and blocking levels have the same variance.
When comparing two groups, blocking levels are ignored if no p-value was reported, e.g., if there were insufficient cells for a group in a particular level.
This includes levels that contain fewer than two cells for either group, as this cannot yield a p-value from the Welch t-test.
If all levels are ignored in this manner, the entire comparison will only contain NA
p-values and a warning will be emitted.
If design
is specified, a linear model is instead fitted to the expression profile for each gene.
This linear model will include the groups
as well as any blocking factors in design
.
A t-test is then performed to identify DEGs between pairs of groups, using the values of the relevant coefficients and the gene-wise residual variance.
Note that design
must be full rank when combined with the groups
terms,
i.e., there should not be any confounding variables.
We make an exception for the common situation where design
contains an "(Intercept)"
column,
which is automatically detected and removed (emitting a warning along the way).
We recommend using block
instead of design
for uninteresting categorical factors of variation.
The former accommodates differences in the variance of expression in each group via Welch's t-test.
As a result, it is more robust to misspecification of the groups, as misspecified groups (and inflated variances) do not affect the inferences for other groups.
Use of block
also avoids assuming additivity of effects between the blocking factors and the groups.
Nonetheless, use of design
is unavoidable when blocking on real-valued covariates.
It is also useful for ensuring that log-fold changes/p-values are computed for comparisons between all pairs of groups
(assuming that design
is not confounded with the groups).
This may not be the case with block
if a pair of groups never co-occur in a single blocking level.
Aaron Lun
Whitlock MC (2005). Combining probability from independent tests: the weighted Z-method is superior to Fisher's approach. J. Evol. Biol. 18, 5:1368-73.
Soneson C and Robinson MD (2018). Bias, robustness and scalability in single-cell differential expression analysis. Nat. Methods
Lun ATL (2018). Comments on marker detection in scran. https://ltla.github.io/SingleCellThoughts/software/marker_detection/comments.html
t.test
, on which this function is based.
combineMarkers
, to combine pairwise comparisons into a single DataFrame per group.
getTopMarkers
, to obtain the top markers from each pairwise comparison.
library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)
# Any clustering method is okay.
kout <- kmeans(t(logcounts(sce)), centers=3)
# Vanilla application:
out <- pairwiseTTests(logcounts(sce), groups=kout$cluster)
out
# Directional with log-fold change threshold:
out <- pairwiseTTests(logcounts(sce), groups=kout$cluster,
direction="up", lfc=0.2)
out
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.