pseudoBulkSpecific: Label-specific pseudo-bulk DE
In scran: Methods for Single-Cell RNA-Seq Data Analysis

Description Usage Arguments Details Value Computing the average Author(s) See Also Examples

Detect label-specific DE genes in a pseudo-bulk analysis, by testing whether the log-fold change is more extreme than the average log-fold change of other labels.

pseudoBulkSpecific(x, ...)

## S4 method for signature 'ANY'
pseudoBulkSpecific(
  x,
  label,
  condition = NULL,
  ...,
  method = c("edgeR", "voom"),
  sorted = FALSE,
  average = c("median", "mean"),
  missing.as.zero = FALSE,
  reference = NULL
)

## S4 method for signature 'SummarizedExperiment'
pseudoBulkSpecific(x, ..., assay.type = 1)

`x`	A numeric matrix of counts where rows are genes and columns are pseudo-bulk profiles. Alternatively, a SummarizedExperiment object containing such a matrix in its assays.
`...`	For the generic, further arguments to pass to individual methods. For the ANY method, further arguments to pass to `pseudoBulkDGE`. For the SummarizedExperiment method, further arguments to pass to the ANY method.
`label`	A vector or factor of length equal to `ncol(x)`, specifying the cluster or cell type assignment for each column of `x`.
`condition`	A vector or factor of length equal to `ncol(x)`, specifying the experimental condition for each column of `x`. Only used for abundance-based filtering of genes.
`method`	String specifying the DE analysis framework to use.
`sorted`	Logical scalar indicating whether the output tables should be sorted by p-value.
`average`	String specifying the method to compute the average log-fold change of all other labels.
`missing.as.zero`	Logical scalar indicating whether missing log-fold changes should be set to zero.
`reference`	A List containing the (unsorted) output of `pseudoBulkDGE`. This can be supplied to avoid redundant calculations but is automatically computed if `NULL`.
`assay.type`	String or integer scalar specifying the assay to use from `x`.

This function implements a quick and dirty method for detecting label-specific DE genes. For a given label and gene, the null hypothesis is that the log-fold change lies between zero and the average log-fold change for that gene across all other labels. Genes that reject this null have log-fold changes that are either of the opposite sign to the average or significantly more extreme than it.

To implement this, we test each gene against the two extremes and take the larger of the two p-values. The p-value is set to 1 if the log-fold change lies between the extremes. This is somewhat similar to how treat might behave if the null interval was not centered at zero; however, our approach is more conservative than treat as the p-value calculations are not quite correct.
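The decision rule above can be sketched in base R. This is only an illustration under a normal approximation with a known standard error per gene; the p-values in pseudoBulkSpecific come from the edgeR or voom machinery instead, and the helper intervalNullP and its arguments are hypothetical names introduced here.

```r
# Hypothetical helper: p-value for the null that the true log-fold change
# lies between zero and the average of the other labels.
# Assumes a normal approximation with standard error 'se' (illustration only).
intervalNullP <- function(lfc, se, other.avg) {
    # Two-sided tests against each extreme of the null interval.
    p.zero <- 2 * pnorm(-abs(lfc / se))
    p.avg <- 2 * pnorm(-abs((lfc - other.avg) / se))

    # Take the larger (more conservative) of the two p-values.
    p <- pmax(p.zero, p.avg)

    # Log-fold changes inside the interval cannot reject the null.
    inside <- lfc >= pmin(0, other.avg) & lfc <= pmax(0, other.avg)
    p[inside] <- 1
    p
}

lfc <- c(1, 3, -1)         # observed log-fold changes for three genes
other.avg <- c(2, 2, 2)    # average log-fold change across other labels
p <- intervalNullP(lfc, se=0.1, other.avg=other.avg)
```

Here the first gene lies inside the interval and gets a p-value of 1, the second is more extreme than the average, and the third has the opposite sign; in the latter two cases the reported p-value is the one from the nearer extreme.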

It is worth stressing that there are no guarantees that the DE genes detected in this manner are truly label-specific. For any label and DEG, there may be one or more labels with stronger log-fold changes, but the average may be pulled towards zero by other labels with weaker or opposing effects. The use of the average is analogous to recommendations in the edgeR user's guide for testing against multiple groups. However, a more stringent selection can be achieved by applying gates on decideTestsPerLabel.
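One possible gate is to keep only genes that are called DE in the label of interest and in no other label. The sketch below uses a made-up integer matrix in the shape that decideTestsPerLabel is understood to return (genes in rows, labels in columns; 1, -1 and 0 for up, down and not significant, NA for genes that were filtered out). The helper uniqueToLabel is hypothetical and is not part of scran.

```r
# Made-up decisions: rows are genes, columns are labels.
decisions <- rbind(
    GeneA=c(A=1L, B=0L, C=0L),    # DE in label A only
    GeneB=c(A=1L, B=1L, C=1L),    # DE everywhere, not label-specific
    GeneC=c(A=-1L, B=NA, C=0L),   # DE in A; filtered out in B
    GeneD=c(A=0L, B=0L, C=1L)     # DE in label C only
)

# Hypothetical gate: significant in 'label', not significant elsewhere.
# Filtered genes (NA) in other labels are treated as not significant.
uniqueToLabel <- function(decisions, label) {
    is.de <- !is.na(decisions) & decisions != 0L
    others <- setdiff(colnames(decisions), label)
    is.de[, label] & rowSums(is.de[, others, drop=FALSE]) == 0
}

specific.A <- uniqueToLabel(decisions, "A")
```

Treating NA as "not significant" is a design choice of this sketch; a stricter gate could instead discard any gene that was filtered out in another label.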

Note that, if lfc is specified in the arguments to pseudoBulkDGE, the null interval is expanded in both directions by the specified value.
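As a rough sketch of what this expansion means, the bounds of the null interval can be written out for a few made-up averages. The helper nullInterval is hypothetical and only illustrates the arithmetic; it is not a function in scran.

```r
# Hypothetical helper: bounds of the null interval for each gene, given the
# average log-fold change of the other labels and an 'lfc' threshold.
nullInterval <- function(other.avg, lfc=0) {
    cbind(
        lower=pmin(0, other.avg) - lfc,   # extend below the smaller extreme
        upper=pmax(0, other.avg) + lfc    # extend above the larger extreme
    )
}

# Averages of 2, -1.5 and 0 with a threshold of 0.5.
bounds <- nullInterval(c(2, -1.5, 0), lfc=0.5)
```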

A List of DataFrames where each DataFrame contains DE statistics for one label. This is equivalent to the output of pseudoBulkDGE; if reference is supplied, most of the statistics will be identical to those reported there.

The main differences are that the p-values and FDR values are replaced by those from the label-specific test. Each DataFrame also has an OtherAverage field containing the average log-fold change across all other labels.

The average log-fold change for each gene is computed by taking the median or mean (depending on average) of the corresponding log-fold changes in each of the DE analyses for the other labels. We use the median by default as this means that at least half of all other labels should have weaker or opposing effects.
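The leave-one-label-out average described above can be sketched in base R on a made-up matrix of log-fold changes. The helper otherAverage is a hypothetical name, and the sketch ignores filtered genes for simplicity.

```r
# Made-up log-fold changes: rows are genes, columns are labels.
lfc <- cbind(
    A=c(Gene1=3.0, Gene2=0.5),
    B=c(Gene1=1.0, Gene2=0.4),
    C=c(Gene1=0.2, Gene2=0.6),
    D=c(Gene1=-0.5, Gene2=5.0)
)

# Hypothetical helper: for each label, average each gene over the other labels.
otherAverage <- function(lfc, average=c("median", "mean")) {
    average <- match.arg(average)
    FUN <- switch(average, median=median, mean=mean)
    out <- lfc
    for (j in seq_len(ncol(lfc))) {
        # Drop column j and summarize the remaining labels gene by gene.
        out[, j] <- apply(lfc[, -j, drop=FALSE], 1, FUN)
    }
    out
}

by.median <- otherAverage(lfc, "median")
by.mean <- otherAverage(lfc, "mean")
```

For Gene2 in label A, the single large log-fold change in label D pulls the mean of the other labels far above the median, which is why the two choices of average can lead to different calls.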

By default, low-abundance genes that were filtered out in a comparison do not contribute to the average. Any log-fold changes that could be computed from them are considered to be too unstable. If the gene is filtered out in all other labels, the average is set to zero for testing but is reported as NA.

If missing.as.zero=TRUE, the log-fold changes for all filtered genes are set to zero. This is useful when a gene is only expressed in a subset of labels and is consistently DE in each comparison of the subset. Testing against the average computed from only those labels in the subset would fail to detect this DEG as subset-specific.
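The treatment of filtered genes can be sketched for a single gene, using NA to mark labels in which the gene was filtered out. The helper averageOthers is hypothetical, uses the median only, and is meant as an illustration of the rules above, not as the code used by pseudoBulkSpecific.

```r
# Hypothetical helper: average the log-fold changes of the other labels,
# returning both the reported value and the value used for testing.
averageOthers <- function(other.lfc, missing.as.zero=FALSE) {
    if (missing.as.zero) {
        # Filtered comparisons contribute a log-fold change of zero.
        other.lfc[is.na(other.lfc)] <- 0
    }
    keep <- other.lfc[!is.na(other.lfc)]
    reported <- if (length(keep)) median(keep) else NA_real_
    # For testing, an undefined average falls back to zero.
    tested <- if (is.na(reported)) 0 else reported
    c(reported=reported, tested=tested)
}

# A gene expressed (and DE) in labels L2 and L3 only.
subset.gene <- c(L2=2.1, L3=1.9, L4=NA, L5=NA, L6=NA)
by.default <- averageOthers(subset.gene)
as.zero <- averageOthers(subset.gene, missing.as.zero=TRUE)

# A gene filtered out in every other label.
absent <- averageOthers(c(L2=NA_real_, L3=NA_real_))
```

With the default, a label with a log-fold change near 2 for this gene would be tested against an average of about 2 and is unlikely to be detected; with missing.as.zero=TRUE the average drops to zero, so the same log-fold change can be called.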

Aaron Lun

pseudoBulkDGE, for the underlying function that does all the heavy lifting.

set.seed(10000)
library(scran)
library(scuttle)
sce <- mockSCE(ncells=1000)
sce$samples <- gl(8, 125) # Pretending we have 8 samples.

# Making up some clusters.
sce <- logNormCounts(sce)
clusters <- kmeans(t(logcounts(sce)), centers=3)$cluster

# Creating a set of pseudo-bulk profiles:
info <- DataFrame(sample=sce$samples, cluster=clusters)
pseudo <- sumCountsAcrossCells(sce, info)

# Making up an experimental design for our 8 samples
# and adding a common DEG across all labels.
pseudo$DRUG <- gl(2,4)[pseudo$sample]
assay(pseudo)[1,pseudo$DRUG==1] <- assay(pseudo)[1,pseudo$DRUG==1] * 10 

# Label-specific analysis (note behavior of the first gene).
out <- pseudoBulkSpecific(pseudo, 
   label=pseudo$cluster,
   condition=pseudo$DRUG,
   design=~DRUG,
   coef="DRUG2"
)

out[[1]]