scoreMarkers: Score marker genes

scoreMarkersR Documentation

Score marker genes

Description

Compute various summary scores for potential marker genes to distinguish between groups of cells.

Usage

scoreMarkers(x, ...)

## S4 method for signature 'ANY'
scoreMarkers(
  x,
  groups,
  block = NULL,
  pairings = NULL,
  lfc = 0,
  row.data = NULL,
  full.stats = FALSE,
  subset.row = NULL,
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
scoreMarkers(x, groups, ..., assay.type = "logcounts")

## S4 method for signature 'SingleCellExperiment'
scoreMarkers(x, groups = colLabels(x, onAbsence = "error"), ...)

Arguments

x

A matrix-like object containing log-normalized expression values, with genes in rows and cells in columns. Alternatively, a SummarizedExperiment object containing such a matrix in its assays.

...

For the generic, further arguments to pass to individual methods.

For the SummarizedExperiment method, further arguments to pass to the ANY method.

For the SingleCellExperiment method, further arguments to pass to the SummarizedExperiment method.

groups

A factor or vector containing the identity of the group for each cell in x.

block

A factor or vector specifying the blocking level for each cell in x.

pairings

A vector, list or matrix specifying how the comparisons are to be performed, see details.

lfc

Numeric scalar specifying the log-fold change threshold to compute effect sizes against.

row.data

A DataFrame with the same number and names of rows in x, containing extra information to insert into each DataFrame.

full.stats

Logical scalar indicating whether the statistics from the pairwise comparisons should be directly returned.

subset.row

See ?"scran-gene-selection".

BPPARAM

A BiocParallelParam object specifying how the calculations should be parallelized.

assay.type

String or integer scalar specifying the assay containing the log-expression matrix to use.

Details

Compared to findMarkers, this function represents a simpler and more intuitive summary of the differences between the groups. We do this by realizing that the p-values for these types of comparisons are largely meaningless; individual cells are not meaningful units of experimental replication, while the groups themselves are defined from the data. Thus, by discarding the p-values, we can simplify our marker selection by focusing only on the effect sizes between groups.

Here, the strategy is to perform pairwise comparisons between each pair of groups to obtain various effect sizes. For each group X, we summarize the effect sizes across all pairwise comparisons involving that group, e.g., mean, min, max and so on. This yields a DataFrame for each group where each column contains a different summarized effect and each row corresponds to a gene in x. Reordering the rows by the summary of choice can yield a ranking of potential marker genes for downstream analyses.

Value

A List of DataFrames containing marker scores for each gene in each group. Each DataFrame corresponds to a group and each row corresponds to a gene in x. See Details for information about the individual columns.

Choice of effect sizes

The logFC.cohen columns contain the standardized log-fold change, i.e., Cohen's d. For each pairwise comparison, this is defined as the difference in the mean log-expression for each group scaled by the average standard deviation across the two groups. (Technically, we should use the pooled variance; however, this introduces some unpleasant asymmetry depending on the variance of the larger group, so we take a simple average instead.) Cohen's d is analogous to the t-statistic in a two-sample t-test and avoids spuriously large effect sizes from comparisons between highly variable groups. We can also interpret Cohen's d as the number of standard deviations between the two group means.

The AUC columns contain the area under the curve. This is the probability that a randomly chosen observation in one group is greater than a randomly chosen observation in the other group. The AUC is closely related to the U-statistic used in the Wilcoxon rank sum test. Values greater than 0.5 indicate that a gene is upregulated in the first group.

The key difference between the AUC and Cohen's d is that the former is less sensitive to the variance within each group. The clearest example is that of two distributions that exhibit no overlap, where the AUC is the same regardless of the variance of each distribution. This may or may not be desirable as it improves robustness to outliers but reduces the information available to obtain a highly resolved ranking. The most appropriate choice of effect size is left at the user's discretion.

Finally, the logFC.detected columns contain the log-fold change in the proportion of cells with detected (i.e., non-zero) expression between groups. This is specifically useful for detecting binary expression patterns, e.g., activation of an otherwise silent gene. Note that the non-zero status of the data is not altered by normalization, so differences in library size will not be removed when computing this metric. This effect is not necessarily problematic - users can interpret it as retaining information about the total RNA content, analogous to spike-in normalization.

Setting a log-fold change threshold

The default settings may yield highly ranked genes with large effect sizes but low log-fold changes if the variance is low (Cohen's d) or separation is clear (AUC). Such genes may not be particularly interesting as the actual change in expression is modest. Setting lfc allows us to focus on genes with large log-fold changes between groups, by simply shifting the “other” group's expression values by lfc before computing effect sizes.

When lfc is not zero, Cohen's d is generalized to the standardized difference between the observed log-fold change and lfc. For example, if we had lfc=2 and we obtained a Cohen's d of 3, this means that the observed log-fold change was 3 standard deviations above a value of 2. A side effect is that we can only unambiguously interpret the direction of Cohen's d when it has the same sign as lfc. Our above example represents upregulation, but if our Cohen's d was negative, this could either mean downregulation or simply that our observed log-fold change was less than lfc.

When lfc is not zero, the AUC is generalized to the probability of obtaining a random observation in one group that is greater than a random observation plus lfc in the other group. For example, if we had lfc=2 and we obtained an AUC of 0.8, this means that we would observe a difference of lfc or greater between the random observations. Again, we can only unambiguously interpret the direction of the change when it is the same as the sign of the lfc. In this case, an AUC above 0.5 with a positive lfc represents upregulation, but an AUC below 0.5 could mean either downregulation or a log-fold change less than lfc.

A non-zero setting of lfc has no effect on the log-fold change in the proportion of cells with detected expression.

Computing effect size summaries

To simplify interpretation, we summarize the effect sizes across all pairwise comparisons into a few key metrics. For each group X, we consider the effect sizes from all pairwise comparisons between X and other groups. We then compute the following values:

  • mean.*, the mean effect size across all pairwise comparisons involving X. A large value (>0 for log-fold changes, >0.5 for the AUCs) indicates that the gene is upregulated in X compared to the average of the other groups. A small value (<0 for the log-fold changes, <0.5 for the AUCs) indicates that the gene is downregulated in X instead.

  • median.*, the median effect size across all pairwise comparisons involving X. A large value indicates that the gene is upregulated in X compared to most (>50%) other groups. A small value indicates that the gene is downregulated in X instead.

  • min.*, the minimum effect size across all pairwise comparisons involving X. A large value indicates that the gene is upregulated in X compared to all other groups. A small value indicates that the gene is downregulated in X compared to at least one other group.

  • max.*, the maximum effect size across all pairwise comparisons involving X. A large value indicates that the gene is upregulated in X compared to at least one other group. A small value indicates that the gene is downregulated in X compared to all other groups.

  • rank.*, the minimum rank (i.e., “min-rank”) across all pairwise comparisons involving X - see ?computeMinRank for details. A small min-rank indicates that the gene is one of the top upregulated genes in at least one comparison to another group.

One set of these columns is added to the DataFrame for each effect size described above. For example, the mean column for the AUC would be mean.AUC. We can then reorder each group's DataFrame by our column of choice, depending on which summary and effect size we are interested in. For example, if we ranked by decreasing min.logFC.detected, we would be aiming for marker genes that exhibit strong binary increases in expression in X compared to all other groups.

If full.stats=TRUE, an extra full.* column is returned in the DataFrame. This contains a nested DataFrame with number of columns equal to the number of other groups. Each column contains the statistic from the comparison between X and the other group.

Keep in mind that the interpretations above also depend on the sign of lfc. The concept of a “large” summary statistic (>0 for Cohen's d, >0.5 for the AUCs) can only be interpreted as upregulation when lfc >= 0. Similarly, the concept of a “small” value (<0 for Cohen's d, <0.5 for the AUCs) cannot be interpreted as downregulation when lfc <= 0. For example, if lfc=1, a positive min.logFC.cohen can still be interpreted as upregulation in X compared to all other groups, but a negative max.logFC.cohen could not be interpreted as downregulation in X compared to all other groups.

Computing other descriptive statistics

We report the mean log-expression of all cells in X, as well as the grand mean of mean log-expression values for all other groups. This is purely descriptive; while it can be used to compute an overall log-fold change, ranking is best performed on one of the effect sizes described above. We also report the proportion of cells with detectable expression in X and the mean proportion for all other groups.

When block is specified, the reported mean for each group is computed via correctGroupSummary. Briefly, this involves fitting a linear model to remove the effect of the blocking factor from the per-group mean log-expression. The same is done for the detected proportion, except that the values are subjected to a logit transformation prior to the model fitting. In both cases, each group/block combination is weighted by its number of cells in the model.

Controlling the pairings

The pairings argument specifies the pairs of groups that should be compared. This can be:

  • NULL, in which case comparisons are performed between all groups in groups.

  • A vector of the same type as group, specifying a subset of groups of interest. We then perform all pairwise comparisons between groups in the subset.

  • A list of two vectors, each of the same type as group and specifying a subset of groups. Comparisons are performed between one group from the first vector and another group from the second vector.

  • A matrix of two columns of the same type as group. Each row is assumed to specify a pair of groups to be compared.

Effect sizes (and their summaries) are computed for only the pairwise comparisons specified by pairings. Similarly, the other.* values in X's DataFrame are computed using only the groups involved in pairwise comparisons with X. The default of pairings=NULL ensures that all groups are used and effect sizes for all pairwise comparisons are reported; however, this may not be the case for other values of pairings.

For list and matrix arguments, the first vector/column is treated as the first group in the effect size calculations. Statistics for each comparison will only show up in the DataFrame for the first group, i.e., a comparison between X and Y will have a valid full.AUC$Y field in X's DataFrame but not vice versa. If both directions are desired in the output, both of the corresponding permutations should be explicitly specified in pairings.

Author(s)

Aaron Lun

Examples

library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Any clustering method is okay, only using k-means for convenience.
kout <- kmeans(t(logcounts(sce)), centers=4) 

out <- scoreMarkers(sce, groups=kout$cluster)
out

# Ranking by a metric of choice:
of.interest <- out[[1]]
of.interest[order(of.interest$mean.AUC, decreasing=TRUE),1:4]
of.interest[order(of.interest$rank.AUC),1:4]
of.interest[order(of.interest$median.logFC.cohen, decreasing=TRUE),1:4]
of.interest[order(of.interest$min.logFC.detected, decreasing=TRUE),1:4]


MarioniLab/scran documentation built on March 7, 2024, 1:45 p.m.