correlatePairs: Test for significant correlations

correlatePairsR Documentation

Test for significant correlations

Description

Identify pairs of genes that are significantly correlated in their expression profiles, based on Spearman's rank correlation.

Usage

correlatePairs(x, ...)

## S4 method for signature 'ANY'
correlatePairs(
  x,
  null.dist = NULL,
  ties.method = NULL,
  iters = NULL,
  block = NULL,
  design = NULL,
  equiweight = TRUE,
  use.names = TRUE,
  subset.row = NULL,
  pairings = NULL,
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
correlatePairs(x, ..., assay.type = "logcounts")

Arguments

x

A numeric matrix-like object of log-normalized expression values, where rows are genes and columns are cells. Alternatively, a SummarizedExperiment object containing such a matrix.

...

For the generic, additional arguments to pass to specific methods.

For the SummarizedExperiment method, additional methods to pass to the ANY method.

null.dist, ties.method, iters

Deprecated arguments, ignored.

block

A factor specifying the blocking level for each cell in x. If specified, correlations are computed separately within each block and statistics are combined across blocks.

design

A numeric design matrix containing uninteresting factors to be ignored.

equiweight

A logical scalar indicating whether statistics from each block should be given equal weight. Otherwise, each block is weighted according to its number of cells. Only used if block is specified.

use.names

A logical scalar specifying whether the row names of x should be used in the output. Alternatively, a character vector containing the names to use.

subset.row

See ?"scran-gene-selection".

pairings

A NULL value indicating that all pairwise correlations should be computed; or a list of 2 vectors of genes between which correlations are to be computed; or a integer/character matrix with 2 columns of specific gene pairs - see below for details.

BPPARAM

A BiocParallelParam object that specifies the manner of parallel processing to use.

assay.type

A string specifying which assay values to use.

Details

The correlatePairs function identifies significant correlations between all pairs of genes in x. This allows prioritization of genes that are driving systematic substructure in the data set. By definition, such genes should be correlated as they are behaving in the same manner across cells. In contrast, genes driven by random noise should not exhibit any correlations with other genes.

We use Spearman's rho to quantify correlations robustly based on ranked gene expression. To identify correlated gene pairs, the significance of non-zero correlations is assessed using rhoToPValue. The null hypothesis is that the ranking of normalized expression across cells should be independent between genes. Correction for multiple testing is done using the BH method.

For the SingleCellExperiment method, normalized expression values should be specified by assay.type.

Value

A DataFrame is returned with one row per gene pair and the following fields:

gene1, gene2:

Character or integer fields specifying the genes in the pair. If use.names=FALSE, integers are returned representing row indices of x, otherwise gene names are returned.

rho:

A numeric field containing the approximate Spearman's rho.

p.value, FDR:

Numeric fields containing the approximate p-value and its BH-corrected equivalent.

Rows are sorted by increasing p.value and, if tied, decreasing absolute size of rho. The exception is if subset.row is a matrix, in which case each row in the dataframe correspond to a row of subset.row.

Accounting for uninteresting variation

If the experiment has known (and uninteresting) factors of variation, these can be included in design or block. correlatePairs will then attempt to ensure that these factors do not drive strong correlations between genes. Examples might be to block on batch effects or cell cycle phase, which may have substantial but uninteresting effects on expression.

The approach used to remove these factors depends on whether design or block is used. If there is only one factor, e.g., for plate or animal of origin, block should be used. Each level of the factor is defined as a separate group of cells. For each pair of genes, correlations are computed within each group, and a mean of the correlations is taken across all groups. If equiweight=FALSE, a weighted mean is computed based on the size of each group.

Similarly, parallelStouffer is used to combine the (one-sided) p-values across all groups. This is done for each direction and a final p-value is computed for each gene pair using this Bonferri method. The idea is to ensure that the final p-value is only low when correlations are in the same direction across groups. If equiweight=FALSE, each p-value is weighted by the size of the corresponding group.

For experiments containing multiple factors or covariates, a design matrix should be passed into design. The correlation between each pair of genes is then computed from the residuals of the fitted model. However, we recommend using block wherever possible as design assumes normality of errors and deals poorly with ties. Specifically, zero counts within or across groups may no longer be tied when converted to residuals, potentially resulting in spuriously large correlations.

If any level of block has fewer than 3 cells, it is ignored. If all levels of block have fewer than 3 cells, all output statistics are set to NA. Similarly, if design has fewer than 3 residual d.f., all output statistics are set to NA.

Gene selection

The pairings argument specifies the pairs of genes that should be used to compute correlations. This can be:

  • NULL, in which case correlations will be computed between all pairs of genes in x. Genes that occur earlier in x are labelled as gene1 in the output DataFrame. Redundant permutations are not reported.

  • A list of two vectors, where each list element defines a subset of genes in x as an integer, character or logical vector. In this case, correlations will be computed between one gene in the first vector and another gene in the second vector. This improves efficiency if the only correlations of interest are those between two pre-defined sets of genes. Genes in the first vector are always reported as gene1.

  • An integer/character matrix of two columns. In this case, each row is assumed to specify a gene pair based on the row indices (integer) or row names (character) of x. Correlations will then be computed for only those gene pairs, and the returned dataframe will not be sorted by p-value. Genes in the first column of the matrix are always reported as gene1.

If subset.row is not NULL, only the genes in the selected subset are used to compute correlations - see ?"scran-gene-selection". This will interact properly with pairings, such that genes in pairings and not in subset.row will be ignored.

We recommend setting subset.row and/or pairings to contain only the subset of genes of interest. This reduces computational time and memory usage by only computing statistics for the gene pairs of interest. For example, we could select only HVGs to focus on genes contributing to cell-to-cell heterogeneity (and thus more likely to be involved in driving substructure). There is no need to account for HVG pre-selection in multiple testing, because rank correlations are unaffected by the variance.

Author(s)

Aaron Lun

References

Lun ATL (2019). Some thoughts on testing for correlations. https://ltla.github.io/SingleCellThoughts/software/correlations/corsim.html

See Also

Compare to cor for the standard Spearman's calculation.

Use correlateGenes to get per-gene correlation statistics.

Examples

library(scuttle)
sce <- mockSCE()
sce <- logNormCounts(sce)

# Basic pairwise application.
out <- correlatePairs(sce, subset.row=1:100)
head(out)

# Computing between specific subsets of genes:
out <- correlatePairs(sce, pairings=list(1:10, 110:120))
head(out)

# Computing between specific pairs:
out <- correlatePairs(sce, pairings=rbind(c(1,10), c(2, 50)))
head(out)


MarioniLab/scran documentation built on Sept. 7, 2024, 6:25 a.m.