findTopCorrelations: Find top correlations between features
In LTLA/mumosa: Multi-Modal Single-Cell Analysis Methods

findTopCorrelations

R Documentation

Description

For each feature, find the subset of other features in the same or another modality that have the strongest positive/negative Spearman's rank correlations in a pair of normalized expression matrices.

Usage

findTopCorrelations(x, number, ...)

## S4 method for signature 'ANY'
findTopCorrelations(
  x,
  number = 10,
  y = NULL,
  d = 50,
  direction = c("both", "positive", "negative"),
  subset.cols = NULL,
  block = NULL,
  equiweight = TRUE,
  use.names = TRUE,
  deferred = TRUE,
  BSPARAM = IrlbaParam(),
  BNPARAM = KmknnParam(),
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
findTopCorrelations(
  x,
  number,
  y = NULL,
  use.names = TRUE,
  ...,
  assay.type = "logcounts"
)

Arguments

`x`, `y`	Normalized expression matrices containing features in the rows and cells in the columns. Each matrix should have the same set of columns but a different set of features, usually corresponding to different modes for the same cells. Alternatively, SummarizedExperiment objects containing such a matrix. Finally, `y` may be `NULL`, in which case correlations are computed between features in `x`.
`number`	Integer scalar specifying the number of top correlated features to report for each feature in `x`.
`...`	For the generic, further arguments to pass to specific methods. For the SummarizedExperiment method, further arguments to pass to the ANY method.
`d`	Integer scalar specifying the number of dimensions to use for the approximate search via PCA. If `NA`, no approximation of the rank values is performed prior to the search.
`direction`	String specifying the sign of the correlations to search for.
`subset.cols`	Vector indicating the columns of `x` (and `y`) to retain for computing correlations.
`block`	A vector or factor of length equal to the number of cells, specifying the block of origin for each cell.
`equiweight`	Logical scalar indicating whether each block should be given equal weight, if `block` is specified. If `FALSE`, each block is weighted by the number of cells.
`use.names`	Logical scalar specifying whether row names of `x` and/or `y` should be reported in the output, if available. For the SummarizedExperiment method, this may also be a string specifying the `rowData` column containing the names to use; or a character vector of length 2, where the first and second entries specify the `rowData` columns containing the names in `x` and `y` respectively. If either entry is `NA`, the existing row names for the corresponding object are used. Note that this only has an effect on `y` if it is a SummarizedExperiment.
`deferred`	Logical scalar indicating whether a fast deferred calculation should be used for the rank-based PCA.
`BSPARAM`	A BiocSingularParam object specifying the algorithm to use for the PCA.
`BNPARAM`	A BiocNeighborParam object specifying the algorithm to use for the neighbor search.
`BPPARAM`	A BiocParallelParam object specifying the parallelization scheme to use.
`assay.type`	String or integer scalar specifying the assay containing the matrix of interest in `x` (and `y`, if a SummarizedExperiment).

Details

In most cases, we only care about the top-correlated features, allowing us to skip a lot of unnecessary computation. This is achieved by transforming the problem of finding the largest Spearman correlation into a nearest-neighbor search in rank space. For the sake of speed, we approximate the search by performing PCA to compress the rank values for all features.
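The link between Spearman correlations and distances in rank space can be illustrated with a minimal base R sketch. This is not the package's implementation; the simulated matrix, the variable names and the choice of 5 principal components are assumptions made for illustration. It relies on the identity that, once each feature's ranks are centered and scaled to unit length, the squared Euclidean distance between two features equals 2 - 2 * rho.

```r
set.seed(100)
x <- matrix(rnorm(200), nrow = 10)  # hypothetical data: 10 features, 20 cells

# Rank each feature across cells, then center and scale to unit length.
ranks <- t(apply(x, 1, rank))
centered <- ranks - rowMeans(ranks)
scaled <- centered / sqrt(rowSums(centered^2))

# Squared Euclidean distances between features in the scaled rank space.
d2 <- as.matrix(dist(scaled))^2

# Spearman correlations between the same features.
rho <- cor(t(x), method = "spearman")

# The two are related by d2 = 2 - 2 * rho, so the nearest neighbor of a
# feature is also its most positively correlated partner.
all.equal(unname(d2), unname(2 - 2 * rho))

# Compressing the scaled ranks with PCA (here, 5 components) gives
# approximate distances that are cheaper to search.
pcs <- prcomp(scaled, rank. = 5)$x
approx.d2 <- as.matrix(dist(pcs))^2
```

Under the same scaling, the most negatively correlated partners can be found by searching against the negated vectors, as the squared distance between a feature and the negation of another is 2 + 2 * rho.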

For each direction, we compute the one-sided p-value for each feature pair using the approximate method implemented in cor.test. The FDR correction is performed by considering all possible pairs of features, as these are implicitly tested in the neighbor search. Note that this is somewhat conservative as it does not consider strong correlations outside the reported features.
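As a rough sketch of the testing step described above, the following base R code computes a one-sided approximate p-value with cor.test and then adjusts a few reported p-values as if all possible pairs had been tested. The simulated vectors, the extra p-values, the total number of pairs and the use of the Benjamini-Hochberg method are all assumptions for illustration; the text above only states that an FDR correction is applied.

```r
set.seed(42)
a <- rnorm(50)
b <- a + rnorm(50)  # hypothetical pair of positively correlated features

# One-sided test for a positive correlation, using the approximate
# (non-exact) method for the Spearman p-value.
res <- cor.test(a, b, method = "spearman",
    alternative = "greater", exact = FALSE)

# Pretend that only three pairs were reported by the search, out of a
# larger (hypothetical) number of pairs that were implicitly tested.
p <- c(res$p.value, 0.01, 0.2)
total.pairs <- 1000

# Adjust the reported p-values against the total number of pairs.
# Benjamini-Hochberg is an assumption here.
fdr <- p.adjust(p, method = "BH", n = total.pairs)
fdr
```

Passing the total number of pairs through `n` gives larger adjusted p-values than adjusting the reported pairs alone, which is the price of only keeping the top hits.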

If block is specified, correlations are computed separately for each block of cells. For each feature pair, the reported rho is set to the average of the correlations across all blocks. Similarly, the p-value corresponding to each correlation is computed separately for each block and then combined across blocks with Stouffer's method. If equiweight=FALSE, the average correlation and each per-block p-value are weighted by the number of cells.
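The blocking logic can be sketched in base R for a single feature pair. This is an illustrative outline rather than the package's code: the simulated data and variable names are hypothetical, and the weighted form of Stouffer's method shown here (weights applied to the per-block Z-scores) is an assumption about how the weighting is done.

```r
set.seed(1)
block <- rep(c("A", "B"), c(30, 70))  # hypothetical blocking factor
f1 <- rnorm(100)
f2 <- f1 + rnorm(100)

# Compute the correlation and one-sided p-value within each block.
by.block <- split(seq_along(block), block)
rho <- vapply(by.block, function(i) {
    cor(f1[i], f2[i], method = "spearman")
}, 0)
p <- vapply(by.block, function(i) {
    cor.test(f1[i], f2[i], method = "spearman",
        alternative = "greater", exact = FALSE)$p.value
}, 0)

# Weight each block by its number of cells, mimicking equiweight=FALSE.
# Using rep(1, length(by.block)) instead would weight blocks equally.
w <- lengths(by.block)

# Weighted average of the per-block correlations.
avg.rho <- weighted.mean(rho, w)

# Stouffer's method: convert p-values to Z-scores, combine, convert back.
z <- qnorm(p, lower.tail = FALSE)
combined.p <- pnorm(sum(w * z) / sqrt(sum(w^2)), lower.tail = FALSE)
```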

We only consider pairs of features that have computable correlations in at least one block. Blocks are ignored if one or the other feature has tied values (typically zeros) for all cells in that block. This means that a feature may not have any entries in feature1 if it forms no valid pairs, e.g., because it is not expressed. Similarly, the total number of rows may be less than the maximum if insufficient valid pairs are available.
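To make the handling of tied values concrete, here is a small base R sketch in which one feature is all zeros in one block. The data and the explicit check for tied values are hypothetical and only mimic the behavior described above; the package may detect such blocks differently.

```r
block <- rep(c("A", "B"), each = 10)
f1 <- c(rep(0, 10), 1:10)   # all zeros (tied) in block A
f2 <- c(10:1, (1:10)^2)

# The correlation is not computable if either feature is constant
# within a block, so we return NA for such blocks.
rho <- vapply(split(seq_along(block), block), function(i) {
    if (length(unique(f1[i])) < 2 || length(unique(f2[i])) < 2) {
        NA_real_
    } else {
        cor(f1[i], f2[i], method = "spearman")
    }
}, 0)

# Only blocks with computable correlations contribute to the average.
valid <- !is.na(rho)
avg.rho <- mean(rho[valid])
```

If no block yields a computable correlation, the pair would have nothing to average and would not be reported at all.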

Value

A List containing one or two DataFrames for results in each direction. These are named "positive" and "negative", and are generated according to direction; if direction="both", both DataFrames will be present.

Each DataFrame has up to nrow(x) * number rows, containing the top number correlated features for each feature in x. This contains the following fields:

feature1, the name (character) or row index (integer) of each feature in x. Not all features may be reported here, see Details.
feature2, the name (character) or row index (integer) of one of the top correlated features to feature1. This is another feature in x if y=NULL, otherwise it is a feature in y.
rho, the Spearman rank correlation for the current pair of feature1 and feature2.
p.value, the approximate p-value associated with rho under the null hypothesis that the correlation is zero.
FDR, the adjusted p-value.

The rows are sorted by feature1 and then p.value.

Author(s)

Aaron Lun

Examples

library(mumosa)
library(scuttle)
sce1 <- mockSCE()
sce1 <- logNormCounts(sce1)

sce2 <- mockSCE(ngenes=20) # pretend this is CITE-seq data, or something.
sce2 <- logNormCounts(sce2)

# Top 20 correlated features in 'sce2' for each feature in 'sce1':
df <- findTopCorrelations(sce1, y=sce2, number=20)
df

LTLA/mumosa documentation built on Oct. 1, 2024, 8:47 a.m.