rescaleByNeighbors: Rescale matrices for different modes

rescaleByNeighbors    R Documentation

Rescale matrices for different modes

Description

Rescale matrices for different data modalities so that their distances are more comparable, using the distances to neighbors to approximate noise.

Usage

rescaleByNeighbors(x, ...)

## S4 method for signature 'ANY'
rescaleByNeighbors(
  x,
  k = 50,
  weights = NULL,
  combine = TRUE,
  BNPARAM = KmknnParam(),
  BPPARAM = SerialParam()
)

## S4 method for signature 'SummarizedExperiment'
rescaleByNeighbors(x, assays, extras = list(), ...)

## S4 method for signature 'SingleCellExperiment'
rescaleByNeighbors(
  x,
  assays = NULL,
  dimreds = NULL,
  altexps = NULL,
  altexp.assay = "logcounts",
  extras = list(),
  ...
)

Arguments

x

A list of numeric matrices where each row is a cell and each column is some dimension/variable. For gene expression data, this is usually the matrix of PC coordinates. All matrices should have the same number of rows.

Alternatively, a SummarizedExperiment containing relevant matrices in its assays.

Alternatively, a SingleCellExperiment containing relevant matrices in its assays, reducedDims or altExps.

...

For the generic, further arguments to pass to specific methods.

For the SummarizedExperiment and SingleCellExperiment methods, further arguments to pass to the ANY method.

k

An integer scalar specifying the number of neighbors to use for the distance calculation.

weights

A numeric vector with one entry per mode (i.e., of the same length as x, if x is a list), specifying the weight of each mode. Defaults to equal weights for all modes. See Details for how the weights are matched to modes when x is a SummarizedExperiment or SingleCellExperiment.

combine

A logical scalar specifying whether the rescaled matrices should be combined into a single matrix.

BNPARAM

A BiocNeighborParam object specifying the algorithm to use for the nearest-neighbor search.

BPPARAM

A BiocParallelParam object specifying the parallelization for the nearest-neighbor search.

assays

A character or integer vector of assays to extract and transpose for use in the ANY method. For the SingleCellExperiment, this argument can be missing, in which case no assays are used.

extras

A list of further matrices of similar structure to those matrices in a list-like x.

dimreds

A character or integer vector of reducedDims to extract for use in the ANY method. This argument can be missing, in which case no reduced dimensions are used.

altexps

A character or integer vector of altExps to extract and transpose for use in the ANY method. This argument can be missing, in which case no alternative experiments are used.

altexp.assay

A character or integer vector specifying the assay to extract from alternative experiments, when altexps is specified. This is recycled to the same length as altexps.

Details

When dealing with multi-modal data, we may wish to combine all modes into a single matrix for downstream processing. However, a naive cbind does not account for the fact that different modes may have very different scales and numbers of features. A mode with a larger scale or more features may dominate steps such as clustering or dimensionality reduction. This function attempts to rescale the contents of each matrix so that the modes are more comparable.

A naive approach to rescaling would be to just equalize the total variances across matrices. This is not ideal as it fails to consider the differences in biological variation captured by each mode. For example, if a biological phenomenon is only present in one mode, that matrix's total variance would naturally be higher. Scaling all matrices to the same total variance would suppress genuine variation and inflate the relative contribution of noise.

We instead use the distance to the kth nearest neighbor as an estimate of the per-mode “noise”. Modes with more features or higher technical noise will have larger distances, and downscaling each matrix by the median distance will correct for differences between modes. At the same time, by only considering the nearest neighbors, we avoid capturing (and inadvertently eliminating) variance due to mode-specific population structure.
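The idea can be sketched in base R. This is only an illustration under stated assumptions, not the function's actual implementation: median_knn_distance is a hypothetical helper that finds neighbors by brute force with dist(), whereas the function itself performs the search according to BNPARAM. The simulated matrices and the choice of k = 20 are likewise made up for the example.

```r
# Hypothetical helper: median distance to the k-th nearest neighbor,
# computed by brute force (only sensible for small matrices).
median_knn_distance <- function(mat, k) {
    d <- as.matrix(dist(mat))   # all pairwise Euclidean distances between cells
    diag(d) <- Inf              # a cell is not its own neighbor
    kth <- apply(d, 1, function(row) sort(row)[k])
    median(kth)
}

set.seed(100)
ncells <- 200
mode_a <- matrix(rnorm(ncells * 10), ncol = 10)          # few features, low noise
mode_b <- matrix(rnorm(ncells * 50, sd = 5), ncol = 50)  # more features, more noise

k <- 20
med_a <- median_knn_distance(mode_a, k)
med_b <- median_knn_distance(mode_b, k)

# Downscale each mode by its own median distance.
scaled_a <- mode_a / med_a
scaled_b <- mode_b / med_b

# Median distances before and after rescaling.
c(before_a = med_a, before_b = med_b)
c(after_a = median_knn_distance(scaled_a, k),
  after_b = median_knn_distance(scaled_b, k))
```

Because Euclidean distances scale linearly with the data, both modes end up with the same median distance to the kth nearest neighbor after the division, even though mode_b started with a much larger one.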

The default approach is to weight each mode equally during the rescaling process, i.e., the median distance to the kth nearest neighbor will be equal for all modes after rescaling. However, we can also set weights to control the fold-differences in the median distances. For example, a weight of 2 for one mode would mean that its median distance after rescaling is twice as large as that from a mode with a weight of 1. This may be useful for prioritizing modes that are more likely to be important.
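A minimal base R sketch of how weights could enter the scaling, assuming the scaling factor for each mode is its weight divided by its median distance. That assumption is inferred from the description above rather than taken from the function's source, and median_knn_distance, the simulated modes and k = 10 are all hypothetical.

```r
# Hypothetical brute-force helper standing in for the neighbor search.
median_knn_distance <- function(mat, k) {
    d <- as.matrix(dist(mat))
    diag(d) <- Inf
    median(apply(d, 1, function(row) sort(row)[k]))
}

set.seed(42)
modes <- list(
    rna = matrix(rnorm(150 * 8), ncol = 8),
    adt = matrix(rnorm(150 * 4, sd = 3), ncol = 4)
)
weights <- c(1, 2)   # one weight per mode, in the same order as 'modes'
k <- 10

# Median distance to the k-th nearest neighbor for each mode.
med <- vapply(modes, median_knn_distance, numeric(1), k = k)

# Assumed scaling: weight / median distance.
rescaled <- Map(function(mat, s) mat * s, modes, weights / med)

# Ratio of the median distances after rescaling (adt relative to rna).
after <- vapply(rescaled, median_knn_distance, numeric(1), k = k)
after[["adt"]] / after[["rna"]]
```

Under this assumption the ratio printed at the end matches the ratio of the weights, up to floating-point error, which is the behavior described for a weight of 2 versus a weight of 1.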

The correspondence between non-NULL weights and the modes is slightly tricky when x is not a list. If x is a SummarizedExperiment, the modes are ordered as: all entries in assays in the specified order, then all entries in extras. If x is a SingleCellExperiment, the modes are ordered as: all entries in assays in the specified order, then all entries in dimreds, then all entries in altexps, and finally all entries in extras.
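For instance, with a SingleCellExperiment that supplies one reduced dimension and one alternative experiment, the ordering above implies that the first weight applies to the dimreds entry and the second to the altexps entry. The sketch below assumes the scater and mumosa packages are installed and reuses the mock data setup from the Examples section; the weights of 1 and 2 are arbitrary choices for illustration.

```r
library(scater)
library(mumosa)

# Mock gene expression data with PCA coordinates.
set.seed(1000)
sce <- mockSCE()
sce <- logNormCounts(sce)
sce <- runPCA(sce)

# Mock ADT data stored as an alternative experiment.
adt <- mockSCE(ngenes = 20)
adt <- logNormCounts(adt)
altExp(sce, "ADT") <- adt

# No assays are requested, so the modes are ordered as
# dimreds ("PCA") first, then altexps ("ADT").
# weights[1] therefore applies to PCA and weights[2] to ADT.
weighted <- rescaleByNeighbors(sce, dimreds = "PCA", altexps = "ADT",
    weights = c(1, 2))
dim(weighted)
```

Naming the modes explicitly in dimreds and altexps, as done here, makes it easier to check which weight goes with which mode.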

Value

If combine=TRUE, a numeric matrix with number of rows equal to the number of cells, where the columns span all variables across all modes supplied in x. Values are scaled so that each mode contributes the specified weight to downstream Euclidean distance calculations.

If combine=FALSE, a list of rescaled matrices is returned instead.

Author(s)

Aaron Lun

Examples

# Mocking up a gene expression + ADT dataset:
library(scater)
library(mumosa)
exprs_sce <- mockSCE()
exprs_sce <- logNormCounts(exprs_sce)
exprs_sce <- runPCA(exprs_sce)

adt_sce <- mockSCE(ngenes=20) 
adt_sce <- logNormCounts(adt_sce)
altExp(exprs_sce, "ADT") <- adt_sce

combined <- rescaleByNeighbors(exprs_sce, dimreds="PCA", altexps="ADT")
dim(combined)


LTLA/mumosa documentation built on Aug. 13, 2024, 1:31 a.m.