embedSamples: Spectral embedding of biological samples


Description

Non-linear learning of a data representation that captures the intrinsic geometry of the trajectory. This function performs spectral decomposition of a graph encoding conditional entropy-based sample-to-sample similarities.

Usage

embedSamples(x, design = NULL)

## S4 method for signature 'matrix'
embedSamples(x, design = NULL)

Arguments

x

A SingleCellExperiment object or a numeric matrix with samples in columns and features in rows

design

A numeric matrix describing the factors that should be blocked

Details

Single-cell gene expression measurements comprise high-dimensional data of large volume: many features (e.g., genes) are measured in many samples (e.g., cells). More formally, m samples are described by the expression of n features (i.e., n dimensions). The cells' expression profiles are shaped by many distinct unobserved biological causes related to each cell's geno- and phenotype, such as developmental age, tissue region of origin, and cell cycle stage, by extrinsic sources such as the status of signaling receptors and environmental stressors, and also by technical noise. In other words, a single dimension, despite containing only gene expression information, represents an underlying combination of multiple dependent and independent, relevant and non-relevant factors, and each factor's individual contribution is non-uniform. To obtain a better resolution and to extract the underlying information, CellTrails aims to find a meaningful low-dimensional structure (a manifold) that represents cells mainly by their temporal relation along a biological process.

This method assumes that the expression vectors lie on or near a manifold of dimensionality d that is embedded in the n-dimensional space. By using spectral embedding, CellTrails aims to amplify latent temporal information; it reduces noise (i.e., truncates non-relevant dimensions) by transforming the expression matrix into a new dataset while retaining the geometry of the original dataset as much as possible. CellTrails captures overall cell-to-cell relations based on the statistical mutual dependency between any two data vectors. A high dependency between two samples should be represented by their close proximity in the lower-dimensional space.

First, the mutual dependency between samples is scored using mutual information. This entropy framework naturally requires discretization of the data vectors by an indicator function, which assigns each continuous data point (expression value) to exactly one discrete interval (e.g., low, mid or high). However, measurement points located close to the interval borders may be wrongly assigned due to noise-induced fluctuations. Therefore, CellTrails fuzzifies the indicator function by using a piecewise polynomial function, i.e., the domain of each sample expression vector is divided into contiguous intervals (based on Daub et al., 2004). Second, the computed mutual information matrix, which is left-bounded and composed of bits, is scaled to a generalized correlation coefficient. Third, CellTrails constructs a simple complete graph with m nodes, one for each data vector (i.e., sample), and weights each edge between two nodes by a heat kernel function applied to the generalized correlation coefficient. Finally, nonlinear spectral embedding (i.e., spectral decomposition of the graph's adjacency matrix) is performed (Belkin & Niyogi, 2003; Sussman et al., 2012), unfolding the manifold. Please note that this method only uses the set of defined trajectory features in a SingleCellExperiment object; spike-in controls are ignored and are not listed as trajectory features.
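The four steps can be illustrated with a minimal base R sketch. It is not the CellTrails implementation: it swaps the fuzzy B-spline indicator for hard equal-width binning, computes mutual information in nats rather than bits, assumes the scaling r = sqrt(1 - exp(-2 * I)) for the generalized correlation coefficient, and assumes a heat kernel of the form exp(-(1 - r)^2 / sigma). The helper `mutual_info`, the bandwidth `sigma`, the number of bins, and the toy matrix are all hypothetical choices made for illustration.

```r
# Illustrative sketch only; NOT the CellTrails implementation.
set.seed(1)

# Toy expression matrix: 20 features (rows) x 6 samples (columns)
x <- matrix(rexp(120), nrow = 20, ncol = 6,
            dimnames = list(NULL, paste0("s", 1:6)))

# Step 1: mutual information (in nats) between two samples.
# Hard equal-width binning stands in for the fuzzy indicator function.
mutual_info <- function(a, b, bins = 3) {
  joint <- table(cut(a, bins), cut(b, bins)) / length(a)
  indep <- outer(rowSums(joint), colSums(joint))
  nz <- joint > 0
  sum(joint[nz] * log(joint[nz] / indep[nz]))
}

m <- ncol(x)
mi <- matrix(0, m, m, dimnames = list(colnames(x), colnames(x)))
for (i in seq_len(m)) {
  for (j in i:m) {
    # Fill both triangles so the matrix is exactly symmetric
    mi[i, j] <- mi[j, i] <- mutual_info(x[, i], x[, j])
  }
}

# Step 2: scale to a generalized correlation coefficient in [0, 1)
# (assumed formula; pmax guards against tiny negative rounding errors)
r <- sqrt(1 - exp(-2 * pmax(mi, 0)))

# Step 3: weight the edges of the complete graph with a heat kernel
# (assumed form; sigma is a hypothetical bandwidth)
sigma <- 1
w <- exp(-(1 - r)^2 / sigma)
diag(w) <- 0  # a simple graph has no self-loops

# Step 4: spectral decomposition of the weighted adjacency matrix
eig <- eigen(w, symmetric = TRUE)
embedding <- eig$vectors[, 1:2]  # first two latent components
rownames(embedding) <- colnames(x)
```

Each row of `embedding` corresponds to one sample; samples with high mutual dependency receive larger edge weights and should tend to lie closer together in the latent space.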

To account for systematic bias in the expression data (e.g., cell cycle effects), a design matrix can be provided for the learning process. It should list the factors that should be blocked and their values per sample. It is suggested to construct a design matrix with model.matrix.
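As a minimal sketch of building such a design matrix, assuming a hypothetical per-sample cell cycle annotation (the factor name `phase` and its levels are made up for illustration):

```r
# Hypothetical per-sample annotation: one entry per sample (column of x)
phase <- factor(c("G1", "S", "G2M", "G1", "S", "G2M"))

# model.matrix (stats package) expands the factor into numeric columns,
# one row per sample
design <- model.matrix(~ phase)

# Assumed call, following the Usage section above:
# res <- embedSamples(x, design = design)
```

The number of rows in `design` has to match the number of samples. Whether the intercept column produced by `model.matrix` should be kept is not stated here, so check the package vignette before relying on this layout.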

Diagnostic messages

The method throws an error if the expression matrix contains samples with zero entropy (e.g., samples that exclusively contain non-detects, that is, all expression values are zero).
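A simple pre-check can flag such samples before calling the function. The sketch below treats any sample whose values are all identical as having zero entropy; this is an assumed proxy, since the exact criterion depends on how CellTrails discretizes the data. The toy matrix and variable names are hypothetical.

```r
# Toy matrix with samples in columns; s2 contains only non-detects
x <- cbind(s1 = c(0, 2, 5, 1),
           s2 = c(0, 0, 0, 0),
           s3 = c(3, 0, 1, 4))

# A sample whose values are all identical carries no information
is_constant <- apply(x, 2, function(v) length(unique(v)) == 1)

# Keep only informative samples; drop = FALSE preserves the matrix shape
x_filtered <- x[, !is_constant, drop = FALSE]
```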

Value

A list containing the following components:

eigenvectors

Ordered components of latent space

eigenvalues

Information content of latent components

Author(s)

Daniel C. Ellwanger

References

Daub, C.O., Steuer, R., Selbig, J., and Kloska, S. (2004). Estimating mutual information using B-spline functions – an improved similarity measure for analysing gene expression data. BMC Bioinformatics 5, 118.

Belkin, M., and Niyogi, P. (2003). Laplacian eigenmaps for dimensionality reduction and data representation. Neural computation 15, 1373-1396.

Sussman, D.L., Tang, M., Fishkind, D.E., and Priebe, C.E. (2012). A Consistent Adjacency Spectral Embedding for Stochastic Blockmodel Graphs. J Am Stat Assoc 107, 1119-1128.

See Also

SingleCellExperiment, trajectoryFeatureNames, model.matrix

Examples

# Example data
data(exSCE)

# Embed samples
res <- embedSamples(exSCE)

dcellwanger/CellTrails documentation built on May 12, 2020, 2:01 a.m.