```{r setup, include = FALSE}
knitr::opts_chunk$set(fig.width = 8, fig.height = 5)
options(width = 96)
.format <- knitr::opts_knit$get("rmarkdown.pandoc.to")
.tag <- function(N, cap) ifelse(.format == "html", paste("Figure", N, ":", cap), cap)
```
Recent years have seen the development of numerous algorithms and computational packages for the analysis of multi-omics data sets. At this point, one can find several review articles summarizing progress in the field [@subramanian2020; @graw2021; @heo2021; @picard2021; @reel2021; @vlachavas2021; @adossa2021]. As with other applications of machine learning, the kinds of problems addressed by these algorithms fall into two categories: unsupervised (e.g., clustering or class discovery) and supervised (including class comparison and class prediction) [@simon2003]. Advances in the area of unsupervised learning have been broader and deeper than advances in supervised learning.
One of the most effective unsupervised methods is Multi-Omic Factor Analysis (MOFA) [@argelaguet2018; @argelaguet2020]. A key property of MOFA is that it does not require all omics assays to have been performed on all samples under study. In particular, it can effectively discover class structure across omics data sets even when data for many patients have only been acquired on a subset of the omics technologies. As of this writing, we do not know of any supervised multi-omics method that can effectively learn to predict outcomes when samples have only been assayed on a subset of the omics data sets.
MOFA starts with a standard method -- Latent Factor Analysis -- that is known to work well on a single omics data set. It then fits a coherent model that identifies latent factors that are common to, and able to explain the data well in, all the omics data sets under study. Our investigation (unpublished) of the factors found by MOFA suggests that, at least in some cases, it is approximately equivalent to a two-step process: first, fit a latent factor model separately to each omics data set; second, combine the individual factors into joint factors that explain all of the data sets.
That re-interpretation of MOFA suggests that an analogous procedure might work for supervised analyses as well. In this article, we describe a two-step algorithm, which we call "plasma", to find models that can predict time-to-event outcomes on samples from multi-omics data sets even in the presence of incomplete data. Both steps use partial least squares (PLS): Cox regression [@bertrand2021] to learn the single-omics models, and linear regression [@mishra2022] to learn how to extend a model from one omics data set to another. To illustrate the method, we use a subset of the esophageal cancer (ESCA) data set from The Cancer Genome Atlas (TCGA).
The plasma algorithm is based on partial least squares (PLS), which has been shown to be an effective method for finding components that can predict clinically interesting outcomes [@bastien2015]. The workflow of the plasma algorithm is illustrated in Figure 1 in the case of three omics data sets. First, for each of the omics data sets, we apply the PLS Cox regression algorithm (**plsRcox** version `r packageVersion("plsRcox")` [@bertrand2021]) to the time-to-event outcome data to learn three separate predictive models (indicated in red, green, and blue, respectively). Each of these models may be incomplete, since they are not defined for patients who have not been assayed (shown in white) using that particular omics technology. Second, for each pair of omics data sets, we apply the PLS linear regression algorithm (**pls** version `r packageVersion("pls")` [@mishra2022]) to learn how to predict the coefficients of the Cox regression components from one data set using features from the other data set. This step extends each of the original models (shown in pastel red, green, and blue, respectively), in different ways, from the intersection of samples assayed on both data sets to their union. Third, we average all of the different extended models (ignoring missing data) to obtain a single coherent model of component coefficients across all omics data sets. Assuming that this process has been applied to learn the model from a training data set, we can evaluate the final Cox regression model on both the training set and a test set of patient samples.
```{r methods-fig, echo = FALSE}
SF <- system.file("Figure/methods.png", package = "plasma")
knitr::include_graphics(SF)
rm(SF)
```
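To make the workflow concrete, the sketch below simulates step 1 for a single omics data set. It is an illustration of the approach, not the plasma implementation; all object names and data are invented for the example, and steps 2 and 3 are indicated only in comments.

```{r eval = FALSE}
## Illustration only; not the plasma implementation.
library(plsRcox)

set.seed(2023)
n <- 50                                    # patients
omicsX <- matrix(rnorm(n * 20), nrow = n)  # one simulated omics data set
survTime  <- rexp(n, rate = 0.1)           # simulated follow-up times
survEvent <- rbinom(n, 1, 0.7)             # simulated event indicator

## Step 1: fit a PLS Cox model with 3 components to this data set. In
## the full workflow, this is repeated for every omics data set, using
## only the patients actually assayed on that platform.
fit1 <- plsRcox(Xplan = omicsX, time = survTime, event = survEvent, nt = 3)

## Step 2 (not shown): for each ordered pair of data sets, PLS linear
## regression predicts one model's components from the other data set's
## features, extending each model from the intersection of the two
## sample sets to their union.
## Step 3 (not shown): the extended models are averaged, ignoring
## missing values, to give a single coherent model.
```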
All computations were performed in `r R.version.string` of the R Statistical Software Environment [@Rbook]. Cox proportional hazards models for survival analysis were fit using version `r packageVersion("survival")` of the **survival** R package. We used additional exploratory graphical tools from version `r packageVersion("beanplot")` of the **beanplot** R package [@kampstra2008] and version `r packageVersion("Polychrome")` of the **Polychrome** R package [@coombes2019]. Gene enrichment (pathway or annotation) analysis was performed using ToppGene (https://toppgene.cchmc.org/) [@chen2009].
The given data for a PLS model consist of a matrix $Y$ of patients by outcomes and a matrix $X$ of patients by (potentially predictive) features from an omics data set. The underlying model posits approximate decompositions $$X = TP^T + E, \qquad Y = UQ^T + F,$$ where $E$ and $F$ are error terms, $P$ and $Q$ are orthogonal loading matrices, and $T$ (the X-scores) and $U$ (the Y-scores) are the projections of the given data. In the multi-omics case, we assume that we have $N$ different omics data sets containing potential predictors, indexed by $I = 1, \dots, N$.
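As a concrete illustration of these decompositions, the sketch below fits a PLS model to simulated data and extracts each piece using the **pls** package's accessor functions; the sizes and names here are arbitrary.

```{r eval = FALSE}
library(pls)
set.seed(123)
n <- 50; f <- 20; m <- 2
X <- matrix(rnorm(n * f), nrow = n)   # patients x features
Y <- matrix(rnorm(n * m), nrow = n)   # patients x outcomes

fit <- plsr(Y ~ X, ncomp = 3)         # PLS with 3 components
TT <- scores(fit)                     # T, the X-scores (n x 3); 'TT' avoids
                                      # clashing with R's T (TRUE)
P  <- loadings(fit)                   # P, the X-loadings (f x 3)
U  <- Yscores(fit)                    # U, the Y-scores (n x 3)
Q  <- Yloadings(fit)                  # Q, the Y-loadings (m x 3)

## E is the residual of (centered) X after the fitted components.
E <- scale(X, scale = FALSE) - TT %*% t(P)
```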
The first pass in the plasma algorithm is to solve this problem separately for each data set, obtaining solutions of the form $$X_I = T_I P_I^T + E_I, \qquad Y = U_I Q_I^T + F_I.$$ To keep track of dimensions, we let $n$ be the number of patients, $m$ the number of outcomes, $f_I$ the number of features in the $I^{th}$ data set, and $\ell_I$ the number of components desired from the $I^{th}$ data set. So $X_I$ is $n \times f_I$, $Y$ is $n \times m$, $T_I$ and $U_I$ are $n \times \ell_I$, $P_I$ is $f_I \times \ell_I$, and $Q_I$ is $m \times \ell_I$.
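This bookkeeping can be checked directly. In the sketch below, we fit one PLS model per simulated data set (ordinary PLS stands in for the Cox variant to keep the example short) and confirm the dimensions of $T_I$ and $P_I$; all sizes are arbitrary.

```{r eval = FALSE}
library(pls)
set.seed(101)
n <- 50; m <- 2
fI   <- c(30, 40, 20)                 # f_I: features per data set
ellI <- c(3, 3, 2)                    # ell_I: components per data set
Y <- matrix(rnorm(n * m), nrow = n)   # shared outcome matrix

## One PLS fit per simulated omics data set.
fits <- Map(function(f, l) {
  X <- matrix(rnorm(n * f), nrow = n)
  plsr(Y ~ X, ncomp = l)
}, fI, ellI)

lapply(fits, function(fit) dim(scores(fit)))    # each T_I is n x ell_I
lapply(fits, function(fit) dim(loadings(fit)))  # each P_I is f_I x ell_I
```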
Now suppose we concatenate all of the omics data matrices (i.e., perform the mathematical equivalent of `cbind` in R) as $$X = [X_1 | X_2 | \dots | X_N].$$
Now $X$ is an $n\times f$ matrix, where $f = \sum_I f_I$. Of course, we still have the same outcome matrix, $Y$. Decomposing $X$ is straightforward: we have $$X = [T_1P_1^T | \dots | T_NP_N^T] = [T_1 | \dots | T_N]\,\textrm{diag}(P_1, \dots, P_N)^T = TP^T,$$ where $T$ is $n\times\ell$ and the block diagonal matrix $P$ is $f\times\ell$ for $\ell = \sum_I \ell_I$.
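This identity is easy to verify numerically. The sketch below builds the block diagonal $P$ with `Matrix::bdiag` from arbitrary simulated blocks (error terms omitted so the equality is exact).

```{r eval = FALSE}
library(Matrix)      # for bdiag()
set.seed(789)
n <- 10
fI   <- c(4, 5, 6)   # features per data set
ellI <- c(2, 2, 3)   # components per data set

Tlist <- lapply(ellI, function(l) matrix(rnorm(n * l), nrow = n))
Plist <- Map(function(f, l) matrix(rnorm(f * l), nrow = f), fI, ellI)

## X = [X_1 | ... | X_N] with X_I = T_I P_I^T (no error term here).
X  <- do.call(cbind, Map(function(Tm, Pm) Tm %*% t(Pm), Tlist, Plist))
TT <- do.call(cbind, Tlist)           # T: n x ell
P  <- as.matrix(bdiag(Plist))         # P: f x ell, block diagonal
all.equal(X, TT %*% t(P), check.attributes = FALSE)   # TRUE
```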
Next, we must factor $Y$ as the product of an $n\times\ell$ matrix $U$ and the transpose of an $m\times\ell$ matrix $Q$. If we take $$U = [U_1 | \dots | U_N] \quad\textrm{and}\quad Q = [Q_1 | \dots | Q_N],$$ then they have the correct dimensions. But $$UQ^T = \sum_I U_I Q_I^T \approx NY,$$ since each term $U_I Q_I^T$ approximates $Y$; so we have to divide by $N$ somewhere.
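The factor of $N$ can likewise be checked numerically: on arbitrary matrices, the concatenated blocks reproduce the *sum* of the individual products.

```{r eval = FALSE}
set.seed(456)
n <- 10; m <- 2; N <- 3
Ulist <- lapply(1:N, function(i) matrix(rnorm(n * 2), nrow = n))
Qlist <- lapply(1:N, function(i) matrix(rnorm(m * 2), nrow = m))

U <- do.call(cbind, Ulist)
Q <- do.call(cbind, Qlist)

## U Q^T equals the sum of the U_I Q_I^T; since each term approximates
## Y, the combined product approximates N * Y and must be rescaled.
lhs <- U %*% t(Q)
rhs <- Reduce(`+`, Map(function(u, q) u %*% t(q), Ulist, Qlist))
all.equal(lhs, rhs)   # TRUE
```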
Now the second pass in the plasma algorithm is to fit the X-scores, $T_I$, from each data set using each of the other data sets $X_J$. This comes down to solving equations of the form $$X_J = W_{IJ} S_{IJ}^T + E_{IJ}, \qquad T_I = V_{IJ}R_{IJ}^T + F_{IJ},$$ where $E_{IJ}$ and $F_{IJ}$ are again error terms. This should correspond to computing the off-diagonal blocks in the decomposition above defining $P$.
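A sketch of one such cross-fit on simulated data follows; the relationship between the hypothetical `TI` and `XJ` is fabricated purely for illustration.

```{r eval = FALSE}
library(pls)
set.seed(321)
nBoth <- 40                                    # samples on both platforms
XJ <- matrix(rnorm(nBoth * 15), nrow = nBoth)  # features from data set J
## Fake X-scores for model I, noisily related to the features of J.
TI <- XJ %*% matrix(rnorm(15 * 2), 15, 2) +
      matrix(rnorm(nBoth * 2), nBoth, 2)

## Learn to predict model I's scores from data set J's features.
crossFit <- plsr(TI ~ XJ, ncomp = 2)

## Extend model I to samples assayed only on platform J;
## predict.mvr accepts a plain matrix of new predictor values.
XJnew <- matrix(rnorm(10 * 15), nrow = 10)
TIhat <- predict(crossFit, newdata = XJnew, ncomp = 2)
```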