```{r setup, include = FALSE}
knitr::opts_chunk$set(fig.width = 8, fig.height = 5)
options(width = 96)
.format <- knitr::opts_knit$get("rmarkdown.pandoc.to")
.tag <- function(N, cap) ifelse(.format == "html", paste("Figure", N, ":", cap), cap)
```
Recent years have seen the development of numerous algorithms and computational packages for the analysis of multi-omics data sets. At this point, one can find several review articles summarizing progress in the field [@subramanian2020; @graw2021; @heo2021; @picard2021; @reel2021; @vlachavas2021; @adossa2021]. As with other applications of machine learning, the kinds of problems addressed by these algorithms fall into two categories: unsupervised (e.g., clustering or class discovery) and supervised (including class comparison and class prediction) [@simon2003]. Advances in the area of unsupervised learning have been broader and deeper than advances in supervised learning.
One of the most effective unsupervised methods is Multi-Omic Factor Analysis (MOFA) [@argelaguet2018; @argelaguet2020]. A key property of MOFA is that it does not require all omics assays to have been performed on all samples under study. In particular, it can effectively discover class structure across omics data sets even when data for many patients have only been acquired on a subset of the omics technologies. As of this writing, we do not know of any supervised multi-omics method that can effectively learn to predict outcomes when samples have only been assayed on a subset of the omics data sets.
MOFA starts with a standard method -- Latent Factor Analysis -- that is known to work well on a single omics data set. It then fits a coherent model that identifies latent factors that are common to, and able to explain the data well in, all the omics data sets under study. Our investigation (unpublished) of the factors found by MOFA suggests that, at least in some cases, it is approximately equivalent to a two-step process: first, fit a latent factor model separately to each omics data set; second, combine the individual factors into joint factors that explain all of the data sets.
That re-interpretation of MOFA suggests that an analogous procedure might work for supervised analyses as well. In this article, we describe a two-step algorithm, which we call "plasma", to find models that can predict time-to-event outcomes on samples from multi-omics data sets even in the presence of incomplete data. Both steps use partial least squares (PLS): Cox regression [@bertrand2021] to learn the single-omics models, and linear regression [@mishra2022] to learn how to extend a model from one omics data set to another. To illustrate the method, we use a subset of the esophageal cancer (ESCA) data set from The Cancer Genome Atlas (TCGA).
The plasma algorithm is based on partial least squares (PLS), which has been shown to be an effective method for finding components that can predict clinically interesting outcomes [@bastien2015]. The workflow of the plasma algorithm is illustrated in Figure 1 in the case of three omics data sets. First, for each of the omics data sets, we apply the PLS Cox regression algorithm (**plsRcox** version `r packageVersion("plsRcox")` [@bertrand2021]) to the time-to-event outcome data to learn three separate predictive models (indicated in red, green, and blue, respectively). Each of these models may be incomplete, since they are not defined for patients who have not been assayed (shown in white) using that particular omics technology. Second, for each pair of omics data sets, we apply the PLS linear regression algorithm (**pls** version `r packageVersion("pls")` [@mishra2022]) to learn how to predict the coefficients of the Cox regression components from one data set using features from the other data set. This step extends each of the original models (shown in pastel red, green, and blue, respectively), in different ways, from the intersection of samples assayed on both data sets to their union. Third, we average all of the different extended models (ignoring missing data) to obtain a single coherent model of component coefficients across all omics data sets. Assuming that this process has been applied to learn the model from a training data set, we can evaluate the final Cox regression model on both the training set and a test set of patient samples.
```{r methods-fig, echo = FALSE}
SF <- system.file("Figure/methods.png", package = "plasma")
knitr::include_graphics(SF)
rm(SF)
```
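To make the workflow concrete, the sketch below simulates step 1 for a single omics data set. It is an illustration of the approach, not the plasma implementation; all object names and data are invented for the example, and steps 2 and 3 are indicated only in comments.

```{r eval = FALSE}
## Illustration only; not the plasma implementation.
library(plsRcox)

set.seed(2023)
n <- 50                                    # patients
omicsX <- matrix(rnorm(n * 20), nrow = n)  # one simulated omics data set
survTime  <- rexp(n, rate = 0.1)           # simulated follow-up times
survEvent <- rbinom(n, 1, 0.7)             # simulated event indicator

## Step 1: fit a PLS Cox model with 3 components to this data set. In
## the full workflow, this is repeated for every omics data set, using
## only the patients actually assayed on that platform.
fit1 <- plsRcox(Xplan = omicsX, time = survTime, event = survEvent, nt = 3)

## Step 2 (not shown): for each ordered pair of data sets, PLS linear
## regression predicts one model's components from the other data set's
## features, extending each model from the intersection of the two
## sample sets to their union.
## Step 3 (not shown): the extended models are averaged, ignoring
## missing values, to give a single coherent model.
```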
All computations were performed in `r R.version.string` of the R Statistical Software Environment [@Rbook]. Cox proportional hazards models for survival analysis were fit using version `r packageVersion("survival")` of the **survival** R package. We used additional exploratory graphical tools from version `r packageVersion("beanplot")` of the **beanplot** R package [@kampstra2008] and version `r packageVersion("Polychrome")` of the **Polychrome** R package [@coombes2019]. Gene enrichment (pathway or annotation) analysis was performed using ToppGene (https://toppgene.cchmc.org/) [@chen2009].
The given data for a PLS model consist of a matrix $Y$ of patients by outcomes and a matrix $X$ of patients by (potentially predictive) features from an omics data set. The underlying model posits approximate decompositions $$X = TP^T + E, \qquad Y = UQ^T + F,$$ where $E$ and $F$ are error terms, $P$ and $Q$ are orthogonal loading matrices, and $T$ (the X-scores) and $U$ (the Y-scores) are the projections of the given data. In the multi-omics case, we assume that we have $N$ different omics data sets containing potential predictors, indexed by $I = 1, \dots, N$.
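As a concrete illustration of these decompositions, the sketch below fits a PLS model to simulated data and extracts each piece using the **pls** package's accessor functions; the sizes and names here are arbitrary.

```{r eval = FALSE}
library(pls)
set.seed(123)
n <- 50; f <- 20; m <- 2
X <- matrix(rnorm(n * f), nrow = n)   # patients x features
Y <- matrix(rnorm(n * m), nrow = n)   # patients x outcomes

fit <- plsr(Y ~ X, ncomp = 3)         # PLS with 3 components
TT <- scores(fit)                     # T, the X-scores (n x 3); 'TT' avoids
                                      # clashing with R's T (TRUE)
P  <- loadings(fit)                   # P, the X-loadings (f x 3)
U  <- Yscores(fit)                    # U, the Y-scores (n x 3)
Q  <- Yloadings(fit)                  # Q, the Y-loadings (m x 3)

## E is the residual of (centered) X after the fitted components.
E <- scale(X, scale = FALSE) - TT %*% t(P)
```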
The first pass in the plasma algorithm is to solve this problem separately for each data set, obtaining solutions of the form $$X_I = T_I P_I^T + E_I, \qquad Y = U_I Q_I^T + F_I.$$ To keep track of dimensions, we let $n$ be the number of patients, $m$ the number of outcomes, $f_I$ the number of features in the $I^{th}$ data set, and $\ell_I$ the number of components desired from the $I^{th}$ data set. So $X_I$ is $n \times f_I$, $Y$ is $n \times m$, $T_I$ and $U_I$ are $n \times \ell_I$, $P_I$ is $f_I \times \ell_I$, and $Q_I$ is $m \times \ell_I$.
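This bookkeeping can be checked directly. In the sketch below, we fit one PLS model per simulated data set (ordinary PLS stands in for the Cox variant to keep the example short) and confirm the dimensions of $T_I$ and $P_I$; all sizes are arbitrary.

```{r eval = FALSE}
library(pls)
set.seed(101)
n <- 50; m <- 2
fI   <- c(30, 40, 20)                 # f_I: features per data set
ellI <- c(3, 3, 2)                    # ell_I: components per data set
Y <- matrix(rnorm(n * m), nrow = n)   # shared outcome matrix

## One PLS fit per simulated omics data set.
fits <- Map(function(f, l) {
  X <- matrix(rnorm(n * f), nrow = n)
  plsr(Y ~ X, ncomp = l)
}, fI, ellI)

lapply(fits, function(fit) dim(scores(fit)))    # each T_I is n x ell_I
lapply(fits, function(fit) dim(loadings(fit)))  # each P_I is f_I x ell_I
```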
Now suppose we concatenate all of the omics data matrices (i.e., perform the mathematical equivalent of `cbind` in R) as $$X = [X_1 | X_2 | \dots | X_N].$$
Now $X$ is an $n\times f$ matrix, where $f = \sum_I f_I$. Of course, we still have the same outcome matrix, $Y$. Decomposing $X$ is straightforward: we have $$X = [T_1P_1^T | \dots | T_NP_N^T] = [T_1 | \dots | T_N]\,\textrm{diag}(P_1, \dots, P_N)^T = TP^T,$$ where $T$ is $n\times\ell$ and the block diagonal matrix $P$ is $f\times\ell$ for $\ell = \sum_I \ell_I$.
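This identity is easy to verify numerically. The sketch below builds the block diagonal $P$ with `Matrix::bdiag` from arbitrary simulated blocks (error terms omitted so the equality is exact).

```{r eval = FALSE}
library(Matrix)      # for bdiag()
set.seed(789)
n <- 10
fI   <- c(4, 5, 6)   # features per data set
ellI <- c(2, 2, 3)   # components per data set

Tlist <- lapply(ellI, function(l) matrix(rnorm(n * l), nrow = n))
Plist <- Map(function(f, l) matrix(rnorm(f * l), nrow = f), fI, ellI)

## X = [X_1 | ... | X_N] with X_I = T_I P_I^T (no error term here).
X  <- do.call(cbind, Map(function(Tm, Pm) Tm %*% t(Pm), Tlist, Plist))
TT <- do.call(cbind, Tlist)           # T: n x ell
P  <- as.matrix(bdiag(Plist))         # P: f x ell, block diagonal
all.equal(X, TT %*% t(P), check.attributes = FALSE)   # TRUE
```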
Next, we must factor $Y$ as the product of an $n\times\ell$ matrix $U$ and the transpose of an $m\times\ell$ matrix $Q$. If we take $$U = [U_1 | \dots | U_N] \quad\textrm{and}\quad Q = [Q_1 | \dots | Q_N],$$ then they have the correct dimensions. But $$UQ^T = \sum_I U_I Q_I^T \approx NY,$$ since each term $U_I Q_I^T$ approximates $Y$; so we have to divide by $N$ somewhere.
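The factor of $N$ can likewise be checked numerically: on arbitrary matrices, the concatenated blocks reproduce the *sum* of the individual products.

```{r eval = FALSE}
set.seed(456)
n <- 10; m <- 2; N <- 3
Ulist <- lapply(1:N, function(i) matrix(rnorm(n * 2), nrow = n))
Qlist <- lapply(1:N, function(i) matrix(rnorm(m * 2), nrow = m))

U <- do.call(cbind, Ulist)
Q <- do.call(cbind, Qlist)

## U Q^T equals the sum of the U_I Q_I^T; since each term approximates
## Y, the combined product approximates N * Y and must be rescaled.
lhs <- U %*% t(Q)
rhs <- Reduce(`+`, Map(function(u, q) u %*% t(q), Ulist, Qlist))
all.equal(lhs, rhs)   # TRUE
```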
Now the second pass in the plasma algorithm is to fit the X-scores, $T_I$, from each data set using each of the other data sets $X_J$. This comes down to solving equations of the form $$X_J = W_{IJ} S_{IJ}^T + E_{IJ}, \qquad T_I = V_{IJ}R_{IJ}^T + F_{IJ},$$ where $E_{IJ}$ and $F_{IJ}$ are again error terms. This should correspond to computing the off-diagonal blocks in the decomposition above defining $P$.
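A sketch of one such cross-fit on simulated data follows; the relationship between the hypothetical `TI` and `XJ` is fabricated purely for illustration.

```{r eval = FALSE}
library(pls)
set.seed(321)
nBoth <- 40                                    # samples on both platforms
XJ <- matrix(rnorm(nBoth * 15), nrow = nBoth)  # features from data set J
## Fake X-scores for model I, noisily related to the features of J.
TI <- XJ %*% matrix(rnorm(15 * 2), 15, 2) +
      matrix(rnorm(nBoth * 2), nBoth, 2)

## Learn to predict model I's scores from data set J's features.
crossFit <- plsr(TI ~ XJ, ncomp = 2)

## Extend model I to samples assayed only on platform J;
## predict.mvr accepts a plain matrix of new predictor values.
XJnew <- matrix(rnorm(10 * 15), nrow = 10)
TIhat <- predict(crossFit, newdata = XJnew, ncomp = 2)
```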