knitr::opts_chunk$set(fig.width = 8, fig.height = 5)
options(width = 96)
.format <- knitr::opts_knit$get("rmarkdown.pandoc.to")
.tag <- function(N, cap) ifelse(.format == "html", paste("Figure", N, ":", cap), cap)
Recent years have seen the development of numerous algorithms and computational packages for the analysis of multi-omics data sets. At this point, one can find several review articles summarizing progress in the field [@subramanian2020; @graw2021; @heo2021; @picard2021; @reel2021; @vlachavas2021; @adossa2021]. As with other applications of machine learning, the kinds of problems addressed by these algorithms are divided into two categories: unsupervised (e.g., clustering or class discovery) or supervised (including class comparison and class prediction) [@simon2003]. Advances in the area of unsupervised learning have been broader and deeper than advances in supervised learning.
One of the most effective unsupervised methods is Multi-Omic Factor Analysis (MOFA) [@argelaguet2018; @argelaguet2020]. A key property of MOFA is that it does not require all omics assays to have been performed on all samples under study. In particular, it can effectively discover class structure across omics data sets even when data for many patients have only been acquired on a subset of the omics technologies. As of this writing, we do not know of any supervised multi-omics method that can effectively learn to predict outcomes when samples have only been assayed on a subset of the omics data sets.
MOFA starts with a standard method -- Latent Factor Analysis -- that is known to work well on a single omics data set. It then fits a coherent model that identifies latent factors that are common to, and able to explain the data well in, all the omics data sets under study. Our investigation (unpublished) of the factors found by MOFA suggests that, at least in some cases, it is approximately equivalent to a two-step process: first, finding latent factors separately in each omics data set; second, extending and reconciling those factors across all of the data sets.
That re-interpretation of MOFA suggests that an analogous procedure might work for supervised analyses as well. In this article, we describe a two-step algorithm, which we call "plasma", to find models that can predict time-to-event outcomes on samples from multi-omics data sets even in the presence of incomplete data. We use partial least squares (PLS) for both steps, using Cox regression [@bertrand2021] to learn the single-omics models and linear regression [@mishra2022] to learn how to extend models from one omics data set to another. To illustrate the method, we use a subset of the esophageal cancer (ESCA) data set from The Cancer Genome Atlas (TCGA).
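To give a feel for the second step, the following toy sketch fits the kind of multivariate PLS linear regression (via the pls package's plsr function) that transfers component scores from one omics data set to another. All object names here are ours for illustration; the plasma package performs this step internally.

library(pls)
# Toy data: omicsB has 50 samples and 10 features; scoresA holds three
# component scores learned from a different omics data set.
set.seed(101)
omicsB  <- matrix(rnorm(50 * 10), nrow = 50, ncol = 10)
scoresA <- matrix(rnorm(50 * 3), nrow = 50, ncol = 3)
dat <- data.frame(scoresA = I(scoresA), omicsB = I(omicsB))
# Multivariate PLS regression: predict the component scores from omicsB.
fit <- plsr(scoresA ~ omicsB, ncomp = 5, data = dat)
pred <- predict(fit, newdata = dat, ncomp = 5)
dim(pred)   # samples x components x (number of PLS components used)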
Our computational method is implemented and the data are available in the plasma package.
suppressWarnings( library(plasma) )
packageVersion("plasma")
The results included here are in whole or part based upon data generated by the TCGA Research Network. We downloaded the entire esophageal cancer Level 3 data set [@tcga2017] from the FireBrowse web site (http://firebrowse.org/) [@deng2017] in August 2018. We filtered the data sets so that only the most variable, and presumably the most informative, features were retained. Here, we load this sample data set.
loadESCAdata()
sapply(assemble, dim)
Finally, in order to be able to illustrate the ability of the plasma algorithm to work in the presence of missing data, we randomly selected 10% of the patients to remove from the miRSeq data set (leaving 166 patients) and 15% of the patients to remove from the mRNASeq data set (leaving 157 patients). We provide a summary of the outcome data below.
We recommend imputing small amounts of missing data in the input data sets. The underlying issue is that the PLS models we use for individual omics data sets will not be able to make predictions on a sample if even one data point is missing. As a result, if a sample is missing at least one data point in every omics data set, then it will be impossible to use that sample at all.
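To see why, consider a minimal sketch (our own illustration, not package code). A linear score such as a PLS component is a weighted sum of features, so a single NA makes the whole sum NA:

# A linear score X %*% beta propagates NA: any sample with even one
# missing feature gets an undefined prediction.
X <- matrix(rnorm(20), nrow = 5, ncol = 4)
beta <- rnorm(4)
X[3, 2] <- NA
drop(X %*% beta)   # the third sample's score is NA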
For a range of available methods and R packages, consult the CRAN Task View on Missing Data. We also recommend the R-miss-tastic web site on missing data. Their simulations suggest that, for purposes of producing predictive models from omics data, the imputation method is not particularly important. Because of the latter finding, we have only implemented two simple imputation methods in the plasma package:

- meanModeImputer will replace any missing data by the mean value of the observed data if there are more than five distinct values; otherwise, it will replace missing data by the mode. This approach works relatively well for both continuous data and for binary or small categorical data.
- samplingImputer replaces missing values by sampling randomly from the observed data distribution.

set.seed(54321)
imputed <- lapply(assemble, samplingImputer)
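For intuition, here is a minimal sketch of the mean-or-mode rule described above. This is a simplified stand-in that we wrote for illustration, not the package's own meanModeImputer implementation:

simpleMeanMode <- function(x) {
  observed <- x[!is.na(x)]
  if (length(unique(observed)) > 5) {
    fill <- mean(observed)               # continuous case: impute the mean
  } else {
    ux <- unique(observed)               # categorical case: impute the mode
    fill <- ux[which.max(tabulate(match(observed, ux)))]
  }
  x[is.na(x)] <- fill
  x
}
simpleMeanMode(c(0, 1, 1, NA, 0, 1))     # binary example: the NA becomes 1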
The plasma algorithm is based on Partial Least Squares (PLS), which has been shown to be an effective method for finding components that can predict clinically interesting outcomes [@bastien2015]. The workflow of the plasma algorithm is illustrated in Figure 1 in the case of three omics data sets. First, for each of the omics data sets, we apply the PLS Cox regression algorithm (plsRcox version `r packageVersion("plsRcox")` [@bertrand2021]) to the time-to-event outcome data to learn three separate predictive models (indicated in red, green, and blue, respectively). Each of these models may be incomplete, since they are not defined for patients who have not been assayed (shown in white) using that particular omics technology. Second, for each pair of omics data sets, we apply the PLS linear regression algorithm (pls version `r packageVersion("pls")` [@mishra2022]) to learn how to predict the coefficients of the Cox regression components from one data set using features from the other data set. This step extends (shown in pastel red, green, and blue, respectively) each of the original models, in different ways, from the intersection of samples assayed on both data sets to their union. Third, we average all of the different extended models (ignoring missing data) to get a single coherent model of component coefficients across all omics data sets. Assuming that this process has been applied to learn the model from a training data set, we can evaluate the final Cox regression model on both the training set and a test set of patient samples.
SF <- system.file("Figure/methods.png", package = "plasma")
knitr::include_graphics(SF)
rm(SF)
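To make the third (averaging) step concrete, here is a toy sketch of averaging extended predictions while ignoring missing values. The object names are ours for illustration; inside the package, the averaged result is stored as meanPredictions.

# Toy sketch: three extended models each predict scores for five samples
# on four components; one extension is undefined for the second sample.
set.seed(42)
extended <- list(fromRNA  = matrix(rnorm(20), nrow = 5, ncol = 4),
                 fromMeth = matrix(rnorm(20), nrow = 5, ncol = 4),
                 fromMiR  = matrix(rnorm(20), nrow = 5, ncol = 4))
extended$fromMiR[2, ] <- NA
stacked <- simplify2array(extended)            # samples x components x models
meanPred <- apply(stacked, c(1, 2), mean, na.rm = TRUE)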
All computations were performed in `r R.version.string` of the R Statistical Software Environment [@Rbook]. Cox proportional hazards models for survival analysis were fit using version `r packageVersion("survival")` of the survival R package. We used additional exploratory graphical tools from version `r packageVersion("beanplot")` of the beanplot R package [@kampstra2008] and version `r packageVersion("Polychrome")` of the Polychrome R package [@coombes2019]. Gene enrichment (pathway or annotation) analysis was performed using ToppGene (https://toppgene.cchmc.org/) [@chen2009].
Because of the layered nature of the plasma algorithm, we intend to use the following terminology to help clarify the later discussions.
To be consistent with the MOFA2 R package [@argelaguet2020], all of the data sets are arranged so that patient samples are columns and assay features are rows. Our first task is to pad each data set with appropriate NAs to ensure that each set includes the same patient samples in the same order, where that order matches the outcome data frame.
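Conceptually, the padding works like the following toy sketch (object names are ours; the prepareMultiOmics function below performs this step for every data set):

# One assay measured only patients P1-P3; the outcome table also has P4.
assay <- matrix(1:6, nrow = 2,
                dimnames = list(c("geneA", "geneB"), c("P1", "P2", "P3")))
allPatients <- c("P1", "P2", "P3", "P4")   # order taken from the outcome data
padded <- matrix(NA, nrow = nrow(assay), ncol = length(allPatients),
                 dimnames = list(rownames(assay), allPatients))
padded[, colnames(assay)] <- assay
padded                                      # P4 is a column of NAs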
MO <- prepareMultiOmics(imputed, Outcome)
summary(MO)
We see that the number of patients in each data set is now equal to the number of patients with clinical outcome data.
As indicated above, we want to separate the data set into training and test samples. We will use 60% for training and 40% for testing.
set.seed(54321)
splitVec <- rbinom(nrow(Outcome), 1, 0.6)
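As a quick sanity check (our own addition, not part of the original analysis), one can tabulate the split to confirm that roughly 60% of the samples landed in the training set:

table(splitVec)   # 0 = test, 1 = training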
Figure 2 presents a graphical overview of the number of samples (N) and the number of features (D) in each omics component of the training and test sets.
trainD <- MO[, splitVec == 1]
testD <- MO[, splitVec == 0]
opar <- par(mai = c(1.02, 1.32, 0.82, 0.22), mfrow = c(1, 2))
plot(trainD, main = "Train")
plot(testD, main = "Test")
par(opar)
rm(opar)
The first step of the plasma algorithm is to fit PLS Cox models on each omics data set using the function fitCoxModels. The returned object of class MultiplePLSCoxModels contains a list of SingleModel objects, one for each assay, and within each there are three regression models:

- plsRcoxmodel contains the coefficients of the components learned by PLS Cox regression. The number of components is determined automatically as a function of the logarithm of the number of features in the omics data set. The output of this model is a continuous prediction of "risk" for the time-to-event outcome of interest.
- riskModel is a coxph model using continuous predicted risk as a single predictor.
- splitModel is a coxph model using a binary split of the risk (at the median) as the predictor.

set.seed(23258)
suppressWarnings( firstPass <- fitCoxModels(trainD, timevar = "Days", eventvar = "vital_status", eventvalue = "dead") )
if (!interactive()) {
  plot(firstPass, legloc = "bottomleft")   # margins too small inside RStudio window
}
On the training set, each of the seven contributing omics data sets is able to find a PLS model that can successfully separate high risk from low risk patients (Figure 3).
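The median-split idea behind the splitModel described above can be illustrated in isolation on simulated data (a sketch with our own variable names, using the survival package):

library(survival)
set.seed(1)
risk <- rnorm(100)                            # continuous predicted risk scores
simTime <- rexp(100, rate = exp(0.5 * risk))  # event times depend on risk
status <- rbinom(100, 1, 0.8)                 # random censoring indicator
highRisk <- risk > median(risk)               # dichotomize at the median
coxph(Surv(simTime, status) ~ highRisk)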
The second step of the algorithm is to extend the individual omics-based models across other omics data sets. This step is performed using the plasma function, which takes in the previously created objects of class multiOmics and MultiplePLSCoxModels. The function operates iteratively, so in our case there are seven different sets of predictions of the PLS components. These different predictions are averaged and saved internally as a data frame called meanPredictions. The structure of models created and stored in the plasma object is the same as for the separate, individual, omics models. Figure 4A shows the Kaplan-Meier plot using the predicted risk, split at the median value, on the training data set.
Now we want to see how well the final composite model generalizes to our test set. Figure 4B uses the predicted risk, split at the median of the training data, to construct a Kaplan-Meier plot on the test data. The model yields a statistically significant (p = 0.0063) separation of outcomes between the high and low risk patients.
pl <- plasma(MO, firstPass)
testpred <- predict(pl, testD)
opar <- par(mfrow = c(2, 1))
plot(pl, legloc = "topright", main = "Training Data", xlab = "Time (Days)")
mtext("A", side = 2, at = 1.2, line = 3, cex = 1.5, las = 2)
plot(testpred, main = "Testing Data", xlab = "Time (Days)")
mtext("B", side = 2, at = 1.2, line = 3, cex = 1.5, las = 2)
par(opar)
rm(opar)
We want to compare the test results of the joint omics model to the predictions we make on the test data from the separate single-omics models. The Kaplan-Meier plots in Figure 5 show that most of the individual models have poor performance on the test data, especially when compared to the results of the combined model learned jointly from all the omics data sets.
mypreds <- predict(firstPass, testD, type = "survfit")
if (!interactive()) {
  opar <- par(mfrow = c(4, 2))  # , mai = c(0.2, 0.2, 0.2, 0.2))
  sapply(names(mypreds), function(N) {
    fit <- mypreds[[N]]
    col <- c("blue", "red")
    plot(fit, col = col, lwd = 2, main = N,
         xlab = "Time (Days)", ylab = "Fraction Surviving")
    legend("topright", paste(c("low", "high"), "risk"), col = col, lwd = 2)
  })
  par(opar)
}
Next, we want to compare the plasma method to an alternative approach using MOFA. Here the idea is to first run the unsupervised MOFA algorithm to find latent factors, and then use the latent factors as potential predictors in a Cox proportional hazards model. We use the training data from our plasma analysis as inputs to MOFA2.
hasMOFA <- suppressMessages( require("MOFA2", quietly = TRUE) )
suppressMessages( mofaObject <- create_mofa(trainD@data) )
We set the options to the MOFA algorithm by indicating which data sets are binary and by choosing to find 10 latent factors.
dataoptions <- get_default_data_options(mofaObject)
modeloptions <- MOFA2::get_default_model_options(mofaObject)
modeloptions$num_factors <- NF <- 10
modeloptions[["likelihoods"]][["ClinicalBin"]] <- "bernoulli"
modeloptions[["likelihoods"]][["MAF"]] <- "bernoulli"
trainingoptions <- get_default_training_options(mofaObject)
suppressMessages( suppressWarnings(
  mofaObject <- prepare_mofa(mofaObject,
                             data_options = dataoptions,
                             model_options = modeloptions,
                             training_options = trainingoptions)
) )
Now we can run the algorithm.
suppressMessages( suppressWarnings( mofaESCA <- run_mofa(mofaObject, use_basilisk = TRUE) ) )
We want to see whether any of the latent factors are associated with overall survival.
LF <- get_factors(mofaESCA, groups = "all", factors = "all")[[1]]
LFwithOutcome <- data.frame(LF, trainD@outcome[, c("Days", "vital_status")])
osmodel <- coxph(Surv(Days, vital_status == "dead") ~ ., data = LFwithOutcome)
summary(osmodel)
AIC <- step(osmodel, trace = 0)
AIC
summary(AIC)
Factors 2, 3, and 4 are associated with survival, although the model itself is, at best, only borderline significant.
mofarisk <- survival:::predict.coxph(AIC)
traindata <- data.frame(trainD@outcome, mofarisk)
traindata$Cut <- mofarisk > 0   # split at a linear predictor of 0, i.e., average risk
mofamodel <- coxph(Surv(Days, vital_status == "dead") ~ Cut, data = traindata)
Strain <- summary(mofamodel)
The hard part is to reverse-engineer the MOFA model so we can use its decompositions to make predictions on the test data.
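The underlying linear algebra (our summary of the step below): MOFA approximates each centered data matrix as a product of weights and factors, so the test-set factor scores can be recovered by least squares. Writing $Y$ for the stacked, centered test data and $W$ for the stacked weight matrices,

$$ Y \approx W F, \qquad \hat{F} = \operatorname*{argmin}_{F} \| Y - W F \|_F^2 = (W^\top W)^{-1} W^\top Y, $$

which is what qr.solve(Woo, Yoo) computes in the code below.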
weightmats <- get_weights(mofaESCA)
Woo <- do.call(rbind, weightmats)   # stack the feature-by-factor weight matrices
testMats <- testD@data
testShift <- lapply(testMats, function(X) {
  if (length(unique(as.vector(X))) < 10) {
    X                               # leave binary/categorical assays alone
  } else {
    t(scale(t(X), scale = FALSE))   # center each continuous feature
  }
})
Yoo <- do.call(rbind, testShift)    # stack the centered test data
Voo <- t(qr.solve(Woo, Yoo))        # least-squares factor scores for the test samples
punk <- survival:::predict.coxph(AIC, as.data.frame(Voo))
testdata <- data.frame(testD@outcome, punk)
testdata$Cut <- punk > 0
testmodel <- coxph(Surv(Days, vital_status == "dead") ~ Cut, data = testdata)
Stest <- summary(testmodel)
col2 <- c("blue", "red")
opar <- par(mfrow = c(2, 1))
plot(survfit(Surv(Days, vital_status == "dead") ~ Cut, data = traindata),
     col = col2, lwd = 3, main = "MOFA model, training data", xlab = "Time (Days)",
     sub = paste("Score test p-value = ", round(Strain$sctest[3], 3)))
legend("topright", c("Low Risk", "High Risk"), col = col2, lwd = 2)
mtext("A", side = 2, at = 1.2, line = 3, cex = 1.5, las = 2)
plot(survfit(Surv(Days, vital_status == "dead") ~ Cut, data = testdata),
     xlab = "Time (Days)", col = col2, lwd = 3, main = "MOFA model, test data",
     sub = round(Stest$sctest[3], 3))
legend("topright", c("Low Risk", "High Risk"), col = col2, lwd = 2)
mtext("B", side = 2, at = 1.2, line = 3, cex = 1.5, las = 2)
par(opar)
rm(opar)
The plasma package includes several tools to assist with the interpretation of components and of the features contributing to the final model. These are illustrated in the "interpretation" vignette.
We have described a method analogous to that of MOFA that allows us to combine different omics data sets without the need for massive prior imputation of missing values. A major difference is that while the MOFA model learns "factors" that are composites of the variables in an unsupervised fashion, the plasma model learns "components" that are composites of the variables in a supervised fashion, using the outcomes "event" and "time-to-event" as response variables.
Although the factors from MOFA are defined such that the first factor, Factor 1, accounts for the greatest variance in the model, the factors may or may not be significantly associated with the outcome, and a post-hoc survival analysis would need to be done to assess this. It may also happen that some factors, although significantly associated with outcome, account for very little variance in the final MOFA model, which hinders interpretability. This was the case with the TCGA-ESCA data set: when 10 factors were learned from the MOFA model, only three factors were significantly associated with survival, and the resulting model did not generalize to the test data set.
By contrast, the components for plasma are created in a way that maximizes the covariance between the predictors and the response, and therefore these components will automatically be associated, to some degree, with the outcome. This observation could be advantageous in that dissecting the weights associated with the components would yield a list of variables from different omics data sets that contribute the most to defining the outcome, and any additional analyses could be refined by looking most closely at these highly weighted variables.
We were surprised to find that most of the models from the individual data sets did not generalize well to the test data. Nevertheless, the joint model that combined the components from those data sets performed well in predicting the risk of patients in the test data set. It is unclear why this happens. We speculate that the phenomenon may be related to the more general idea of "bagging" weak predictors to create a more effective composite model [@boulesteix2010; @abdia2017; @hillebrand2021; @poudel2022]. More research will be required to determine how general this phenomenon is.