plasma: Partial LeAst Squares for Multi-omics Analysis"

knitr::opts_chunk$set(fig.width=8, fig.height=5)
options(width=96)
.format <- knitr::opts_knit$get("rmarkdown.pandoc.to")
.tag <- function(N, cap ) ifelse(.format == "html",
                                 paste("Figure", N, ":",  cap),
                                 cap)

Background

Recent years have seen the development of numerous algorithms and computational packages for the analysis of multi-omics data sets. At this point, one can find several review articles summarizing progress in the field [@subramanian2020; @graw2021; @heo2021; @picard2021; @reel2021; @vlachavas2021; @adossa2021]. As with other applications of machine learning, the kinds of problems addressed by these algorithms are divided into two categories: unsupervised (e.g., clustering or class discovery) or supervised (including class comparison and class prediction) [@simon2003]. Advances in the area of unsupervised learning have been broader and deeper than advances on supervised learning.

One of the most effective unsupervised methods is Multi-Omic Factor Analysis (MOFA) [@argelaguet2018; @argelaguet2020]. A key property of MOFA is that it does not require all omics assays to have been performed on all samples under study. In particular, it can effectively discover class structure across omics data sets even when data for many patients have only been acquired on a subset of the omics technologies. As of this writing, we do not know of any supervised multi-omics method that can effectively learn to predict outcomes when samples have only been assayed on a subset of the omics data sets.

MOFA starts with a standard method -- Latent Factor Analysis -- that is known to work well on a single omics data set. It then fits a coherent model that identifies latent factors that are common to, and able to explain the data well in, all the omics data sets under study. Our investigation (unpublished) of the factors found by MOFA suggests that, at least in come cases, it is approximately equivalent to a two-step process:

  1. Use principal components analysis to identify initial latent factors in each individual omics data set.
  2. For each pair of omics data sets, use overlapping samples to train and extend models of each factor to the union of assayed samples.

That re-interpretation of MOFA suggests that an analogous procedure might work for supervised analyses as well. In this article, we describe a two-step algorithm, which we call "plasma", to find models that can predict time-to-event outcomes on samples from multi-omics data sets even in the presence of incomplete data. We use partial least squares (PLS) for both steps, using Cox regression [@bertrand2021] to learn the single omics models and linear regression [@mishra2022] to learn how to extend models from one omics data set to another. To illustrate the method, we use a subset of the esophageal cancer (ESCA) data set from The Cancer Genome Atlas (TCGA).

Methods

Our computational method is implemented and the data are available in the plasma package.

suppressWarnings( library(plasma) )
packageVersion("plasma")

Data

The results included here are in whole or part based upon data generated by the TCGA Research Network. We downloaded the entire esophageal cancer Level 3 data set [@tcga2017] from the FireBrowse web site (http://firebrowse.org/) [@deng2017] in August 2018. We filtered the data sets so that only the most variable, and presumably the most informative, features were retained. Here, we load this sample data set.

loadESCAdata()
sapply(assemble, dim)
  1. From TCGA, we obtained 162 columns of clinical, demographic, and laboratory data on 185 patients. We removed any columns that always contained the same value. We also removed any columns whose values were missing in more than 25% of the patients. We converted categorical variables into sets of binary variables using one-hot-encoding. We then separated the clinical data into three parts:
    1. Outcome (overall survival)
    2. Binary covariates (53 columns)
    3. Continuous covariates (6 columns)
  2. Exome sequencing data for 184 patients with esophageal cancer was obtained as mutation allele format (MAF) files. We removed any gene that was mutated in fewer than 3% of the samples. The resulting data set contained 566 mutated genes.
  3. Methylation data for 185 ESCA patients was obtained as beta values computed by the TCGA from Illumina Methylation 450K microarrays. We removed any CpG site for which the standard deviation of the beta values was less than 0.3. The resulting data set contained 1,454 highly variable CpG's.
  4. Already normalized sequencing data on 2,566 microRNAs (miRs) was obtained for 185 patients. We removed any miR for which the standard deviation of normalized expression was less than 0.05, which left 926 miRs in the final data set.
  5. Already normalized sequencing data on 20,531 mRNAs was obtained in 184 patients. We removed any mRNA whose mean normalized expression was less than 6 or whose standard deviation was less than 1.2. The final data set included 2,520 mRNAs.
  6. Normalized expression data from reverse phase protein arrays (RPPA) was obtained from antibodies targeting 192 proteins in 126 patients. All data were retained for further analysis.

Finally, in order to be able to illustrate the ability of the plasma algorithm to work in the presence of missing data, we randomly selected 10% of the patients to remove from the miRSeq data set (leaving 166 patients) and 15% of the patients to remove from the mRNASeq data set (leaving 157 patients). We provide a summary of the outcome data below.

Imputation

We recommend imputing small amounts of missing data in the input data sets. The underlying issue is that the PLS models we use for individual omics data sets will not be able to make predictions on a sample if even one data point is missing. As a result, if a sample is missing at least one data point in every omics data set, then it will be impossible to use that sample at all.

For a range of available methods and R packages, consult the CRAN Task View on Missing Data. We also recommend the R-miss-tastic web site on missing data. Their simulations suggest that, for purposes of producing predictive models from omics data, the imputation method is not particularly important. Because of the latter finding, we have only implemented two simple imputation methods in the plasma package:

  1. meanModeImputer will replace any missing data by the mean value of the observed data if there are more than five distinct values; otherwise, it will replace missing data by the mode. This approach works relatively well for both continuous data and for binary or small categorical data.
  2. samplingImputer replaces missing values by sampling randomly from the observed data distribution.
set.seed(54321)
imputed <- lapply(assemble, samplingImputer)

Computational Approach

The plasma algorithm is based on Partial Least Squares (PLS), which has been shown to be an effective method for finding components that can predict clinically interesting outcomes [@bastien2015]. The workflow of the plasma algorithm is illustrated in Figure 1 in the case of three omics data sets. First, for each of the omics data sets, we apply the PLS Cox regression algorithm (plsRcox Version r packageVersion("plsRcox") [@bertrand2021]) to the time-to-event outcome data to learn three separate predictive models (indicated in red, green, and blue, respectively). Each of these models may be incomplete, since they are not defined for patients who have not been assayed (shown in white) using that particular omics technology. Second, for each pair of omics data sets, we apply the PLS linear regression algorithm (pls Version r packageVersion("pls") [@mishra2022]) to learn how to predict the coefficients of the Cox regression components from one data set using features from the other data set. This step extends (shown in pastel red, green, and blue, resp.) each of the original models, in different ways, from the intersection of samples assayed on both data sets to their union. Third, we average all of the different extended models (ignoring missing data) to get a single coherent model of component coefficients across all omics data sets. Assuming that this process has been applied to learn the model from a training data set, we can evaluate the final Cox regression model on both the training set and a test set of patient samples.

SF <- system.file("Figure/methods.png", package = "plasma")
knitr::include_graphics(SF)
rm(SF)

All computations were performed in r R.version.string of the R Statistical Software Environment [@Rbook]. Cox proportional hazards models for survival analysis were fit using version r packageVersion("survival") of the survival R package. We used additional exploratory graphical tools from version r packageVersion("beanplot") of the beanplot R package [@kampstra2008] and version r packageVersion("Polychrome") of the Polychrome R package [@coombes2019]. Gene enrichment (pathway or annotation) analysis was performed using ToppGene (https://toppgene.cchmc.org/) [@chen2009].

Terminology

Because of the layered nature of the plasma algorithm, we intend to use the following terminology to help clarify the later discussions.

  1. The input data contains a list of omics data sets.
  2. Each omics data set contains measurements of multiple features.
  3. The first step in the algorithm uses PLS Cox regression to find a set of components. Each component is a linear combination of features. The components are used as predictors in a Cox proportional hazards model, which predicts the log hazard ratio as a linear combination of components.
  4. The second step in the algorithm creates a secondary layer of components. We do not give these components a separate name. They are not an item of particular focus; we view them as a way to extend the first level components to more samples by "re-interpreting" them in other omics data sets.

Preparing the Data

To be consistent with the MOFA2 R package [@argelaguet2020], all of the data sets are arranged so that patient samples are columns and assay features are rows. Our first task is to pad each data set with appropriate NA's to ensure that each set includes the same patient samples in the same order, where that order matches the outcome data frame.

MO <- prepareMultiOmics(imputed, Outcome)
summary(MO)

We see that the number of patients in each data set is now equal to the number of patients with clinical outcome data.

Split Into Training and Test

As indicated above, we want to separate the data set into training and test samples. We will use 60% for training and 40% for testing.

set.seed(54321)
splitVec <- rbinom(nrow(Outcome), 1, 0.6)

Figure 2 presents a graphical overview of the number of samples (N) and the number of features (D) in each omics component of the training and test sets.

trainD <- MO[, splitVec == 1]
testD <- MO[, splitVec == 0]
opar <- par(mai = c(1.02, 1.32, 0.82, 0.22), mfrow = c(1,2))
plot(trainD, main = "Train")
plot(testD, main = "Test")
par(opar)
rm(opar)

Results

Individual PLS Cox Regression Models

The first step of the plasma algorithm is to fit PLS Cox models on each omics data set using the function fitCoxModels. The returned object of class MultiplePLSCoxModels contains a list of SingleModel objects, one for each assay, and within each there are three regression models:

set.seed(23258)
suppressWarnings( firstPass <- fitCoxModels(trainD, timevar = "Days",
                          eventvar = "vital_status", eventvalue = "dead") )
if (!interactive()) {
  plot(firstPass, legloc = "bottomleft") # margins too small inside RStudio window
}

On the training set, each of the seven contributing omics data sets is able to find a PLS model that can successfully separate high risk from low risk patients (Figure 3).

Extend Model Components Across Omics Data Sets

The second step of the algorithm is to extend the individual omics-based models across other omics data sets. This step is performed using the plasma function, which takes in the previously created objects of class multiOmics and MultiplePLSCoxModels. The function operates iteratively, so in our case there are seven different sets of predictions of the PLS components. These different predictions are averaged and saved internally as a data frame called meanPredictions. The structure of models created and stored in the plasma object is the same as for the separate, individual, omics models. Figure 4A shows the Kaplan-Meier plot using the predicted risk, split at the median value, on the training data set.

Independent Test Set

Now we want to see how well the final composite model generalizes to our test set. Figure 4B uses the predicted risk, split at the median of the training data, to construct a Kaplan-Meier plot on the test data. The model yields a statistically significant (p = 0.0063) separation of outcomes between the high and low risk patients.

pl <- plasma(MO, firstPass)
testpred <- predict(pl, testD)
opar <- par(mfrow=c(2,1))
plot(pl, legloc = "topright", main = "Training Data", xlab = "Time (Days)")
mtext("A", side = 2, at = 1.2, line = 3, cex = 1.5, las = 2)
plot(testpred, main="Testing Data", xlab = "Time (Days)")
mtext("B", side = 2, at = 1.2, line = 3, cex = 1.5, las = 2)
par(opar)
rm(opar)

Single Omics

We want to compare the test results of the joint omics model to the predictions we make on the test data from the separate single-omics models. The Kaplan-Meier plots in Figure 5 show that most of the individual models have poor performance on the test data, especially when compared to the results of the combined model learned jointly from all the omics data sets.

mypreds <- predict(firstPass, testD, type = "survfit")
if (!interactive()) {
  opar <- par(mfrow = c(4, 2)) #, mai = c(0.2, 0.2, 0.2, 0.2))
  sapply(names(mypreds), function(N) {
    fit <- mypreds[[N]]
    col = c("blue", "red")
    plot(fit, col = col, lwd = 2, main= N, xlab = "Time (Days)",
         ylab = "Fraction Surviving")
    legend("topright", paste(c("low", "high"), "risk"), col = col, lwd = 2)
  })
  par(opar)
}

MOFA

Next, we want to compare the plasma method to an alternative approach using MOFA. Here the idea is to first run the unsupervised MOFA algorithm to find latent factors, and then use the latent factors as potential predictors in a Cox proportional hazards model. We use the training data from our plasma analysis as inputs to MOFA2.

hasMOFA <- suppressMessages( require("MOFA2", quietly = TRUE) )
suppressMessages( mofaObject <- create_mofa(trainD@data) )

We set the options to the MOFA algorithm by indicating which data sets are binary and by choosing to find 10 latent factors.

dataoptions <- get_default_data_options(mofaObject)
modeloptions <- MOFA2::get_default_model_options(mofaObject)
modeloptions$num_factors <- NF <- 10
modeloptions[["likelihoods"]][["ClinicalBin"]] <- "bernoulli"
modeloptions[["likelihoods"]][["MAF"]] <- "bernoulli"
trainingoptions <- get_default_training_options(mofaObject)
suppressMessages( suppressWarnings(
  mofaObject <- prepare_mofa(mofaObject, data_options = dataoptions, 
                             model_options = modeloptions,
                             training_options = trainingoptions)
  )
)

Now we can run the algorithm.

suppressMessages( suppressWarnings(
  mofaESCA <- run_mofa(mofaObject, use_basilisk = TRUE)
  )
)

We want to see whether any of the latent factors are associated with overall survival.

LF <- get_factors(mofaESCA, groups = "all", factors = "all") [[1]]
LFwithOutcome <- data.frame(LF, trainD@outcome[, c("Days", "vital_status")])
osmodel <- coxph(Surv(Days, vital_status == "dead") ~ ., data = LFwithOutcome)
summary(osmodel)
AIC <- step(osmodel, trace = 0)
AIC
summary(AIC)

Factors 2, 3, and 4 are associated with survival, although the model itself is, at best, only borderline significant.

mofarisk <- survival:::predict.coxph(AIC)
traindata <- data.frame(trainD@outcome, mofarisk)
traindata$Cut <- mofarisk > 0
mofamodel <- coxph(Surv(Days, vital_status == "dead") ~ Cut, data = traindata)
Strain <- summary(mofamodel)

The hard part is to reverse-engineer the MOFA model so we can use its decompositions to make predictions on the test data.

weightmats <- get_weights(mofaESCA)
Woo <- do.call(rbind, weightmats)
testMats <- testD@data
testShift <- lapply(testMats, function(X) {
  if (length(unique(as.vector(X))) < 10) {
    X
  } else {
    t(scale(t(X), scale = FALSE))
  }
})
Yoo <- do.call(rbind, testShift)
Voo <- t(qr.solve(Woo, Yoo))
punk <- survival:::predict.coxph(AIC, as.data.frame(Voo))
testdata <- data.frame(testD@outcome, punk)
testdata$Cut <- punk > 0
testmodel <- coxph(Surv(Days, vital_status == "dead") ~ Cut, data = testdata)
Stest <- summary(testmodel)
col2 <- c("blue", "red")
opar <- par(mfrow=c(2,1))
plot(survfit(Surv(Days, vital_status == "dead") ~ Cut, data = traindata),
     col = col2, lwd = 3, main = "MOFA model, training data", xlab = "Time (Days)",
     sub = paste("Score test p-value = ", round(Strain$sctest[3], 3)))
legend("topright", c("Low Risk", "High Risk"), col = col2, lwd = 2)
mtext("A", side = 2, at = 1.2, line = 3, cex = 1.5, las = 2)
plot(survfit(Surv(Days, vital_status == "dead") ~ Cut, data = testdata),
     xlab = "Time (Days)",      col = col2, lwd = 3, main = "MOFA model, test data",
     sub = round(Stest$sctest[3], 3))
legend("topright", c("Low Risk", "High Risk"), col = col2, lwd = 2)
mtext("B", side = 2, at = 1.2, line = 3, cex = 1.5, las = 2)
par(opar)
rm(opar)

Interpretation

The plasma package includes several tools to assist with the interpretation of components and of the features contributing to the final model. These are illustrated in the "interpretation" vignette

Conclusions

We have described a method analogous to that of MOFA that allows us to combine different omics data sets without the need for massive prior imputation of missing values. A major difference is that while the MOFA model learns "factors" that are composites of the variables in an unsupervised fashion, the plasma model learns "components" that are composites of the variables in a supervised fashion, using the outcomes "event" and "time-to-event" as response variables.

Although the factors from MOFA are defined such that the first factor, Factor 1, accounts for the greatest variance in the model, the factors may or may not be significantly associated with the outcome, and a post-hoc survival analysis would need to be done to assess this. It may be the case that some factors, although they are significantly associated with outcome, account for very small variance in the final MOFA model, which hinders interpretability. This was the case with the TCGA-ESCA data set, in which, when 10 factors were learned from the MOFA model, only three factoes were significantly associated with survival, while not generalizing to the test data set. Bu contrast, the components for plasma are created in a way that maximizes the covariance in the predictors and the response, and therefore these components will be automatically associated to some degree with the outcome. This observation could be advantageous in that dissecting the weights associated with the components would yield a list of variables from different omics data sets that contribute the most to defining the outcome, and any additional analyses could be refined by looking at these high-weighted variables most closely.

We were surprised to find that most of the models from the individual data sets did not generalize well to the test data. Nevertheless, the joint model that combined the components from those data sets had good performance in predicting the risk of patient in the test data set. It is unclear why this happens. We speculate that the phenomenon may be related to the more genral idea of "bagging" weak predictors to create a more effective composite model [@boulesteix2010; @abdia2017; @hillebrand2021; @poudel2022]. MOre research will be required to determine how general this phenomenon is.

References



Try the plasma package in your browser

Any scripts or data that you put into this service are public.

plasma documentation built on May 9, 2025, 3:01 a.m.