The molecular processes underlying genes, their expression and the resulting phenotypes manifest as different measurable features. Transcriptomics concerns the expression of gene-transcripts in the form of RNA, epigenomics concerns multiple aspects that affect transcription, such as DNA-methylation, genomics concerns the DNA sequence and in cancer especially its mutations, metabolomics concerns metabolites and so on. Each measurable aspect requires specialised equipment and protocols to obtain observations of each feature, but together they form a more complete characterisation of the molecular state of the sample compared to individual omics. Integrating data from multiple views is a problem that can be addressed with multi-view machine learning approaches and in fact, there already exist many multi-omic integration and analysis tools. For patient stratification the most relevant tools are joint dimensionality reduction methods and multi-view clustering methods.
In this vignette we will go over how COPS
can be used to run multi-
view clustering analysis within the same evaluation framework introduced in the
multi-cohort clustering vignette based on a single view (i.e., RNA-Seq data).
As an example we will use the TCGA breast cancer dataset which consists of
approximately 700 patients.
The multi_omic_clustering
-function works as a fixed interface to
several clustering methods and can be used to run several methods sequentially
on the same dataset. On the other hand, COPS
can be used to run
CV folds and multiple methods in parallel. However, due to the computational
requirements it may not be practical for some datasets. In such cases, it is
possible to manually separate and distribute computations for each method and
CV fold.
This vignette makes use of data outside of COPS
and therefore
requires additional packages to be installed.
if (!requireNamespace("TCGAbiolinks", quietly = TRUE)) { BiocManager::install("TCGAbiolinks") }
tabl <- " | Method | R-package | Description | Reference | |----------------|--------------|-------------------------------------------|------------| | iClusterPlus | iClusterPlus | A Bayesian latent-variable model based clustering approach that generates cluster assignment based on joint inference across data types. Uses a modified Monte Carlo Newton-Raphson algorithm for optimization. Maximum number of views: 4. | https://doi.org/10.1073/pnas.1208949110 | | iClusterBayes | iClusterPlus | A Bayesian latent-variable model based clustering approach that generates cluster assignment based on joint inference across data types. Uses Bayesian inference and the Metropolis-Hastings algorithm to sample from the latent variable posterior distribution. Maximum number of views: 6. | https://doi.org/10.1093/biostatistics/kxx017 | | IntNMF | IntNMF | An integrative approach for disease subtype classification based on non-negative matrix factorization. | https://doi.org/10.1371/journal.pone.0176278 | | MOFA2 | MOFA2 | Reconstructs a low-dimensional representation of the data using computationally efficient variational inference. Used for joint dimensionality reduction after which COPS applies single-view clustering algorithms implemented in the clustering_analysis-function. | https://doi.org/10.1186/s13059-020-02015-1 | | ANF | ANF | Based on patient affinity networks that are calculated for each omic data type and fused into one network that is used for spectral clustering. | https://doi.org/10.1016/j.ymeth.2018.05.020 | | kkmeans | COPS | Kernelized k-means implemented by COPS. Can use spectral approximation to initialize or use the global k-means algorithm to find global optimum. | | | kkmeanspp | COPS | Kernelized k-means++ implemented by COPS. Uses random initializations similar to k-means++. | | | mkkm_mr | COPS | Multiple Kernel K-Means with Matrix-induced Regularization calculates optimal weights for summing kernels such that redundancy is lowered while diversity of the selected kernels is increased. | https://doi.org/10.1609/aaai.v30i1.10249 | | ECMC | COPS | Multi-view clustering with enhanced consensus is a multiple kernel method that aims to decompose the kernels corresponding to each view into a consensus and a disagreement kernel. The consensus kernel alignment between views is optimised and the combined kernel can be clustered with kernelised k-means. | https://doi.org/10.1186/s12920-017-0306-x | " cat(tabl)
dat_list
: List of input data.frames for input.meta_data
: A data.frame or a list that includes meta data for each view. Used to append CV details in the COPS
pipeline. If it is not set, the metadata will still contain "id" which corresponds to the rownames of the first input. multi_view_methods
: Vector of algorithm names to be applied. Available options are listed in the table above.n_clusters
: Integer vector of the number of clusters to compute. distance_metric
: Distance metric for clustering factorized data (only applicable to MOFA).correlation_method
: Correlation method for distance_metric, if applicable. standardize_data
: If set, each input feature is standardized to unit variance before clustering.non_negativity_transform
: Vector of transformation method names for IntNMF to apply on each input. Options are: "logistic", "rank" or "offset2", non-matching inputs result in no transformation. NMF requires non-negative inputs, hence transorming the data may be necessary. view_distributions
: A vector specifying the distribution to use for each view. Used by iCluster+, iClusterBayes and MOFA2. Options are "gaussian", "bernoulli" and "poisson". icp_lambda
: iCluster+ L1 penalty for each view. icp_burnin
: iCluster+ number of MCMC burn in samples for approximating joint distribution of latent variables. icp_draw
: iCluster+ number of MCMC samples to draw after burn in for approximating joint distribution of latent variables. icp_maxiter
: iCluster+ maximum number of Newton-Rhapson (EM) iterations. icp_sdev
: iCluster+ MCMC random walk standard deviation. icp_eps
: iCluster+ algorithm convergence threshold. icb_burnin
: iClusteBayes number of samples for MCMC burn in. icb_draw
: iClusteBayes number of MCMC samples to draw after burn in. icb_sdev
: iClusteBayes MCMC random walk standard deviation. icb_thin
: iClusteBayes MCMC thinning, only one sample in every icb_thin samples will be used. nmf_maxiter
: Maximum number of iterations for IntNMF. nmf_st.count
: Count stability for IntNMF. nmf_n.ini
: Number of initializations for IntNMF. nmf_ini.nndsvd
: By default, IntNMF uses NNDSVD for initialization. It can be disabled, but is strongly recommended. nmf_scaling
: Determines how views are scaled for IntNMF. By default, the Frobenius norm ratio is used for scaling. A string other than the default will result in division by maximum value for each input. mofa_convergence_mode
: MOFA convergence threshold. Options: "fast" (0.0005%), "medium" (0.00005%) or "slow" (0.000005%) deltaELBO change.mofa_maxiter
: Maximum number of iterations for MOFA. mofa_environment
: If set, uses the specified Python environment (with mofapy). Defaults to a basilisk
environment managed by R.mofa_lib_path
: Path to libpython. May be required if using non-default mofa_environment
.anf_neighbors
: Number of neighbours to use in knn-graph for ANF.kernels
: Character vector of kernel names to use for each view. Options: kernels_center
: Logical vector specifying which kernels should be centered. kernels_normalize
: Logical vector specifying which kernels should be normalized.kernels_scale_norm
: Logical vector specifying which kernels should be scaled to unit F-norm. kernel_gammas
: Numeric vector specifying gamma for the Gaussian/RBF kernel: $e^{-\gamma\|\mathbf{x - x_i}\|_2}$.pathway_networks
: List of igraph
objects containing pathway networks. Required for pathway kernels.pamogk_restart
: Restart probability for PAMOGK RWR. kkmeans_maxiter
: Maximum number of iterations for kernel k-means.kkmeans_n_init
: Number of initializations for kernel k-means.mkkm_mr_lambda
: Regularization parameter for mkkm_mr
. If it is a vector, mkkm_mr
will be run for each element in the vector. mkkm_mr_tolerance
: Convergence threshold for mkkm_mr
.mkkm_mr_initialization
: If TRUE
, uses mkkm_mr
result to initialize kernel k-means, otherwise runs kernel k-means++ on combined kernel.mkkm_mr_mosek
: If TRUE
, uses Rmosek for convex optimization instead of CVXR for mkkm_mr
.ecmc_a
: Regularization parameter alpha for ECMC
. Higher values should "encourage" higher consensus.ecmc_b
: Regularization parameter beta for ECMC
Higher values should penalize disagreement more. ecmc_eps
: Convergence threshold for ECMC
ecmc_maxiter
: Maximum number of iterations for ECMC
ecmc_mkkm_mr
: If TRUE
, uses mkkm_mr
on consensus kernels obtained from ECMC
. Otherwise uses the average kernel and kernel k-means.data_is_kernels
: If TRUE
, input data is assumed to be kernel matrices. Otherwise kernels are computed based on input data and the given kernels
.zero_var_removal
: If TRUE
, removes all zero variance features from the data. mvc_threads
: Number of threads to use for supported operations.COPS
includes three kernel-based methods for pathway-based
multi-omics integration. Technically, it is also possible to use the single-view
pathway feature extraction methods and then apply a data-driven approach for
multi-omic integration. The kernel-based options are 1) "BWK" which is briefly
described below, 2) "PAMOGK" (https://doi.org/10.1093/bioinformatics/btaa655)
uses z-scores to select dysregulated genes, random walk with restart (RWR) to
expand the dysregulated gene set and shortest paths in each pathway networks to
compute and summarize multiple pathway kernels for each omic, and 3) "PIK"
(https://doi.org/10.1038/s41540-019-0086-3) uses pathway network adjacency
matrices (specifically normalized Laplacians) to define edge-based pathway kernels.
These methods generate one kernel for each pathway and input omic which are then
integrated by a multiple-kernel learning (MKL) approach. COPS
has
three possible implementations for combining kernels for clustering analysis.
The simplest is the average kernel which is theoretically equivalent to
concatenating the features. Then there is the Multiple Kernel K-Means with Matrix-
induced Regularization (https://doi.org/10.1609/aaai.v30i1.10249) which learns
optimal weights for summing the kernels and penalises redundancy between kernels.
And finally, the multi-view enhanced consensus clustering which identifies and
optimises consensus patterns between the input kernels before clustering. Note
that to use thse methods, it is necessary to install either Rmosek
or CVXR
.
The betweenness weighted kernel (BWK) is a simplification of PAMOGK, that
weighs features based on the degree of their corresponding node in a given
pathway network:
$$
k_G(\mathbf{x},\mathbf{x'}) = \left\langle \mathbf{\sqrt{g_G} \odot x,\sqrt{g_G} \odot x'} \right\rangle
$$
where $g_G$ is a vector of node betweenness centralities in $G$ that match the
features of $\mathbf{x}$ and $\odot$ is the element-wise product of two vectors.
This kernel is faster to compute than the PAMOGK, and similarly integrates pathway network information with gene-based data.
As mentioned in the introduction we will analyse the TCGA breast cancer dataset.
COPS
includes a subset of this dataset including gene-expression,
DNA-methylation and Copy Number Variants. The DNA-methylation has been summarized
for each gene so that the features can be mapped to genes for pathway-based
methods.
For datasets of this size the methods can take a long time to finish, hence we
will only demonstrate a direct application of the multi_omic_clustering
function to run the analysis once on the full dataset. As is the case with other
functions in the COPS
pipeline, this function can be called within
the complete evaluation framework including stability analysis with resampled
data (via cross-fold validation).
pipeline_start <- Sys.time() set.seed(0) mopw_clusters <- COPS::multi_omic_clustering( dat_list = COPS::brca_multi_omics, parallel = 1, multi_omic_methods = "mkkm_mr", kernels_center = TRUE, kernels_normalize = TRUE, mkkm_mr_mosek = FALSE, mkkm_mr_lambda = 1, kernels = rep("BWK", length(COPS::brca_multi_omics)), pathway_networks = COPS::pathway_networks, mvc_threads = 1, n_clusters = 2:8) print(paste("Finished clustering in", COPS:::time_taken_string(pipeline_start)))
The internal metrics can be computed with respect to each view and COPS
automatically evaluates the Silhouette for each view separately. However, we can
also run it manually, although doing so may require some post-processing for
readability.
omics_d <- list() omics_s <- list() for (o in names(COPS::brca_multi_omics)) { omics_d[[o]] <- COPS::clustering_dissimilarity_from_data( t(COPS::brca_multi_omics[[o]])) omics_s[[o]] <- COPS::clustering_metrics( mopw_clusters, dissimilarity = omics_d[[o]])$metrics omics_s[[o]]$metric <- paste0(o, "_", omics_s[[o]]$metric) omics_s[[o]] <- reshape2::dcast( omics_s[[o]], k + m ~ metric, value.var = "value") } omics_s <- Reduce(plyr::join, omics_s)
omics_s
For comparison, we can run the same method with standard linear kernels and let
COPS
perform internal evaluations:
set.seed(0) molk_res <- COPS::COPS( dat = COPS::brca_multi_omics, nruns = 2, nfolds = 5, parallel = 1, multi_omic_methods = "mkkm_mr", kernels_center = TRUE, kernels_normalize = TRUE, mkkm_mr_mosek = FALSE, mkkm_mr_lambda = 1, kernels = rep("linear", length(COPS::brca_multi_omics)), mvc_threads = 1, n_clusters = 2:8)
molk_scores <- COPS::scoring(molk_res, wsum = 1)
molk_scores$all
We can see that the silhouette scores in the original spaces are quite low for both linear kernels and the pathway kernels, albeit the former are slightly higher. In breast cancer the basal-like subtype tends to separate very well, so two clusters is expected to have the best cohesion and separation. The pathway- kernels also weight features differently from the euclidean distance used to calculate the silhouettes, so it is expected to yield worse results in this regard. But the pathway-kernels can highlight more relevant genes in terms of impact which can be evaluated in terms of survival differences between clusters, for example.
To download processed TCGA clinical data, we can use the TCGAbiolinks-package.
clinical_data <- as.data.frame(TCGAbiolinks::TCGAquery_subtype("BRCA"))
The subsample_survival_evaluation
expects a data.frame that contains the
time and an event indicator. Depending on the clinical data format, it may be
necessary to pre-process the survival data. In this case the times for event
and event-free individuals are in different columns. We can use the
survival_preprocess
to reformat this data. The event also may have
to be converted to TRUE/FALSE. TCGA also has some very long follow-up times,
we can trim these by setting a cutoff.
surv_data <- COPS::survival_preprocess( event_data = clinical_data, event_time_name = "days_to_death", follow_up_time_name = "days_to_last_followup", event_field_name = "vital_status", event_name = "Dead", event_time_cutoff = 3000)
After preprocessing survival data we can call subsample_survival_evaluation
with the clustering result (clusters
) and event information
(event_data
). The time and event columns can be named in the
survival_time_col
and survival_event_col
arguments.
The event_data
-argument should also contain
an id-column (named in the row_id
-argument) that is used to match
column names of the omics data to the event data. It may be necessary to translate
the sample ids used in the omics data into patient ids. Note that missing survival
data is automatically omitted in the evaluation. subsample_survival_evaluation
fits two Cox proportional hazard models one with cluster indicators and one without and
performs a likelihood-ratio test (LRT) to assess whether the difference between clusters
is significant while allowing the Cox model to account for other critical variables
such as age or cancer stage. Covariates must be included in event_data
and can be named in the survival_covariate_names
-argument.
The function retuns a data.frame with the LRT p-value and Harrell's concordance
index for each result.
mopw_clusters_surv <- mopw_clusters mopw_clusters_surv[["id"]] <- gsub( "-[0-9]+[A-Z]+$", "", mopw_clusters_surv[["id"]]) mopw_surv_res <- COPS::subsample_survival_evaluation( event_data = surv_data, clusters = mopw_clusters_surv, survival_time_col = "time", survival_event_col = "event", survival_covariate_names = c( "age_at_initial_pathologic_diagnosis", "pathologic_stage"), row_id = "patient")
mopw_surv_res
When used within the COPS
-pipeline the returned object will be
included in the returned list under the key "survival". The scoring
-
function can then be used to generate summaries of cross-validated results
similarly to other metrics discussed in the Psoriasis-vignette.
molk_clusters_surv <- molk_res$clusters molk_clusters_surv[["id"]] <- gsub( "-[0-9]+[A-Z]+$", "", molk_clusters_surv[["id"]]) molk_surv_res <- COPS::subsample_survival_evaluation( event_data = surv_data, clusters = molk_clusters_surv, survival_time_col = "time", survival_event_col = "event", survival_covariate_names = c( "age_at_initial_pathologic_diagnosis", "pathologic_stage"), row_id = "patient")
molk_res$survival <- molk_surv_res molk_scores <- COPS::scoring(molk_res, wsum = 1, significance_level = 0.05)
molk_scores$all[,c("m", "k", "SurvivalLRtestRR", "SurvivalConcordance")]
Based on the likelihood-ratio test it seems that the data-driven clusters are not consistently significant predictors of survival. From the stability results we saw that cluster sizes greater than 3 had average stability lower than 0.8 and variations in results between subsamples of data can explain the rejection rate. The pathway result showed slightly higher concordance index for the first few clusters, but the analysis should be repeated using subsamples within the full COPS framework to be fair.
The next step after selecting the best result based on metrics is the visualisation
and characterization of the groups. COPS
provides a handful of tools
to help select informative features and visualize them. For instance,
univariate_features
runs several test for differences between
clusters for each feature.
mopw_features <- list() for (i in names(COPS::brca_multi_omics)) { mopw_features[[i]] <- COPS::univariate_features( COPS::brca_multi_omics[[i]], mopw_clusters[mopw_clusters[["k"]] == 3, "cluster"], # same order parallel = 1) }
This gives us lists of various statistical test results and statistics. We can for example use the signal-to-noise-ratio to select interesting features:
mopw_features_selected <- list() for (i in names(COPS::brca_multi_omics)) { filter_i <- apply(abs(mopw_features[[i]][["snr"]]) > 0.75, 1, any) mopw_features_selected[[i]] <- names(which(filter_i)) }
Then, for visualizing we need to create a list of matrices containing only the relevant features.
mopw_dat_selected <- list() for (i in names(COPS::brca_multi_omics)) { if (length(mopw_features_selected[[i]]) > 0) { mopw_dat_selected[[i]] <- COPS::brca_multi_omics[[i]][mopw_features_selected[[i]],] rownames(mopw_dat_selected[[i]]) <- paste0(i, "_", rownames(mopw_dat_selected[[i]])) } }
For annotating columns we can use the clinical information that we downloaded above. Annotations need to be a data.frame with rows matching the columns of the omics data.
# We need to extract the patient IDs from the sample IDs again p_id <- gsub("-[0-9]+[A-Z]+$", "", colnames(COPS::brca_multi_omics$mrna)) hm_annotations <- clinical_data[ match(p_id, clinical_data[["patient"]]), c("age_at_initial_pathologic_diagnosis", "pathologic_stage", "BRCA_Subtype_PAM50")] colnames(hm_annotations) <- c("age","stage", "PAM50") hm_annotations[["cluster"]] <- mopw_clusters[mopw_clusters[["k"]] == 3, "cluster"]
The heatmap_annotated
wraps ComplexHeatmap::Heatmap
to conveniently visualize clustering results from one or more omics. The selected
features could be given in the feature_names
argument, but it can
be easier to modify the input data manually (e.g., renaming the features).
By default, the omics data are centered but not scaled to unit variance. The
output is highly customisable, since all extra arguments are passed to
ComplexHeatmap::Heatmap
.
COPS::heatmap_annotated( mopw_dat_selected, hm_annotations, scale = TRUE, column_split = hm_annotations[["cluster"]])
sessionInfo()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.