```r
knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
```
Similarity-driven multi-view linear reconstruction (SiMLR) is an algorithm that exploits inter-modality relationships to transform large scientific datasets into a smaller joint space. The link between the original data $X_i$ ($n \times p$) and the reduced embedding space is a sparse set of features $v_i$ ($p \times k$). Standard statistical tools may then be applied to the embeddings $s_i = X_i v_i$.
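Concretely, for a single view, the shapes compose as in the following toy sketch; the sizes here are illustrative, and in SiMLR the feature matrix $v$ is sparse and learned rather than random.

```r
# Toy illustration of the embedding map: X (n x p) times v (p x k) gives
# an embedding s (n x k); in SiMLR, v would be a learned sparse matrix
n <- 10; p <- 50; k <- 3
X <- matrix(rnorm(n * p), nrow = n)
v <- matrix(rnorm(p * k), nrow = p)  # stand-in for a learned sparse v
s <- X %*% v
dim(s)  # n x k
```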
SiMLR may be used in a variety of ways. We cover a basic example first and then turn to different use cases. These examples are less involved than the cloud-computing examples that accompany the original publication, but they illustrate newer functionality.

In this section, we explore how SiMLR can be applied to multi-view datasets. We generate synthetic data, apply SiMLR, compare the results to a traditional method, singular value decomposition (SVD), and visualize the relationships between the views. This example derives from the simulation-based evaluation in the original paper.
We begin by simulating three different views of data, each representing different modalities or datasets that share some underlying structure.
```r
library(ANTsR)
library(reshape2)
set.seed(1500)
nsub <- 100               # Number of subjects/samples
npix <- c(100, 200, 133)  # Number of features in each view
nk <- 5                   # Number of latent factors
# Generating outcome matrices for each view
outcome <- matrix(rnorm(nsub * nk), ncol = nk)
outcome1 <- matrix(rnorm(nsub * nk), ncol = nk)
outcome2 <- matrix(rnorm(nsub * nk), ncol = nk)
outcome3 <- matrix(rnorm(nsub * nk), ncol = nk)
# Generating transformation matrices for each view
view1tx <- matrix(rnorm(npix[1] * nk), nrow = nk)
view2tx <- matrix(rnorm(npix[2] * nk), nrow = nk)
view3tx <- matrix(rnorm(npix[3] * nk), nrow = nk)
# Creating the multi-view data matrices
mat1 <- (outcome %*% t(outcome1) %*% outcome1) %*% view1tx
mat2 <- (outcome %*% t(outcome2) %*% outcome2) %*% view2tx
mat3 <- (outcome %*% t(outcome3) %*% outcome3) %*% view3tx
colnames(mat1) <- paste0("m1.", 1:ncol(mat1))
colnames(mat2) <- paste0("m2.", 1:ncol(mat2))
colnames(mat3) <- paste0("m3.", 1:ncol(mat3))
# Combine the matrices into a list
matlist <- list(m1 = mat1, m2 = mat2, m3 = mat3)
```
In this code, `mat1`, `mat2`, and `mat3` represent three views of the data. Each view is constructed to share a common underlying structure but with different transformations applied.
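A quick check confirms the intended shapes: the three views share the sample dimension while differing in feature count.

```r
# Rows (samples) agree across views; column counts match npix
sapply(matlist, dim)
```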
Now we apply the SiMLR algorithm to the simulated multi-view data to find a common representation across the views.
```r
# Applying SiMLR
prepro <- c("centerAndScale", "np")
# Candidate constraint settings; only the last assignment is used below
constr <- 'Stiefelx100x1'
constr <- 'orthox100x0.1'
constr <- 'Grassmannx100x1'
result <- simlr(matlist, constraint = constr, scale = prepro,
  # mixAlg = 'pca', energyType = 'cca',        # alternative
  mixAlg = 'ica', energyType = 'regression',   # recommended
  sparsenessAlg = "spmp",
  iterations = 100, verbose = TRUE)
# Projecting data into the reduced space
p1 <- mat1 %*% result$v[[1]]
p2 <- mat2 %*% result$v[[2]]
p3 <- mat3 %*% result$v[[3]]
```
SiMLR identifies a sparse set of features that best reconstruct each view, resulting in projections `p1`, `p2`, and `p3` for each view.
To understand the effectiveness of the SiMLR projections, we can compare the correlations between the projections of different views. Additionally, we’ll compare the results to those obtained via Singular Value Decomposition (SVD).
```r
# Calculate SVD for comparison
svd1 <- svd(mat1, nu = nk, nv = 0)$u
svd2 <- svd(mat2, nu = nk, nv = 0)$u
svd3 <- svd(mat3, nu = nk, nv = 0)$u
# Calculate correlations between the SiMLR projections
cor_p1_p2 <- range(cor(p1, p2))
cor_p1_p3 <- range(cor(p1, p3))
cor_p2_p3 <- range(cor(p2, p3))
# Compare with correlations from SVD
cor_svd1_svd2 <- range(cor(svd1, svd2))
# Print the results
cat("Correlation between p1 and p2:", cor_p1_p2, "\n")
cat("Correlation between p1 and p3:", cor_p1_p3, "\n")
cat("Correlation between p2 and p3:", cor_p2_p3, "\n")
cat("Correlation between SVD1 and SVD2:", cor_svd1_svd2, "\n")
```
Visualizing these correlations helps us understand the similarity between the views after the SiMLR transformation.
```r
library(ggplot2)
# Function to plot correlation heatmaps
plot_cor_heatmap <- function(mat1, mat2, title) {
  cor_mat <- abs(cor(mat1, mat2))
  ggplot(melt(cor_mat), aes(Var1, Var2, fill = value)) +
    geom_tile() +
    scale_fill_gradient2(midpoint = 0.5, low = "blue", high = "red", mid = "white") +
    theme_minimal() +
    labs(title = title, x = "Variables", y = "Variables") +
    coord_fixed()
}
# Plotting the correlations
plot_cor_heatmap(p1, p2, "Correlation Heatmap of p1-p2")
plot_cor_heatmap(p1, p3, "Correlation Heatmap of p1-p3")
plot_cor_heatmap(p2, p3, "Correlation Heatmap of p2-p3")
```
These heatmaps show how the reduced representations from the different views correlate with one another after the SiMLR transformation. Comparing these with the analogous SVD plots below reveals whether SiMLR captures more meaningful relationships between the views.
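For that comparison, the same heatmap function can be applied to the per-view SVD bases computed earlier; since each SVD is fit independently within its view, its cross-view correlations are expected to be weaker and less structured.

```r
# Same visualization applied to the per-view SVD bases
plot_cor_heatmap(svd1, svd2, "Correlation Heatmap of svd1-svd2")
plot_cor_heatmap(svd1, svd3, "Correlation Heatmap of svd1-svd3")
plot_cor_heatmap(svd2, svd3, "Correlation Heatmap of svd2-svd3")
```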
A permutation test can assess whether the observed correlations are significant.
```r
s1 <- sample(1:nsub)
s2 <- sample(1:nsub)
permMats <- list(vox = mat1, vox2 = mat2[s1, ], vox3 = mat3[s2, ])
resultp <- simlr(permMats, constraint = constr, scale = prepro,
  sparsenessAlg = "spmp")
p1p <- mat1 %*% resultp$v[[1]]
p2p <- mat2[s1, ] %*% resultp$v[[2]]
p3p <- mat3[s2, ] %*% resultp$v[[3]]
# Compare the permuted correlations
cor_p1p_p2p <- range(cor(p1p, p2p))
cor_p1p_p3p <- range(cor(p1p, p3p))
cor_p2p_p3p <- range(cor(p2p, p3p))
# Print permuted results
cat("Permuted Correlation between p1p and p2p:", cor_p1p_p2p, "\n")
cat("Permuted Correlation between p1p and p3p:", cor_p1p_p3p, "\n")
cat("Permuted Correlation between p2p and p3p:", cor_p2p_p3p, "\n")
```
The permutation test shows whether the observed correlations are stronger than those expected by chance.
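A quick numeric version of the same comparison, using quantities already computed above; under the null, the permuted values should be markedly smaller.

```r
# Observed vs permuted maximum absolute cross-view correlations
cat("p1-p2 observed:", max(abs(cor(p1, p2))),
    "permuted:", max(abs(cor(p1p, p2p))), "\n")
cat("p1-p3 observed:", max(abs(cor(p1, p3))),
    "permuted:", max(abs(cor(p1p, p3p))), "\n")
cat("p2-p3 observed:", max(abs(cor(p2, p3))),
    "permuted:", max(abs(cor(p2p, p3p))), "\n")
```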
This introductory example demonstrates how SiMLR can uncover the shared structure across multiple views of data. The correlations between the projections indicate the degree to which the algorithm has captured this shared structure, and the permutation test serves as a validation step.
SiMLR is particularly useful when working with datasets from different modalities (e.g., genomics, proteomics, and imaging data). By finding a joint embedding space, SiMLR can integrate these diverse data types into a unified representation, facilitating downstream analyses such as clustering, classification, or regression.
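As a small illustration of such downstream use, one might cluster subjects in the joint embedding space using the projections computed above; the choice of two clusters here is arbitrary.

```r
# Cluster subjects on the concatenated SiMLR embeddings (illustrative only;
# in practice, choose the cluster count via criteria such as silhouette width)
joint_embedding <- scale(cbind(p1, p2, p3))
set.seed(1)
km <- kmeans(joint_embedding, centers = 2)
table(km$cluster)
```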
When dealing with high-dimensional data, dimensionality reduction techniques are often required to make the data more manageable and to avoid issues like the curse of dimensionality. SiMLR provides an approach that not only reduces dimensionality but also maintains the relationships between different views of the data, making it a powerful tool for exploratory data analysis and visualization. Below, we illustrate reading, writing and exploratory integrated visualization.
```r
library(fpc)
library(cluster)
library(gridExtra)
library(ggpubr)
sim2nm <- tempfile()
write_simlr_data_frames(result$v, sim2nm)
simres2 <- read_simlr_data_frames(sim2nm, names(matlist))
popdf <- data.frame(age = outcome, cog = outcome1, mat1, mat2, mat3)
temp2 <- apply_simlr_matrices(popdf, simres2, center = TRUE, scale = TRUE,
  absolute_value = rep(TRUE, length(matlist)))
simnames <- temp2[[2]]
zz <- exploratory_visualization(temp2[[1]][, temp2[[2]]], dotsne = FALSE)
print(zz$plot)
```
Now we can incorporate the "modalities" in order to predict the simulated outcome matrix. Note that contributions exist from each modality in the regression (in some cases). This is one of the advantages of SiMLR: it provides systematic guidance for building these types of integrative models.
```r
# multi-view regression
summary(lm(age.1 ~ m1PC1 + m2PC1 + m3PC1, data = temp2[[1]]))
summary(lm(age.2 ~ m1PC1 + m2PC1 + m3PC1, data = temp2[[1]]))
summary(lm(age.3 ~ m1PC1 + m2PC1 + m3PC1, data = temp2[[1]]))
summary(lm(age.4 ~ m1PC1 + m2PC1 + m3PC1, data = temp2[[1]]))
summary(lm(age.5 ~ m1PC1 + m2PC1 + m3PC1, data = temp2[[1]]))
```
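To quantify the benefit of integration, one can compare, for instance, adjusted $R^2$ between a single-modality model and the multi-view model; this comparison is illustrative and not part of the original analysis.

```r
# Single-modality vs multi-view fit for the first simulated outcome
summary(lm(age.1 ~ m1PC1, data = temp2[[1]]))$adj.r.squared
summary(lm(age.1 ~ m1PC1 + m2PC1 + m3PC1, data = temp2[[1]]))$adj.r.squared
```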
SiMLR incorporates sparse feature selection, allowing users to identify a minimal subset of features that contribute most to the joint embedding space. This can be particularly advantageous in settings where interpretability is crucial, such as biomarker discovery in biological datasets.
```r
pp <- plot_features(simres2)
grid.arrange(grobs = pp, nrow = 3, top = 'Joint features')
```
By projecting the data into a joint space, SiMLR helps improve the interpretability of machine learning models. The sparse feature selection ensures that only the most informative features are retained, making it easier to understand the relationships between the input data and the model's predictions.
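Since the feature matrices $v_i$ are sparse, a direct way to gauge interpretability is to count nonzero weights per component and inspect the top-weighted features; this minimal sketch uses the `result$v` matrices from the earlier fit.

```r
# Nonzero feature counts per SiMLR component in the first view
v1 <- data.matrix(result$v[[1]])
colSums(abs(v1) > 1e-8)
# Largest-magnitude weights for the first component
head(sort(abs(v1[, 1]), decreasing = TRUE))
```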
In scenarios where you have multiple data modalities but missing information scattered across some of the views, SiMLR can be used to impute the missing data by leveraging the relationships between the available modalities. This cross-modal prediction capability is particularly useful in multi-omics studies and other fields where complete data is often unavailable.
```r
popdfi <- popdf
proportion <- 0.1
num_NAs <- ceiling(proportion * nrow(popdfi) * ncol(popdfi))
indices <- arrayInd(sample(1:(nrow(popdfi) * ncol(popdfi)), num_NAs),
  .dim = dim(popdfi))
popdfi[indices] <- NA
nms <- names(simres2)
nsimx <- 3
sep <- '.'
imputedcols <- c()
for (n in nms) {
  for (v in 1:nsimx) {
    thiscol <- paste0(n, sep, v)
    if (any(is.na(popdfi[, thiscol]))) {
      imputedcols <- c(imputedcols, thiscol)
      popdfi <- simlr_impute(popdfi, nms, v, n, separator = sep)
    }
  }
}
impresult <- data.frame(og = popdf[, imputedcols[1]],
  imp = popdfi[, imputedcols[1]])
ggscatter(impresult, 'og', 'imp') +
  ggtitle(paste("Imputed vs original data", imputedcols[1]))
```
We use a metric (`rvcoef`) that is not directly optimized by SiMLR to assess the significance of the SiMLR result versus permuted data. The RV coefficient is analogous to the Pearson correlation coefficient but compares two matrices (or datasets) rather than two variables: it measures the similarity of the column spaces (subspaces) spanned by the two matrices. We compute `rvcoef` across all pairs of data at each permutation and compare to the original values.
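For intuition, with column-centered matrices the RV coefficient is $\operatorname{tr}(\Sigma_{XY}\Sigma_{YX}) / \sqrt{\operatorname{tr}(\Sigma_{XX}^2)\,\operatorname{tr}(\Sigma_{YY}^2)}$. Below is a minimal reference implementation of this textbook definition; the permutation test itself uses the package's `rvcoef`.

```r
# Reference RV coefficient (textbook definition; illustrative only,
# simlr.perm below is given the package's rvcoef)
rv_reference <- function(X, Y) {
  X <- scale(X, center = TRUE, scale = FALSE)
  Y <- scale(Y, center = TRUE, scale = FALSE)
  sum(crossprod(X, Y)^2) /
    sqrt(sum(crossprod(X)^2) * sum(crossprod(Y)^2))
}
rv_reference(p1, p2)
```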
```r
initu <- initializeSimlr(matlist, 3, jointReduction = TRUE)
np <- 40
myperm <- simlr.perm(matlist, constraint = constr, scale = prepro,
  initialUMatrix = initu, sparsenessAlg = "spmp",
  nperms = np, FUN = rvcoef, verbose = TRUE)
print(tail(myperm$significance, 2))
gglist <- list()
simlrmaps <- colnames(myperm$significance)[-c(1:2)]
for (xxx in simlrmaps) {
  pvec <- myperm$significance[myperm$significance$perm %in% as.character(1:np), xxx]
  original_tstat <- myperm$significance[1, xxx]  # first row holds the unpermuted value
  gglist[[length(gglist) + 1]] <- visualize_permutation_test(pvec, original_tstat, xxx)
}
grid.arrange(grobs = gglist,
  top = 'Cross-modality relationships: permutation tested')
```
```r
# display simlr results
gglist2 <- list()
for (i in 1:(length(matlist) - 1)) {
  for (j in (i + 1):length(matlist)) {
    toviz1 <- data.matrix(simres2[[i]])
    toviz2 <- data.matrix(simres2[[j]])
    temp <- visualize_lowrank_relationships(
      matlist[[i]], matlist[[j]], toviz1, toviz2,
      nm1 = names(matlist)[i], nm2 = names(matlist)[j])$plot
    gglist2[[length(gglist2) + 1]] <- temp
  }
}
grid.arrange(grobs = gglist2, nrow = 1, top = 'Low-rank correlations: SiMLR')
```
This section of the tutorial vignette provides an example of how to set up and run a parameter search using the `simlr.search` function. Here, `csearch` contains a list of constraint options that include "none" and different options for enforcing orthogonality in the feature space. This list will be used to explore different constraint options during the parameter search.
In this step, the `simlr.parameters` function is called to create a parameter list (`simparms`) that will control the grid search. Each parameter category (e.g., `nsimlr_options`, `prescaling_options`, etc.) is provided with options as a list. These parameters specify the settings and constraints that will be explored during the optimization process.
- `nsimlr_options`: The number of latent factors to consider.
- `prescaling_options`: Methods for data scaling/preprocessing (robust, whiten, np).
- `objectiver_options`: The objective function used (e.g., regression or cca).
- `mixer_options`: The mixing technique (ica or pca).
- `sparval_options`: Sparseness settings.
- `expBeta_options`: Exponential smoothing parameter beta values, e.g., 0.0, 0.9, 0.99, 0.999.
- `positivities_options`: Positivity constraints.
- `optimus_options`: The optimization strategy (lineSearch, mixed, greedy).
- `constraint_options`: Feature space orthogonality constraints to apply.
- `search_type`: The type of search to perform ("full" indicates a full grid search).
- `num_samples`: The number of samples to evaluate during the search.
```r
paster <- function(vec1, vec2) {
  paste0(rep(vec1, each = length(vec2)), vec2)
}
# the 10x10 entries are weights on the optimization term of the energy
csearch <- as.list(c("none",
  paster(c("Grassmannx", "Stiefelx", "orthox"),
         c("10x10", "100x1", "100x1", "1x1"))))
simparms <- simlr.parameters(
  nsimlr_options = list(2),
  prescaling_options = list(c("robust", "centerAndScale", "np")),
  objectiver_options = list("regression", "cca"),
  mixer_options = list("ica", "pca"),
  sparval_options = list(c(0.5, 0.5, 0.5)),
  expBeta_options = list(c(0.99)),
  positivities_options = list(c("positive", "positive", "positive")),
  optimus_options = list("lineSearch"),
  constraint_options = csearch,
  sparsenessAlg = "spmp",
  search_type = "random",
  num_samples = 5
)
```
The `regularizeSimlr` function is used to apply regularization to the input matrices `matlist`. Regularization can help prevent overfitting by penalizing complexity. The `fraction` and `sigma` parameters control the extent and type of regularization.
```r
regs <- regularizeSimlr(matlist, fraction = 0.05,
  sigma = rep(1.5, length(matlist)))
plot(image(regs[[1]]))
plot(image(regs[[2]]))
```
The `simlr.search` function performs the grid search over the parameter space defined in `simparms`. It takes the matrices `matlist` and `regs` as input. The `nperms` argument specifies the number of permutations to evaluate for robustness. The `verbose` parameter controls the level of detail in the output during the search process.
```r
simlrXs <- simlr.search(matlist, regs = regs, simparms,
  nperms = 12, verbose = 4)
simlrXs$paramsearch
```
Finally, `simlrXs$paramsearch` retrieves the results of the parameter search. This shows the performance of different parameter combinations, helping to identify the optimal settings for the SiMLR algorithm. The "best" result here will have the highest value for `final_energy`, and its parameters are stored in the `parameters` slot.
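For convenience, the best row could be extracted programmatically; the snippet below is a hypothetical sketch that assumes `paramsearch` is returned as a data frame with a `final_energy` column, so adjust the names to match your installed version's return structure.

```r
# Hypothetical: pick the parameter combination with the highest final_energy
ps <- simlrXs$paramsearch
best <- ps[which.max(ps$final_energy), ]
print(best)
```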
This section of the vignette demonstrates how to set up a comprehensive parameter search for the SiMLR algorithm. It shows how to create custom constraints, define a parameter grid, regularize the model, and execute the search. The result is an optimized set of parameters that can be used to enhance the performance of the SiMLR model on your data.
SiMLR has a path-modeling variant that lets you build the graph of connections between modalities however you like. Call the matrices A, B, and C. If you build A=>A, B=>B, C=>C, then you will learn within-modality feature sets. If you build A=>(B,C), B=>A, C=>A, then the decomposition will "focus" on A. The default graph is A=>(B,C), B=>(A,C), C=>(A,B).
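As a sketch of how such a graph might be specified, assuming the `connectors` argument of `simlr` accepts a list of index vectors, one per modality, naming the views each modality connects to; consult `?simlr` in your installed version for the exact interface.

```r
# Hypothetical "focus on A" graph for three views (A = 1, B = 2, C = 3):
# A connects to B and C, while B and C connect back to A only
focusA <- list(c(2, 3), c(1), c(1))
resultPath <- simlr(matlist, constraint = constr, scale = prepro,
  sparsenessAlg = "spmp", connectors = focusA)
```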
We provided a quick overview of potential uses of the SiMLR framework. In addition, we provided visualization hints and standards that may be of use in scientific applications of these methods.