knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Similarity-driven multi-view linear reconstruction

Introduction

Similarity-driven multi-view linear reconstruction (SiMLR) is an algorithm that exploits inter-modality relationships to transform large scientific datasets into a smaller joint space. The link between the original data $X_i$ ($n \times p$) and the reduced embedding space is a sparse set of features $v_i$ ($p \times k$). Standard statistical tools may then be applied to the embeddings $s_i = X_i v_i$.

SiMLR may be used in a variety of ways. We cover a basic example first and then follow with different use cases. These examples are less involved than the cloud computing examples in the original publication, but they illustrate newer functionality.

Example: Applying SiMLR to Simulated Multi-View Data

In this section, we apply SiMLR to multi-view datasets. We generate synthetic data, run SiMLR on it, compare the results to a traditional method, Singular Value Decomposition (SVD), and visualize the relationships between the views. This example derives from the simulation-based evaluation in the original paper.

Step 1: Simulate Multi-View Data

We begin by simulating three different views of data, each representing different modalities or datasets that share some underlying structure.

library(ANTsR)
library(reshape2)
set.seed(1500)
nsub <- 100          # Number of subjects/samples
npix <- c(100, 200, 133)  # Number of features in each view
nk <- 5             # Number of latent factors

# Generating outcome matrices for each view
outcome <- matrix(rnorm(nsub * nk), ncol = nk)
outcome1 <- matrix(rnorm(nsub * nk), ncol = nk)
outcome2 <- matrix(rnorm(nsub * nk), ncol = nk)
outcome3 <- matrix(rnorm(nsub * nk), ncol = nk)

# Generating transformation matrices for each view
view1tx <- matrix(rnorm(npix[1] * nk), nrow = nk)
view2tx <- matrix(rnorm(npix[2] * nk), nrow = nk)
view3tx <- matrix(rnorm(npix[3] * nk), nrow = nk)

# Creating the multi-view data matrices
mat1 <- outcome %*% t(outcome1) %*% outcome1 %*% view1tx
mat2 <- outcome %*% t(outcome2) %*% outcome2 %*% view2tx
mat3 <- outcome %*% t(outcome3) %*% outcome3 %*% view3tx
colnames(mat1) <- paste0("m1.", 1:ncol(mat1))
colnames(mat2) <- paste0("m2.", 1:ncol(mat2))
colnames(mat3) <- paste0("m3.", 1:ncol(mat3))

# Combine the matrices into a list
matlist <- list(m1 = mat1, m2 = mat2, m3 = mat3)

In this code, mat1, mat2, and mat3 represent three views of the data. Each view shares the common latent structure carried by outcome, but has its own random transformation applied.
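A quick sanity check confirms that the shapes line up: every view has the same number of rows (subjects) while the feature counts differ.

# Each view: same number of rows (subjects), different feature counts
sapply(matlist, dim)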

Step 2: Apply SiMLR

Now we apply the SiMLR algorithm to the simulated multi-view data to find a common representation across the views.

# Applying SiMLR
prepro <- c("centerAndScale", "np")
# Several constraint settings are possible; only the last assignment is used:
# constr <- 'Stiefelx100x1'
# constr <- 'orthox100x0.1'
constr <- 'Grassmannx100x1'
result <- simlr(matlist, constraint = constr, scale = prepro,
  # mixAlg = 'pca', energyType = 'cca',  # an alternative pairing
  mixAlg = 'ica', energyType = 'regression',  # recommended
  sparsenessAlg = "spmp",
  iterations = 100, verbose = TRUE)

# Projecting data into the reduced space
p1 <- mat1 %*% (result$v[[1]])
p2 <- mat2 %*% (result$v[[2]])
p3 <- mat3 %*% (result$v[[3]])

SiMLR identifies a sparse set of features that best reconstructs each view, yielding the projections p1, p2, and p3.
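Assuming the sparseness algorithm drives many feature weights to exactly zero, a quick check of the fraction of nonzero weights per view illustrates the sparsity:

# Fraction of nonzero feature weights in each view's v matrix
sapply(result$v, function(v) mean(v != 0))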

Step 3: Visualization and Comparison

To understand the effectiveness of the SiMLR projections, we can compare the correlations between the projections of different views. Additionally, we’ll compare the results to those obtained via Singular Value Decomposition (SVD).

# Calculate SVD for comparison
svd1 <- svd(mat1, nu = nk, nv = 0)$u
svd2 <- svd(mat2, nu = nk, nv = 0)$u
svd3 <- svd(mat3, nu = nk, nv = 0)$u

# Range of pairwise correlations between the SiMLR projections
cor_p1_p2 <- range(cor(p1, p2))
cor_p1_p3 <- range(cor(p1, p3))
cor_p2_p3 <- range(cor(p2, p3))

# Compare with the corresponding ranges for SVD
cor_svd1_svd2 <- range(cor(svd1, svd2))
cor_svd1_svd3 <- range(cor(svd1, svd3))
cor_svd2_svd3 <- range(cor(svd2, svd3))

# Print the results
cat("Correlation range between p1 and p2:", cor_p1_p2, "\n")
cat("Correlation range between p1 and p3:", cor_p1_p3, "\n")
cat("Correlation range between p2 and p3:", cor_p2_p3, "\n")
cat("Correlation range between svd1 and svd2:", cor_svd1_svd2, "\n")
cat("Correlation range between svd1 and svd3:", cor_svd1_svd3, "\n")
cat("Correlation range between svd2 and svd3:", cor_svd2_svd3, "\n")

Visualizing Correlations

Visualizing these correlations helps us understand the similarity between the views after the SiMLR transformation.

library(ggplot2)

# Function to plot correlation heatmaps
plot_cor_heatmap <- function(mat1, mat2, title) {
  cor_mat <- abs(cor(mat1,mat2))
  ggplot(melt(cor_mat), aes(Var1, Var2, fill = value)) +
    geom_tile() +
    scale_fill_gradient2(midpoint = 0.5, low = "blue", high = "red", mid = "white") +
    theme_minimal() +
    labs(title = title, x = "Variables", y = "Variables") +
    coord_fixed()
}

# Plotting the correlations
plot_cor_heatmap(p1, p2, "Correlation Heatmap of p1-p2")
plot_cor_heatmap(p1, p3, "Correlation Heatmap of p1-p3")
plot_cor_heatmap(p2, p3, "Correlation Heatmap of p2-p3")

These heatmaps show how the reduced representations from SiMLR correlate across views. Comparing them with the analogous plots for SVD reveals whether SiMLR captures more meaningful relationships between the views.
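For a direct visual comparison, the same helper can be applied to the SVD embeddings computed above:

# Analogous heatmaps for the SVD embeddings
plot_cor_heatmap(svd1, svd2, "Correlation Heatmap of svd1-svd2")
plot_cor_heatmap(svd1, svd3, "Correlation Heatmap of svd1-svd3")
plot_cor_heatmap(svd2, svd3, "Correlation Heatmap of svd2-svd3")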

Step 4: Permutation Test

A permutation test can assess whether the observed correlations are significant.

s1 <- sample(1:nsub)
s2 <- sample(1:nsub)
permMats <- list(m1 = mat1, m2 = mat2[s1, ], m3 = mat3[s2, ])
resultp <- simlr(permMats,
  constraint = constr, scale = prepro, sparsenessAlg = "spmp")
p1p <- mat1 %*% resultp$v[[1]]
p2p <- mat2[s1, ] %*% resultp$v[[2]]
p3p <- mat3[s2, ] %*% resultp$v[[3]]

# Compare the permuted correlations
cor_p1p_p2p <- range(cor(p1p, p2p))
cor_p1p_p3p <- range(cor(p1p, p3p))
cor_p2p_p3p <- range(cor(p2p, p3p))

# Print permuted results
cat("Permuted Correlation between p1p and p2p:", cor_p1p_p2p, "\n")
cat("Permuted Correlation between p1p and p3p:", cor_p1p_p3p, "\n")
cat("Permuted Correlation between p2p and p3p:", cor_p2p_p3p, "\n")

The permutation test will show if the observed correlations are stronger than those expected by chance.
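A single permutation gives just one draw from the null distribution. A minimal sketch of a fuller test, under the settings above, repeats the procedure and compares the observed statistic to the permuted ones; the simlr.perm helper used later in this vignette automates this.

# Sketch: null distribution of the maximum absolute cross-view correlation
# (view pair 1-2 shown; the other pairs work the same way)
perm_stat <- function() {
  s1 <- sample(1:nsub)
  s2 <- sample(1:nsub)
  rp <- simlr(list(m1 = mat1, m2 = mat2[s1, ], m3 = mat3[s2, ]),
              constraint = constr, scale = prepro, sparsenessAlg = "spmp")
  max(abs(cor(mat1 %*% rp$v[[1]], mat2[s1, ] %*% rp$v[[2]])))
}
# nullvals <- replicate(20, perm_stat())   # slow; increase for a real test
# observed <- max(abs(cor(p1, p2)))
# mean(nullvals >= observed)               # permutation p-value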

This introductory example demonstrates how SiMLR can uncover the shared structure across multiple views of data. The correlations between the projections indicate the degree to which the algorithm has captured this shared structure, and the permutation test serves as a validation step.

Integration of Multiple Data Types via Dimensionality Reduction

SiMLR is particularly useful when working with datasets from different modalities (e.g., genomics, proteomics, and imaging data). By finding a joint embedding space, SiMLR can integrate these diverse data types into a unified representation, facilitating downstream analyses such as clustering, classification, or regression.

When dealing with high-dimensional data, dimensionality reduction techniques are often required to make the data more manageable and to avoid issues like the curse of dimensionality. SiMLR provides an approach that not only reduces dimensionality but also maintains the relationships between different views of the data, making it a powerful tool for exploratory data analysis and visualization. Below, we illustrate reading, writing and exploratory integrated visualization.

library(fpc)
library(cluster)
library(gridExtra)
library(ggpubr)
sim2nm <- tempfile()
write_simlr_data_frames( result$v, sim2nm )
simres2 <- read_simlr_data_frames( sim2nm, names(matlist) )
popdf <- data.frame( age = outcome, cog = outcome1, mat1, mat2, mat3 )
temp2 <- apply_simlr_matrices( popdf, simres2,
    center = TRUE, scale = TRUE, absolute_value = rep(TRUE, length(matlist)) )
simnames <- temp2[[2]]
zz <- exploratory_visualization( temp2[[1]][, temp2[[2]]], dotsne = FALSE )
print( zz$plot )

Now we can incorporate the "modalities" in order to predict the simulated outcome matrix. Note that, in some cases, each modality contributes to the regression. This is one of the advantages of SiMLR: it provides systematic guidance for building such integrative models.

# multi-view regression
summary( lm( age.1 ~ m1PC1 + m2PC1 + m3PC1, data=temp2[[1]] ))
summary( lm( age.2 ~ m1PC1 + m2PC1 + m3PC1, data=temp2[[1]] ))
summary( lm( age.3 ~ m1PC1 + m2PC1 + m3PC1, data=temp2[[1]] ))
summary( lm( age.4 ~ m1PC1 + m2PC1 + m3PC1, data=temp2[[1]] ))
summary( lm( age.5 ~ m1PC1 + m2PC1 + m3PC1, data=temp2[[1]] ))

Sparse Feature Selection

SiMLR incorporates sparse feature selection, allowing users to identify a minimal subset of features that contribute most to the joint embedding space. This can be particularly advantageous in settings where interpretability is crucial, such as biomarker discovery in biological datasets.

pp <- plot_features( simres2 )
grid.arrange( grobs = pp, nrow = 3, top = 'Joint features' )

Improved Model Interpretability

By projecting the data into a joint space, SiMLR helps improve the interpretability of machine learning models. The sparse feature selection ensures that only the most informative features are retained, making it easier to understand the relationships between the input data and the model's predictions.
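For example, one can list the features that carry nonzero weight in a given component. A hedged sketch follows, assuming each entry of simres2 is a features-by-components table whose rows align with the columns of the corresponding view:

# Sketch: mat1 features with nonzero weight in component 1
w1 <- data.matrix( simres2[[1]] )
rownames(w1) <- colnames(mat1)      # assumes row order matches mat1 columns
head( rownames(w1)[ w1[, 1] != 0 ] )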

Cross-Modal Prediction or Imputation

In scenarios where you have multiple data modalities but missing information scattered across some of the views, SiMLR can be used to impute the missing data by leveraging the relationships between the available modalities. This cross-modal prediction capability is particularly useful in multi-omics studies and other fields where complete data is often unavailable.

popdfi <- popdf
proportion <- 0.1   # fraction of entries to blank out
num_NAs <- ceiling(proportion * nrow(popdfi) * ncol(popdfi))
# Scatter NA values at random positions throughout the data frame
indices <- arrayInd(sample(1:(nrow(popdfi) * ncol(popdfi)), num_NAs), .dim = dim(popdfi))
popdfi[indices] <- NA
nms <- names(simres2)
nsimx <- 3
sep <- '.'
imputedcols <- c()
# Impute any of the first nsimx columns of each view that contain NAs
for ( n in nms ) {
  for ( v in 1:nsimx ) {
    thiscol <- paste0(n, sep, v)
    if ( any( is.na( popdfi[, thiscol] ) ) ) {
      imputedcols <- c(imputedcols, thiscol)
      popdfi <- simlr_impute( popdfi, nms, v, n, separator = sep )
    }
  }
}
impresult <- data.frame( og = popdf[, imputedcols[1]], imp = popdfi[, imputedcols[1]] )
ggscatter( impresult, 'og', 'imp' ) + ggtitle( paste( "Imputed vs original data", imputedcols[1] ) )
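A correlation between the original and imputed values gives a quick numeric summary of imputation quality to accompany the scatter plot:

# Numeric summary of imputation quality for the first imputed column
cor( impresult$og, impresult$imp, use = "pairwise.complete.obs" )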

Testing Significance of Joint Relationships

To assess the significance of the SiMLR result versus permuted data, we use a metric (rvcoef) that the algorithm does not directly optimize. The RV coefficient is analogous to the Pearson correlation coefficient but compares two matrices (or datasets) rather than two variables: it measures the similarity of the subspaces spanned by their columns. We compute rvcoef across all pairs of data at each permutation and compare to the original values.
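For reference, for column-centered matrices $X$ and $Y$ the RV coefficient is $\mathrm{tr}(X X^\top Y Y^\top) / \sqrt{\mathrm{tr}((X X^\top)^2)\,\mathrm{tr}((Y Y^\top)^2)}$. A minimal sketch follows; the packaged rvcoef may differ in implementation details.

# Minimal RV coefficient sketch; ANTsR's rvcoef may differ in details
rv_sketch <- function(x, y) {
  x <- scale(x, center = TRUE, scale = FALSE)  # column-center
  y <- scale(y, center = TRUE, scale = FALSE)
  sxx <- tcrossprod(x)   # X X^T, an n x n similarity matrix
  syy <- tcrossprod(y)   # Y Y^T
  sum(sxx * syy) / sqrt(sum(sxx * sxx) * sum(syy * syy))
}
rv_sketch(p1, p2)  # subspace similarity of two embeddings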

initu <- initializeSimlr( matlist, 3, jointReduction = TRUE )
np <- 40
myperm <- simlr.perm( matlist,
  constraint = constr, scale = prepro,
  initialUMatrix = initu,
  sparsenessAlg = "spmp",
  nperms = np, FUN = rvcoef, verbose = TRUE )
print( tail( myperm$significance, 2 ) )
gglist <- list()
simlrmaps <- colnames( myperm$significance )[ -c(1:2) ]
for ( xxx in simlrmaps ) {
    pvec <- myperm$significance[ myperm$significance$perm %in% as.character(1:np), xxx ]
    original_tstat <- myperm$significance[1, xxx]  # the unpermuted (original) statistic
    gglist[[ length(gglist) + 1 ]] <- visualize_permutation_test( pvec, original_tstat, xxx )
}
grid.arrange( grobs = gglist, top = 'Cross-modality relationships: permutation tested' )
# display simlr results
gglist2 <- list()
for ( i in 1:(length(matlist) - 1) ) {
    for ( j in (i + 1):length(matlist) ) {
        toviz1 <- data.matrix( simres2[[i]] )
        toviz2 <- data.matrix( simres2[[j]] )
        temp <- visualize_lowrank_relationships( matlist[[i]], matlist[[j]],
            toviz1, toviz2,
            nm1 = names(matlist)[i], nm2 = names(matlist)[j] )$plot
        gglist2[[ length(gglist2) + 1 ]] <- temp
    }
}
grid.arrange( grobs = gglist2, nrow = 1, top = 'Low-rank correlations: SiMLR' )

Evaluating Machine Learning Objective Functions

This section of the tutorial vignette provides an example of how to set up and run a parameter search using the simlr.search function. Here, csearch contains a list of constraint options that include "none" and different options for enforcing orthogonality in the feature space. This list will be used to explore different constraint options during the parameter search.

In this step, the simlr.parameters function is called to create a parameter list (simparms) that will control the grid search. Each parameter category (e.g., nsimlr_options, prescaling_options, etc.) is provided with options as a list. These parameters specify the settings and constraints that will be explored during the optimization process.

paster <- function(vec1, vec2) {
  paste0(rep(vec1, each = length(vec2)), vec2)
}
csearch = as.list( c("none", 
  paster( c("Grassmannx","Stiefelx","orthox"), 
    c("10x10","100x1","100x1","1x1"))))
# the 10x10 are weights on the optimization term of the energy
simparms = simlr.parameters( 
    nsimlr_options = list( 2 ),
    prescaling_options = list( 
            c( "robust", "centerAndScale",  "np") ),
    objectiver_options = list("regression",'cca'),
    mixer_options=list("ica",'pca'), 
    sparval_options=list(  c(0.5,0.5,0.5) ),
    expBeta_options = list( c(0.99) ),
    positivities_options = list(
            c("positive","positive","positive")),
    optimus_options=list( 'lineSearch'),
    constraint_options=csearch,
    sparsenessAlg = "spmp",
    search_type="random",
    num_samples=5
    )

The regularizeSimlr function is used to apply regularization to the input matrices matlist. Regularization can help prevent overfitting by penalizing complexity. The fraction and sigma parameters control the extent and type of regularization.

regs = regularizeSimlr(matlist, fraction=0.05, 
  sigma=rep(1.5,length(matlist)))
plot( image( regs[[1]] ) )
plot( image( regs[[2]] ) )

The simlr.search function performs the grid search over the parameter space defined in simparms. It takes the matrices matlist and regs as input. The nperms specifies the number of permutations to evaluate for robustness. The verbose parameter controls the level of detail in the output during the search process.

simlrXs=simlr.search(
        matlist,
        regs=regs,
        simparms,
        nperms=12,
        verbose=4
        )
simlrXs$paramsearch

Finally, simlrXs$paramsearch retrieves the results of the parameter search. This will show the performance of different parameter combinations, helping to identify the optimal settings for the SiMLR algorithm. The "best" result here will have the highest value for final_energy and its parameters are stored in the parameters slot.
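Assuming paramsearch returns a tabular summary with a final_energy column, a sketch of extracting the winner follows; the exact structure may vary across ANTsR versions.

# Hedged sketch: rank searched configurations by final_energy (highest wins)
ps <- simlrXs$paramsearch
ps[ order( ps$final_energy, decreasing = TRUE )[1], ]
simlrXs$parameters   # parameters of the best configuration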

This section of the vignette demonstrates how to set up a comprehensive parameter search for the SiMLR algorithm. It shows how to create custom constraints, define a parameter grid, regularize the model, and execute the search. The result is an optimized set of parameters that can be used to enhance the performance of the SiMLR model on your data.

Path modeling

SiMLR has a path-modeling variant that lets you build the graph of connections between modalities however you like. Call the matrices A, B and C. If you build A=>A, B=>B, C=>C, then you will learn within-modality feature sets. If you build A=>(B,C), B=>A, C=>A, then the model will "focus" on A. The default graph is A=>(B,C), B=>(A,C), C=>(A,B); a sketch of the focused variant follows.
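This sketch assumes simlr's connectors argument encodes, for each view, the indices of the views it should predict; treat that encoding as an assumption to verify against the function documentation.

# Hedged sketch: focus the decomposition on view A (m1); the connectors
# encoding below (per-view lists of target view indices) is an assumption
conn_focus_A <- list( c(2, 3), c(1), c(1) )   # A=>(B,C), B=>A, C=>A
resultA <- simlr( matlist, connectors = conn_focus_A,
  constraint = constr, scale = prepro,
  sparsenessAlg = "spmp" )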

Summary

We provided a quick overview of potential uses of the SiMLR framework.
In addition, we provided visualization hints and standards that may be of use in scientific applications of these methods.


