In recent years, competing endogenous RNA (ceRNA) networks have established themselves as promising tools to identify biomarkers and inspired the formulation of new hypotheses regarding the post-transcriptional regulatory role of microRNAs (miRNAs). While different studies have proven the potential of these regulatory networks, current methods only allow the study of their global characteristics. Here, we introduce spongEffects, a novel method that infers subnetworks from pre-calculated ceRNA networks and calculates patient/sample-specific scores related to their regulatory activity. Notably, these module scores can be inferred from gene expression data alone and can thus be applied to cohorts where miRNA expression information is lacking. In this vignette, we show how spongEffects can be used for ceRNA module identification and enrichment. We do so by showcasing its use in a reduced breast cancer dataset (TCGA-BRCA). Moreover, we illustrate how the identified modules can be exploited for further downstream subtype classification tasks and identification of biologically meaningfull ceRNA modules.

spongEffects has been developed as an add-on to the SPONGE package (List et al. 2019, https://www.bioconductor.org/packages/release/bioc/html/SPONGE.html). SPONGE allows the inference of robust ceRNA networks from gene and miRNA expression data.

Further details demonstrating the implementation and use of spongEffects for stratification and biomarker identification are available in the related manuscript (currently available as preprint at: https://doi.org/10.1101/2022.03.29.486212).

(A) Loading dependencies for spongEffects

We start with loading the package and its dependencies. spongEffects core functions allow for the registration of a parallel backend (if desired).

library(SPONGE)
library(doParallel)
library(foreach)
library(dplyr)

# Register your backend here
num.of.cores <- 4
cl <- makeCluster(num.of.cores) 
registerDoParallel(cl)

(B) Formats of the inputs necessary for spongEffects

spongEffects comes with a very small example dataset useful for illustrating the functionalities of the package. The data were originally part of the TCGA-BRCA cohort. We downsized it to reduce the computational time required by this vignette.

We provide two gene and miRNA expression datasets together with related metadata, to simulate an optimal scenario where a train set and a test set are available. We also provide a small ceRNA network, gene-miRNA candidates, ceRNA network centrality measures (the ceRNA network and the centrality measures were created using the SPONGE vignette). While these are not required to run spongEffects (only gene expression and a pre-computed ceRNA network are), we want to showcase how to conduct a full downstream analysis of the ceRNA modules.

spongEffects requires a gene x sample expression matrix as input, with gene names as rownames. The type of gene identifier is up to the user but must be consistent between gene expression and ceRNA network. In this vignette we use Ensembl gene IDs.

The example datasets can be accessed once the SPONGE package is loaded:

Gene expression train set:

knitr::kable(train_cancer_gene_expr[1:5,1:8])

Gene expression tes set:

knitr::kable(test_cancer_gene_expr[1:5,1:8])

miRNA expression train set:

knitr::kable(train_cancer_mir_expr[1:5,1:8])

miRNA expression test set:

knitr::kable(test_cancer_mir_expr[1:5,1:8])

ceRNA network

We use a ceRNA network computed with SPONGE (List et al. 2019) for the TCGA-BRCA dataset. The ceRNA network was downsized for this vignette.

ceRNA networks and related descriptive statistics computed for 22 TCGA datasets can be downloaded from SPONGEdb (Hoffmann et al, 2021): https://exbio.wzw.tum.de/sponge/home

knitr::kable(train_ceRNA_interactions[1:5,1:8])

ceRNA networks downloaded from spongeDB come with centrality measures and other information specific to the downloaded network. Centrality measures are going to be used to define sponge modules.

knitr::kable(train_network_centralities[1:5,1:5])

Once calculated, spongEffects scores can be used for further downstream machine learning tasks. In this vignette, we will show how to use them to classifiy samples into different breast cancer subtypes (LuminalA, LuminalB, HER2+, Normal-like, and Basal). The user can use any metadata available for their own data.

Metadata train set

knitr::kable(train_cancer_metadata[1:5,1:8])

Metadata test set

knitr::kable(test_cancer_metadata[1:5,1:8])

(C) Filter ceRNA network and add weighted centrality measures information

Computationally inferred ceRNA networks often identify spurious associations between RNAs. Therefore, it is good practice to filter out ceRNA networks for weakly associated edges. More at List et al. 2019 and related vignette. In this example, we filter the ceRNA network for size effects (i.e. mscor) and statistical significance. Next, we want to identify the most important nodes in the network as seed genes for ceRNA modules. Various network centrality measures have been proposed in the literature to assess node importance. In the spongEffects paper and in this example, we used a weighted centrality, i.e. we consider the sum of weights of edges attached to a node rather than just the node degree. Node centralities can be computed via the sponge_node_centralities() function. The thresholds used here are purely indicative and are likely to be modified when real data are used. If you want to use the centralities of the unfiltered network, you can add your centralities to the Node_Centrality parameter of the filter_ceRNA_network() function. The centralities will be filtered and returned accordingly. If you want the centralities newly caculated based on the filtered network, you need to use the sponge_node_centralities() function on the filtered ceRNA network returned from the filter_ceRNA_network() function.

filtered_network_centralities=filter_ceRNA_network(sponge_effects = train_ceRNA_interactions, Node_Centrality = train_network_centralities,add_weighted_centrality=T, mscor.threshold = 0.01, padj.threshold = 0.1)

(D) Discover modules

Different classes of RNAs can be investigated as central nodes for the SPONGE modules. The user can identify their own class of interest, by modifying the filtering function or by using their own gene names. The data.frame loaded with the package contains gene ensemble IDs for protein coding, long non-coding, and circular RNAs as downloaded from biomaRt (version 2.50.3, see https://www.bioconductor.org/packages/release/bioc/vignettes/biomaRt/inst/doc/biomaRt.html). Finally, it is possible to determine the type of degree centrality used to rank the importance of potential central nodes. Possibilities are: degree, eigenvector, betweenness, page_rank, or Weighted_Degree

Different number of central nodes can be selected via the cutoff parameter.

RNAs <- c("lncRNA","protein_coding")
RNAs.ofInterest <- ensembl.df %>% dplyr::filter(gene_biotype %in% RNAs) %>%
  dplyr::select(ensembl_gene_id)

central_gene_modules<-get_central_modules(central_nodes = RNAs.ofInterest$ensembl_gene_id,node_centrality = filtered_network_centralities$Node_Centrality,ceRNA_class = RNAs, centrality_measure = "Weighted_Degree", cutoff = 10)

(E) Define SPONGE modules

We define SPONGE modules based on a first-neighbout approach, taking into account all classes of potential RNAs around the central nodes identified above. We give the possibility to the user to consider the central node as part of the module or not (remove.central).

Sponge.modules <- define_modules(network = filtered_network_centralities$Sponge.filtered, central.modules = central_gene_modules, remove.central = F, set.parallel = F)
# Module size distribution
Size.modules <- sapply(Sponge.modules, length)

(F) Calculate Enrichment Scores (i.e., spongEffects scores)

We implemented three different single sample gene set enrichment approaches to score the modules per sample, Overall Enrichment (OE), single-sample Gene Set Enrichment Analysis (ssGSEA), and Gene Set Variation Analysis (GSVA). The choice of the optimal method may vary by the data set but when we compared the three algorithms in the spongEffects paper we did not observe large differences.

Here, we calculate the spongEffects scores both for the train and test dataset. In a real word scenarios, these are expected to be generated separately. SPONGE modules inferred on one dataset are expected to generalize well to new cohorts, as described in the original publication.

For this example, we run the OE algorithm

train.modules <- enrichment_modules(Expr.matrix = train_cancer_gene_expr, modules = Sponge.modules, bin.size = 10, min.size = 1, max.size = 2000, min.expr = 1, method = "OE", cores=1)
test.modules <-  enrichment_modules(Expr.matrix = test_cancer_gene_expr, modules = Sponge.modules, bin.size = 10, min.size = 1, max.size = 2000, min.expr = 1, method = "OE", cores=1)

(G) Machine learning

spongEffects scores have a tabular format (module x patient) and can thus be used for downstream analysis tasks as, for example, classification or regression. As a typical application case, we show how to calibrate a subtype classification model. In particular, we train a random forest via k-fold repeated cross validation. The resulting object contains the trained model and a confusion matrix evaluated on the train set.

N.B. In order to use a validate a model calibrated with the caret package (as done here), it is necessary that the test set contains the same input features (i.e., SPONGE modules here) as the train set.

# We find modules that were identified both in the train and test and use those as input features for the model
common_modules = intersect(rownames(train.modules), rownames(test.modules))
train.modules = train.modules[common_modules, ]
test.modules = test.modules[common_modules, ]

trained.model = calibrate_model(Input = train.modules, modules_metadata = train_cancer_metadata, label = "SUBTYPE", sampleIDs = "sampleID",Metric = "Exact_match", n_folds = 2, repetitions = 1)

trained.model[["ConfusionMatrix_training"]]

As in a typical scenario, the model performances can be evaluated on the test set to test for generalization performances of the model.

Input.test <- t(test.modules) %>% scale(center = T, scale = T)
Prediction.model <- predict(trained.model$Model, Input.test)

# We compute the confusion matrix on the test set
ConfusionMatrix_testing <- caret::confusionMatrix(as.factor(Prediction.model), as.factor(test_cancer_metadata$SUBTYPE))
trained.model$ConfusionMatrix_testing<-ConfusionMatrix_testing
ConfusionMatrix_testing

(J) Define random modules and calculate spongEffects scores

In general, it is good idea to check the performance of a model calibrated on SPONGE modules against the one of a model calibrated on randomly defined modules. We offer a function that calculates random modules and related enrichment scores. The resulting modules have the same size distribution of the original ones.

# Define random modules
Random.modules <- Random_spongEffects(sponge_modules = Sponge.modules,
                                      gene_expr = train_cancer_gene_expr, min.size = 1,bin.size = 10, max.size = 200,
                                      min.expression=1, replace = F,method = "OE",cores = 1)
# We can now use the randomly defined modules to calculate their enrichment in the test set
Random.modules.test <- enrichment_modules(Expr.matrix = test_cancer_gene_expr, modules = Random.modules$Random_Modules, bin.size = 10, min.size = 1, max.size = 2000, min.expr = 1, method = "OE", cores=1)

Train classification model on randomly defined modules

# We find random modules that were identified both in the train and test and use those as input features for the model
common_modules_random = intersect(rownames(Random.modules$Enrichment_Random_Modules), rownames(Random.modules.test))
Random.modules.train = Random.modules$Enrichment_Random_Modules[common_modules_random, ]
Random.modules.test = Random.modules.test[common_modules_random, ]

Random.model = calibrate_model(Input = Random.modules.train, modules_metadata = train_cancer_metadata, label = "SUBTYPE", sampleIDs = "sampleID",Metric = "Exact_match", n_folds = 2, repetitions = 1)

Random.model[["ConfusionMatrix_training"]]

Validate classification model of randomly defined modules on the test set

Input.test <- t(Random.modules.test) %>% scale(center = T, scale = T)
Input.test<-Input.test[ , apply(Input.test, 2, function(x) !any(is.na(x)))]

Prediction.model <- predict(Random.model$Model, Input.test)

# We compute the confusion metrix on the test set
ConfusionMatrix_testing_random <- caret::confusionMatrix(as.factor(Prediction.model), as.factor(test_cancer_metadata$SUBTYPE))
Random.model$ConfusionMatrix_testing_random<-ConfusionMatrix_testing_random
ConfusionMatrix_testing_random

(K) Train model on central genes'expression

Another way to evaluate the perfomance of the model trained on spongEffects scores is to compare its performances against a model trained on the central genes alone. The idea is to investigate if the module activity that reflects the contribution of miRNA regulation offers additional insights to the expression level of the central genes alone. Thus, as a baseline mode we use only the central genes part of the modules we used to calibrate the model in step G.

Once again, we need to verify that both train and test sets contain all the central genes of interest before model calibration.

Input.centralgenes.train <- train_cancer_gene_expr[rownames(train_cancer_gene_expr) %in% names(Sponge.modules), ]
Input.centralgenes.test <- test_cancer_gene_expr[rownames(test_cancer_gene_expr) %in% names(Sponge.modules), ]

common_modules = intersect(rownames(Input.centralgenes.train), rownames(Input.centralgenes.test))
Input.centralgenes.train = Input.centralgenes.train[common_modules, ]
Input.centralgenes.test = Input.centralgenes.test[common_modules, ]

# Calibrate model
CentralGenes.model = calibrate_model(Input = Input.centralgenes.train, modules_metadata = train_cancer_metadata, label = "SUBTYPE", sampleIDs = "sampleID",Metric = "Exact_match", n_folds = 1, repetitions = 1)

# Validate on test set
Input.centralgenes.test <- t(Input.centralgenes.test) %>% scale(center = T, scale = T)
CentralGenes.prediction <- predict(CentralGenes.model$Model, Input.centralgenes.test)

# We compute the confusion metrix on the test set
ConfusionMatrix_testing <- caret::confusionMatrix(as.factor(CentralGenes.prediction), as.factor(test_cancer_metadata$SUBTYPE))
CentralGenes.model$ConfusionMatrix_testing<-ConfusionMatrix_testing
ConfusionMatrix_testing

It is possible to compare the performances of the different models

plot_accuracy_sensitivity_specificity(trained_model=trained.model,central_genes_model=NA,
                                      random_model= Random.model,
                                      training_dataset_name="TCGA",testing_dataset_name="TCGA",
                                      subtypes=as.factor(test_cancer_metadata$SUBTYPE))

(Q) Interpretation of the results

We offer here a few ways to visualize and interpret the SPONGE modules obtained via spongEffects. First, we visualize modules driving subtype prediction. These could be of interest for further validation, to identify biomarkers for the disease of interest.

lollipop_plot=plot_top_modules(trained_model=trained.model, k_modules_red = 2, k_modules = 4)
lollipop_plot

Second, we can visualize the distribution of the spongEffects scores in the different groups of interest. spongEffects scores should follow a normal distribution. Divergences from it may highlight the presence of subsets of samples with characteristics different than the ones of the class they belong to.

We show here the distribution of the spongEffects scores for the train set, dividev by the 5 breast cancer subtypes

density_plot_train=plot_density_scores(trained_model=trained.model,spongEffects = train.modules, meta_data = train_cancer_metadata, label = "SUBTYPE", sampleIDs = "sampleID")
density_plot_train

Module driving prediction can also be visualized with an heatmap. At the moment, the user can choose one layer of annotation.

heatmap.train = plot_heatmaps(trained_model = trained.model,spongEffects = train.modules,
               meta_data = train_cancer_metadata, label = "SUBTYPE", sampleIDs = "sampleID",Modules_to_Plot = 2,
              show.rownames = F, show.colnames = F)
heatmap.train
heatmap.test = plot_heatmaps(trained_model = trained.model,spongEffects = test.modules,
               meta_data = test_cancer_metadata, label = "SUBTYPE", sampleIDs = "sampleID",Modules_to_Plot = 2,
              show.rownames = F, show.colnames = F)
heatmap.test

spongEffects can be interpreted as the effect of ceRNA-ceRNA regulation and miRNA-ceRNA regulation (see publication for more details). If miRNA data are available, it is interesting to identify miRNAs that are involved in the regulation of modules driving prediction. We offer a way to visualize these with an heatmap.

plot_involved_miRNAs_to_modules(sponge_modules=Sponge.modules,
                                trained_model=trained.model,
                                gene_mirna_candidates= train_genes_miRNA_candidates,
                                k_modules = 2,
                                filter_miRNAs = 0.0,
                                bioMart_gene_symbol_columns = "hgnc_symbol",
                                bioMart_gene_ensembl = "hsapiens_gene_ensembl")

(Q) stop the cluster

#stop your backend parallelisation if registered
stopCluster(cl) 


biomedbigdata/SPONGE documentation built on Feb. 6, 2023, 10:19 p.m.