knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
The R package ThETA implements two novel algorithms to identify and rank target-disease associations based on efficacy estimates compiled from gene expression profiles of gene perturbations and human diseases (Modulation Score), and tissue-specific gene expression networks (Tissue-specific Efficacy Scores). These methods are described in Failli et al. 2019 (https://www.nature.com/articles/s41598-019-46293-7).
Current functions provided by ThETA
In order to compile tissue-specific efficacy estimates of drug-target disease associations we need to:
These steps are computationally expensive! Therefore, ThETA provides pre-compiled .rda files that can be used to rapidly generate tissue-specific efficacy scores.
+-----------------------------+----------------------------------------------------------------------------------------------+ | .rda file | Description | +=============================+==============================================================================================+ | gtexv7_zscore.rda | z-scores compiled from log transformed TPM expression profiles of GTEx. | +-----------------------------+----------------------------------------------------------------------------------------------+ | ppi_strdb_700.rda | human protein-protein interaction network extracted from StringDB (combined scores >= 700) | +-----------------------------+----------------------------------------------------------------------------------------------+ | dis_vrnts.rda | disease-associated genes (from DisGeNET; score >= 0.6) | +-----------------------------+----------------------------------------------------------------------------------------------+ | disease_tissue_zscores.rda | significances (z-scores) of disease-tissue associations | +-----------------------------+----------------------------------------------------------------------------------------------+ | dis_vrnts.rda | tissue-specific node centrality scores (integration of degree, clust. coeff. and betweenness | +-----------------------------+----------------------------------------------------------------------------------------------+
library(dplyr)
First, we upload the ThETA package and the 5 .rda files:
library(ThETA) data(gtexv7_zscore) data(ppi_strdb_700) data(dis_vrnts) data(disease_tissue_zscores) data(centrality_score)
Then, given a disease (i.e. Diabetes Mellitus Type II - T2DM), we compile the tissue-specific efficacy scores.
T2DM_genes = dis_vrnts[[which(names(dis_vrnts) == "EFO:0001360")]]
T2DM_rel_tissue_scores = disease_tissue_zscores$z[which(rownames(disease_tissue_zscores$z) == "EFO:0001360"),]
T2DM_Tscores <- tissue.specific.scores(T2DM_genes$entrez[1:2], ppi_network = ppi_strdb_700, directed_network = FALSE, tissue_expr_data = gtexv7_zscore, dis_relevant_tissues = T2DM_rel_tissue_scores, W = centrality_score$borda.disc, cutoff = 4, verbose = TRUE)
The output is a data.frame object containing the TSE score for all genes-tissue pairs.
knitr::kable(T2DM_Tscores[1:5,], format = 'html') %>% kableExtra::kable_styling(bootstrap_options = "striped", full_width = F)
This data.frame can be subsequently ordered based on the average of the TSE scores in order to prioritize putative gene targets (e.g. top 50 genes).
T2DM_top50 <- T2DM_Tscores[order(T2DM_Tscores$avg_tissue_score, decreasing = TRUE)[1:50],]
The following plot shows the distribution of the TSE scores for the top 50 genes within each disease relevant tissue.
library(ggplot2) library(reshape) data_t2d50 <- reshape::melt(as.matrix(T2DM_top50), id = 0) colnames(data_t2d50) <- c("EntrezID", "Tissue", "EfficacyScore") ggplot(data_t2d50, aes(x = Tissue, y = EfficacyScore, fill = Tissue)) + geom_boxplot(alpha = 0.7) + ggtitle("Boxplot of the efficacy scores for tissues") + theme_bw() + theme(plot.title = element_text(size = 8, face = "bold"), text = element_text(size = 7), axis.title = element_text(face="bold"), axis.text.x=element_text(size = 7))
Additionally, the following barplot reports the tissue-specific scores for the top 5 genes within each tissue.
library(ggplot2) library(reshape) data_t2d5 <- reshape::melt(as.matrix(T2DM_top50[1:5,-ncol(T2DM_top50)]), id = 0) colnames(data_t2d5) <- c("EntrezID", "Tissue", "EfficacyScore") data_t2d5$EntrezID = factor(data_t2d5$EntrezID) ggplot(data_t2d5, aes(x = EntrezID, y = EfficacyScore, fill = EntrezID)) + geom_bar(stat='identity', alpha = 0.7) + ggtitle("Barplot comparing the efficacy scores of genes acrsso significant tissues") + facet_wrap(~Tissue) + theme_bw() + theme(plot.title = element_text(size = 8, face = "bold"), text = element_text(size = 7), axis.title = element_text(face="bold"), axis.text.x=element_text(size = 7))
In order to compile modulation efficacy estimates of drug-target disease associations the users need to collect lists of up- and down-regulated gene sets identified in disease and gene perturbations. Currentely, ThETA includes lists of up- or down-regulated gene sets retrieved from EnrichR (https://amp.pharm.mssm.edu/Enrichr/). However, the users could compile the modulation score based on a different set of up- and down-regulated gene sets.
data(geo_gene_sets)
The ThETA package provides a function to calculate the modulation score for all the genes (since it is not a computationaly intensive task).
modulation_scores <- modulation.score(geneSets = geo_gene_sets, verbose = TRUE)
knitr:::kable(modulation_scores[1:5, ], format = 'html') %>% kableExtra::kable_styling(bootstrap_options = "striped", full_width = F)
The following section show how to deal with gene-disease repetitions
EnrichR provides either Disease Ontology (DO) ids or Concept Unique Identifiers (CUIs) to label disease perturbations. Use of different types of disease id may cause gene-disease pair repetitions in the final output of modulation.score (different ids might be associated with the same disease). To overcome this issue, a .csv file containing manually curated mapping between either DO ids or CUIs and EFO ids is available in the data folder.
The following code shows how to:
First, DO ids or CUIs are replaced with EFO ids.
enrichr_to_efo <- read.csv(system.file("conversion_enrichr_efo.csv", package = "ThETA"), row.names = 1, stringsAsFactors = F) modulation_scores$disease.id <- enrichr_to_efo[modulation_scores$disease.id,'disease.id']
Then, gene symbols need to be converted to Entrez Gene IDs in order to facilitate the integration between TSE and modulation scores.
library(org.Hs.eg.db) modulation_scores$target.entrez <- AnnotationDbi::mapIds(org.Hs.eg.db, modulation_scores$target.id,'ENTREZID','SYMBOL') modulation_scores <- modulation_scores[modulation_scores$disease.id != '' & !is.na(modulation_scores$target.entrez),]
Finally, for each gene-disease pair, only the perturbation giving the maximum score are selected (see Failli et al. 2019).
library(data.table) modul_score <- data.table::as.data.table(modulation_scores) modul_score <- as.data.frame(modul_score[, .SD[which.max(modscore)], by=list(disease.id, target.entrez)])
The following example shows how to use the ppvpercents function to generate and visualize PPV-curves. It requires 5 inputs: a list of sets of known target-disese associations, a matrix containing efficacy estimates and character vectors indicating the columns for entrez gene IDs, disease IDs (EFO-id) and efficacy sccores.
data(GS_TTD) data(GS_DBK) vm <- ppvpercents(list(TTD=GS_TTD, BDK=GS_DBK), modul_score, entrez.col = "target.entrez", disease.col = "disease.id", score.col = c("modscore")) vm
Let's now select the modulation scores for T2D.
T2DM_Mscores = data.frame(modul_score[modul_score$disease.id=='EFO:0001360', c("target.entrez", "modscore")], row.names = 1)
knitr:::kable(head(T2DM_Mscores), format = 'html') %>% kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>% kableExtra::scroll_box(width = "200px", height = "300px")
The tissue-specifc and modulation scores can be combined together in order to provide a multi-evidence based ranking of disease-gene-targets.
common_t2d_genes <- intersect(rownames(T2DM_Mscores), rownames(T2DM_Tscores)) T2DM_Iscores <- data.frame("Mscore" = T2DM_Mscores[common_t2d_genes,], "TSEscore" = T2DM_Tscores[common_t2d_genes,], row.names = common_t2d_genes)
knitr:::kable(T2DM_Iscores[1:5, c(1:3, ncol(T2DM_Iscores))], format = 'html') %>% kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>% kableExtra::scroll_box(width = "700px", height = "250px")
Multi-evidence rankings of putative drug targets can be further extended by including efficacy scores obtained from other computational platform for drug target discovery such as Open Target platform.
The OT Platform REST API allows access to data available on the OT Platform. The following examples shows how to retrieve disease-gene association scores from the OT platform.
A typical access to the OT Platform REST API requires three inputs: name server, endpoint parameters and optional parameters.
server <- 'https://platform-api.opentargets.io/v3/platform' endpoint_prmtrs <- '/public/association/filter' optional_prmtrs <- '?size=10000&disease=EFO_0001360&fields=disease.id&fields=target.gene_info.symbol&fields=association_score.overall&fields=disease.efo_info.label' uri <- paste(server,endpoint_prmtrs,optional_prmtrs,sep='')
Then, a GET
request is made to pull raw data into our environment. Pulled data, in the JavaScript Object Notification (JSON) format, are subsequently converted into a usable format.
if("httr" %in% rownames(installed.packages()) == FALSE) {install.packages("httr")} if("jsonlite" %in% rownames(installed.packages()) == FALSE) {install.packages("jsonlite")} library(httr) library(jsonlite) get_association_json <- httr::content(httr::GET(uri),'text') get_association_usable <- jsonlite::fromJSON(get_association_json, flatten = TRUE) OT_score <- get_association_usable$data[,c(2:3,1,4)] OT_score$disease.id <- gsub('_',':',OT_score$disease.id) colnames(OT_score)[c(1,4)] <- c('target.id', 'disease.name') # remove duplicated gene symbols OT_score = OT_score[-which(duplicated(OT_score$target.id)),]
Gene symbols are then converted to Entrez Gene IDs in order to allign the OT scores with those provided by ThETA.
library(org.Hs.eg.db) OT_score$target.entrez <- AnnotationDbi::mapIds(org.Hs.eg.db,OT_score$target.id,'ENTREZID','SYMBOL') OT_score <- OT_score[!is.na(OT_score$target.entrez),]
knitr:::kable(head(OT_score), format = 'html') %>% kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>% kableExtra::scroll_box(width = "700px", height = "150px")
The scores obtained from the OT platform are first concatenated to the TSE and modulation scores.
all_scores <- base::merge(OT_score, T2DM_Iscores, by.x = "target.entrez", by.y = "row.names", all = TRUE)
Then, the function integrate.scores is used to provide meerged scores: harmonic sum or maximum score.
T2DM_allsc <- integrate.scores(all_scores, c("association_score.overall", "Mscore", "TSEscore.avg_tissue_score")) T2DM_allsc <- T2DM_allsc[order(T2DM_allsc$HS, decreasing = TRUE),] rownames(T2DM_allsc) <- T2DM_allsc[,1] # let's semplify the final table of the disease-gene association scores tab_score <- T2DM_allsc[,c("target.id","association_score.overall", "Mscore", "TSEscore.avg_tissue_score", "HS","MAX")] colnames(tab_score)[1:4] <- c("GeneTarget","OTScore","ModulationScore","TissueEfficacyScore")
knitr:::kable(head(tab_score), format = 'html') %>% kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>% kableExtra::scroll_box(width = "700px", height = "250px")
A shiny-based application was built for the visualization of tissue-specific gene networks highlighting connections between disease-genes and drug(gene)-targets.
library(shiny) library(visNetwork) library(org.Hs.eg.db) visualize.graph(tissue_scores = T2DM_Tscores, disease_genes = T2DM_genes$entrez[1:5], ppi_network = ppi_strdb_700, tissue_expr_data = gtexv7_zscore, top_targets = rownames(T2DM_allsc)[1:5], db='BP')
The following example shows how to use the function build_tissue_specific_networks which returns
tsrwr = build.tissue.specific.networks(tissue_scores = T2DM_Tscores, disease_genes = T2DM_genes$entrez, ppi_network = ppi_strdb_700, tissue_expr_data = gtexv7_zscore, top_targets = rownames(T2DM_top50)[1:5], verbose = FALSE)
Then, ThETA provides functions to compile
library(org.Hs.eg.db) T2D_ora_data_shp = generate.ora.data(tsrwr$shp[[1]], databases = "KEGG") T2D_ora_data_rwr = generate.ora.data(tsrwr$rwr, databases = "KEGG")
T2D_ora_plot_rwr = generate.ora.plots(T2D_ora_data_rwr, set_plots = c("dotplot","cnetplot"), showCategory = 5, font_size = 10)
figure <- ggpubr::ggarrange(plotlist = T2D_ora_plot_rwr[1:2], nrow = 2, ncol = 1, common.legend = TRUE, legend = "bottom", labels=names(T2D_ora_plot_rwr)[1:2]) figure
library(org.Hs.eg.db) pmc_genes = as.character(AnnotationDbi::mapIds(org.Hs.eg.db, rownames(T2DM_allsc)[c(1:5)], 'SYMBOL', 'ENTREZID')) print(pmc_genes) T2D_pmd_plot_top = novelty.plots(pmc_genes, font_size = 14, pubmed = c(2010,2018)) T2D_pmd_plot_top
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.