knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Compile transcriptome-driven efficacy estimates of target(gene)-disease associations

The R package ThETA implements two novel algorithms to identify and rank target-disease associations based on efficacy estimates compiled from gene expression profiles of gene perturbations and human diseases (Modulation Score), and tissue-specific gene expression networks (Tissue-specific Efficacy Scores). These methods are described in Failli et al. 2019 (https://www.nature.com/articles/s41598-019-46293-7).

Current functions provided by ThETA

1. How to compile the tissue-specific efficacy estimates of target(gene)-disease associations

In order to compile tissue-specific efficacy estimates of drug-target disease associations we need to:

These steps are computationally expensive! Therefore, ThETA provides pre-compiled .rda files that can be used to rapidly generate tissue-specific efficacy scores.

+-----------------------------+----------------------------------------------------------------------------------------------+ | .rda file | Description | +=============================+==============================================================================================+ | gtexv7_zscore.rda | z-scores compiled from log transformed TPM expression profiles of GTEx. | +-----------------------------+----------------------------------------------------------------------------------------------+ | ppi_strdb_700.rda | human protein-protein interaction network extracted from StringDB (combined scores >= 700) | +-----------------------------+----------------------------------------------------------------------------------------------+ | dis_vrnts.rda | disease-associated genes (from DisGeNET; score >= 0.6) | +-----------------------------+----------------------------------------------------------------------------------------------+ | disease_tissue_zscores.rda | significances (z-scores) of disease-tissue associations | +-----------------------------+----------------------------------------------------------------------------------------------+ | dis_vrnts.rda | tissue-specific node centrality scores (integration of degree, clust. coeff. and betweenness | +-----------------------------+----------------------------------------------------------------------------------------------+

library(dplyr)

First, we upload the ThETA package and the 5 .rda files:

library(ThETA)
data(gtexv7_zscore)
data(ppi_strdb_700)
data(dis_vrnts)
data(disease_tissue_zscores)
data(centrality_score)

Then, given a disease (i.e. Diabetes Mellitus Type II - T2DM), we compile the tissue-specific efficacy scores.

  1. Variant genes related to T2DM are selected from dis_vrnts by using the EFO-id.
T2DM_genes = dis_vrnts[[which(names(dis_vrnts) == "EFO:0001360")]]
  1. Then, significant tissues for T2DM are obtained from disease_tissue_zscores.
T2DM_rel_tissue_scores = disease_tissue_zscores$z[which(rownames(disease_tissue_zscores$z) == "EFO:0001360"),]
  1. A tissue-specific efficacy (TSE) score is then estimated for all genes that are expressed in the tissues whic are relevant for T2DM. It should be noted that the following script is computer-intensive. Indeed, we specified only two genes in input. However, it is highly recommended to use the whole set of T2D-relevant genes.
T2DM_Tscores <- tissue.specific.scores(T2DM_genes$entrez[1:2], 
                                        ppi_network = ppi_strdb_700, 
                                        directed_network = FALSE, 
                                        tissue_expr_data = gtexv7_zscore,
                                        dis_relevant_tissues = T2DM_rel_tissue_scores, 
                                        W = centrality_score$borda.disc, 
                                        cutoff = 4, verbose = TRUE)

The output is a data.frame object containing the TSE score for all genes-tissue pairs.

knitr::kable(T2DM_Tscores[1:5,], format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) 

This data.frame can be subsequently ordered based on the average of the TSE scores in order to prioritize putative gene targets (e.g. top 50 genes).

T2DM_top50 <- T2DM_Tscores[order(T2DM_Tscores$avg_tissue_score, 
                                  decreasing = TRUE)[1:50],]

The following plot shows the distribution of the TSE scores for the top 50 genes within each disease relevant tissue.

library(ggplot2)
library(reshape)
data_t2d50 <- reshape::melt(as.matrix(T2DM_top50), id = 0)
colnames(data_t2d50) <- c("EntrezID", "Tissue", "EfficacyScore")
ggplot(data_t2d50, aes(x = Tissue, y = EfficacyScore, fill = Tissue)) +
        geom_boxplot(alpha = 0.7) +
        ggtitle("Boxplot of the efficacy scores for tissues") +
        theme_bw() +
        theme(plot.title = element_text(size = 8, face = "bold"),
                text = element_text(size = 7),
                axis.title = element_text(face="bold"),
                axis.text.x=element_text(size = 7)) 

Additionally, the following barplot reports the tissue-specific scores for the top 5 genes within each tissue.

library(ggplot2)
library(reshape)
data_t2d5 <- reshape::melt(as.matrix(T2DM_top50[1:5,-ncol(T2DM_top50)]), id = 0)
colnames(data_t2d5) <- c("EntrezID", "Tissue", "EfficacyScore")
data_t2d5$EntrezID = factor(data_t2d5$EntrezID)
ggplot(data_t2d5, aes(x = EntrezID, y = EfficacyScore, fill = EntrezID)) +
        geom_bar(stat='identity', alpha = 0.7) +
        ggtitle("Barplot comparing the efficacy scores of genes acrsso significant tissues") +
        facet_wrap(~Tissue) +
        theme_bw() + 
        theme(plot.title = element_text(size = 8, face = "bold"),
                    text = element_text(size = 7),
                    axis.title = element_text(face="bold"),
                    axis.text.x=element_text(size = 7)) 

2. How to compile the modulation efficacy estimates of target(gene)-disease associations

In order to compile modulation efficacy estimates of drug-target disease associations the users need to collect lists of up- and down-regulated gene sets identified in disease and gene perturbations. Currentely, ThETA includes lists of up- or down-regulated gene sets retrieved from EnrichR (https://amp.pharm.mssm.edu/Enrichr/). However, the users could compile the modulation score based on a different set of up- and down-regulated gene sets.

data(geo_gene_sets)

The ThETA package provides a function to calculate the modulation score for all the genes (since it is not a computationaly intensive task).

modulation_scores <- modulation.score(geneSets = geo_gene_sets, verbose = TRUE)
knitr:::kable(modulation_scores[1:5, ], format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) 

The following section show how to deal with gene-disease repetitions

EnrichR provides either Disease Ontology (DO) ids or Concept Unique Identifiers (CUIs) to label disease perturbations. Use of different types of disease id may cause gene-disease pair repetitions in the final output of modulation.score (different ids might be associated with the same disease). To overcome this issue, a .csv file containing manually curated mapping between either DO ids or CUIs and EFO ids is available in the data folder.

The following code shows how to:

First, DO ids or CUIs are replaced with EFO ids.

enrichr_to_efo <- read.csv(system.file("conversion_enrichr_efo.csv", 
                                        package = "ThETA"), row.names = 1,
                            stringsAsFactors = F)
modulation_scores$disease.id <- enrichr_to_efo[modulation_scores$disease.id,'disease.id']

Then, gene symbols need to be converted to Entrez Gene IDs in order to facilitate the integration between TSE and modulation scores.

library(org.Hs.eg.db)
modulation_scores$target.entrez <- AnnotationDbi::mapIds(org.Hs.eg.db, modulation_scores$target.id,'ENTREZID','SYMBOL')
modulation_scores <- modulation_scores[modulation_scores$disease.id != '' &
                                       !is.na(modulation_scores$target.entrez),]

Finally, for each gene-disease pair, only the perturbation giving the maximum score are selected (see Failli et al. 2019).

library(data.table)
modul_score <- data.table::as.data.table(modulation_scores)
modul_score <- as.data.frame(modul_score[, .SD[which.max(modscore)], 
                                           by=list(disease.id, target.entrez)])

The following example shows how to use the ppvpercents function to generate and visualize PPV-curves. It requires 5 inputs: a list of sets of known target-disese associations, a matrix containing efficacy estimates and character vectors indicating the columns for entrez gene IDs, disease IDs (EFO-id) and efficacy sccores.

data(GS_TTD)
data(GS_DBK)

vm <- ppvpercents(list(TTD=GS_TTD, BDK=GS_DBK), 
                  modul_score,
                  entrez.col = "target.entrez", 
                  disease.col = "disease.id", 
                  score.col = c("modscore"))
vm

Let's now select the modulation scores for T2D.

T2DM_Mscores = data.frame(modul_score[modul_score$disease.id=='EFO:0001360', 
                                      c("target.entrez", "modscore")], row.names = 1)
knitr:::kable(head(T2DM_Mscores), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>%
  kableExtra::scroll_box(width = "200px", height = "300px")

3. How to integrate TSE and modulation scores

The tissue-specifc and modulation scores can be combined together in order to provide a multi-evidence based ranking of disease-gene-targets.

common_t2d_genes <- intersect(rownames(T2DM_Mscores), rownames(T2DM_Tscores)) 
T2DM_Iscores <- data.frame("Mscore" = T2DM_Mscores[common_t2d_genes,], 
                            "TSEscore" = T2DM_Tscores[common_t2d_genes,],
                            row.names = common_t2d_genes)
knitr:::kable(T2DM_Iscores[1:5, c(1:3, ncol(T2DM_Iscores))], format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>%
  kableExtra::scroll_box(width = "700px", height = "250px")

Multi-evidence rankings of putative drug targets can be further extended by including efficacy scores obtained from other computational platform for drug target discovery such as Open Target platform.

The OT Platform REST API allows access to data available on the OT Platform. The following examples shows how to retrieve disease-gene association scores from the OT platform.

A typical access to the OT Platform REST API requires three inputs: name server, endpoint parameters and optional parameters.

server <- 'https://platform-api.opentargets.io/v3/platform'
endpoint_prmtrs <- '/public/association/filter'
optional_prmtrs <- '?size=10000&disease=EFO_0001360&fields=disease.id&fields=target.gene_info.symbol&fields=association_score.overall&fields=disease.efo_info.label'
uri <- paste(server,endpoint_prmtrs,optional_prmtrs,sep='')

Then, a GET request is made to pull raw data into our environment. Pulled data, in the JavaScript Object Notification (JSON) format, are subsequently converted into a usable format.

if("httr" %in% rownames(installed.packages()) == FALSE) {install.packages("httr")}
if("jsonlite" %in% rownames(installed.packages()) == FALSE) {install.packages("jsonlite")}
library(httr)
library(jsonlite)

get_association_json <- httr::content(httr::GET(uri),'text')
get_association_usable <- jsonlite::fromJSON(get_association_json, flatten = TRUE)

OT_score <- get_association_usable$data[,c(2:3,1,4)]
OT_score$disease.id <- gsub('_',':',OT_score$disease.id)
colnames(OT_score)[c(1,4)] <- c('target.id', 'disease.name')

# remove duplicated gene symbols
OT_score = OT_score[-which(duplicated(OT_score$target.id)),]

Gene symbols are then converted to Entrez Gene IDs in order to allign the OT scores with those provided by ThETA.

library(org.Hs.eg.db)
OT_score$target.entrez <- AnnotationDbi::mapIds(org.Hs.eg.db,OT_score$target.id,'ENTREZID','SYMBOL')
OT_score <- OT_score[!is.na(OT_score$target.entrez),]
knitr:::kable(head(OT_score), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>%
  kableExtra::scroll_box(width = "700px", height = "150px")

The scores obtained from the OT platform are first concatenated to the TSE and modulation scores.

all_scores <- base::merge(OT_score, T2DM_Iscores, by.x = "target.entrez", by.y = "row.names", all = TRUE)

Then, the function integrate.scores is used to provide meerged scores: harmonic sum or maximum score.

T2DM_allsc <- integrate.scores(all_scores, c("association_score.overall",
                                             "Mscore", 
                                             "TSEscore.avg_tissue_score"))
T2DM_allsc <- T2DM_allsc[order(T2DM_allsc$HS, decreasing = TRUE),]
rownames(T2DM_allsc) <- T2DM_allsc[,1]

# let's semplify the final table of the disease-gene association scores
tab_score <- T2DM_allsc[,c("target.id","association_score.overall", "Mscore", 
                           "TSEscore.avg_tissue_score", "HS","MAX")]
colnames(tab_score)[1:4] <- c("GeneTarget","OTScore","ModulationScore","TissueEfficacyScore")
knitr:::kable(head(tab_score), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>%
  kableExtra::scroll_box(width = "700px", height = "250px")

4. How to visualize tissue-specific networks and biological annotations of selected drug(gene) targets

A shiny-based application was built for the visualization of tissue-specific gene networks highlighting connections between disease-genes and drug(gene)-targets.

library(shiny)
library(visNetwork)
library(org.Hs.eg.db)

visualize.graph(tissue_scores = T2DM_Tscores, 
                 disease_genes = T2DM_genes$entrez[1:5],
                 ppi_network = ppi_strdb_700, 
                 tissue_expr_data = gtexv7_zscore,
                 top_targets = rownames(T2DM_allsc)[1:5], 
                 db='BP')

The following example shows how to use the function build_tissue_specific_networks which returns

tsrwr = build.tissue.specific.networks(tissue_scores = T2DM_Tscores, disease_genes = T2DM_genes$entrez,
                                       ppi_network = ppi_strdb_700, tissue_expr_data = gtexv7_zscore, 
                                       top_targets = rownames(T2DM_top50)[1:5], verbose = FALSE)

Then, ThETA provides functions to compile

library(org.Hs.eg.db)
T2D_ora_data_shp = generate.ora.data(tsrwr$shp[[1]], databases = "KEGG")
T2D_ora_data_rwr = generate.ora.data(tsrwr$rwr, databases = "KEGG")
T2D_ora_plot_rwr = generate.ora.plots(T2D_ora_data_rwr, set_plots = c("dotplot","cnetplot"), 
                                      showCategory = 5, font_size = 10)
figure <- ggpubr::ggarrange(plotlist = T2D_ora_plot_rwr[1:2], nrow = 2, ncol = 1, 
                            common.legend = TRUE, legend = "bottom", labels=names(T2D_ora_plot_rwr)[1:2])
figure
library(org.Hs.eg.db)
pmc_genes = as.character(AnnotationDbi::mapIds(org.Hs.eg.db, rownames(T2DM_allsc)[c(1:5)], 'SYMBOL', 'ENTREZID'))
print(pmc_genes)
T2D_pmd_plot_top = novelty.plots(pmc_genes, font_size = 14, pubmed = c(2010,2018))
T2D_pmd_plot_top


vittoriofortino84/ThETA documentation built on May 23, 2021, 4:24 a.m.