In vittoriofortino84/ThETA: Transcriptome-based efficacy estimates of target(gene)-disease associations

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Compile transcriptome-driven efficacy estimates of target(gene)-disease associations

The R package ThETA implements two novel algorithms to identify and rank target-disease associations based on efficacy estimates compiled from gene expression profiles of gene perturbations and human diseases (Modulation Score), and tissue-specific gene expression networks (Tissue-specific Efficacy Scores). These methods are described in Failli et al. 2019 (https://www.nature.com/articles/s41598-019-46293-7).

Current functions provided by ThETA

Compile tissue-specific expression networks by using GTEx and StringDB (Human PPI).
Compile disease-relevant tissues by implementing the algorithm proposed by Kitsak et al. 2016 (https://www.nature.com/articles/srep35241).
Extract disease-relevant genes from DisGeNET and mark these genes on the disease-relevant tissue-specific gene expression.
Compile the tissue-specific efficacy scores on disease-relevant tissues.
Compile the modulation score, which estimates the likelihood of a gene perturbation (e.g., knockout and knockdown) to result in specific reversion of disease gene-expression profiles (lists of down- and up-regulated genes are downloaded from Enrichr: https://amp.pharm.mssm.edu/Enrichr/).
Integrate multiple efficacy scores with the max function and the harmonic sum (https://www.ncbi.nlm.nih.gov/pmc/articles/PMC5210543/).
Build igraph objects including tissue-specific networks and paths connecting selected drug-targets (or genes) and disease-relevant genes (it also includes info on the gene modulation scores).

1. How to compile the tissue-specific efficacy estimates of target(gene)-disease associations

In order to compile tissue-specific efficacy estimates of drug-target disease associations we need to:

Collect tissue-specific gene expression profiles from GTEX.
Define a protein-protein interaction network.
Identify disease-associated genes (from DisGeNET) and disease relevant tissues.
Compile node centrality scores

These steps are computationally expensive! Therefore, ThETA provides pre-compiled .rda files that can be used to rapidly generate tissue-specific efficacy scores.

+-----------------------------+----------------------------------------------------------------------------------------------+ | .rda file | Description | +=============================+==============================================================================================+ | gtexv7_zscore.rda | z-scores compiled from log transformed TPM expression profiles of GTEx. | +-----------------------------+----------------------------------------------------------------------------------------------+ | ppi_strdb_700.rda | human protein-protein interaction network extracted from StringDB (combined scores >= 700) | +-----------------------------+----------------------------------------------------------------------------------------------+ | dis_vrnts.rda | disease-associated genes (from DisGeNET; score >= 0.6) | +-----------------------------+----------------------------------------------------------------------------------------------+ | disease_tissue_zscores.rda | significances (z-scores) of disease-tissue associations | +-----------------------------+----------------------------------------------------------------------------------------------+ | dis_vrnts.rda | tissue-specific node centrality scores (integration of degree, clust. coeff. and betweenness | +-----------------------------+----------------------------------------------------------------------------------------------+

library(dplyr)

First, we upload the ThETA package and the 5 .rda files:

library(ThETA)
data(gtexv7_zscore)
data(ppi_strdb_700)
data(dis_vrnts)
data(disease_tissue_zscores)
data(centrality_score)

Then, given a disease (i.e. Diabetes Mellitus Type II - T2DM), we compile the tissue-specific efficacy scores.

Variant genes related to T2DM are selected from dis_vrnts by using the EFO-id.

T2DM_genes = dis_vrnts[[which(names(dis_vrnts) == "EFO:0001360")]]

Then, significant tissues for T2DM are obtained from disease_tissue_zscores.

T2DM_rel_tissue_scores = disease_tissue_zscores$z[which(rownames(disease_tissue_zscores$z) == "EFO:0001360"),]

A tissue-specific efficacy (TSE) score is then estimated for all genes that are expressed in the tissues whic are relevant for T2DM. It should be noted that the following script is computer-intensive. Indeed, we specified only two genes in input. However, it is highly recommended to use the whole set of T2D-relevant genes.

T2DM_Tscores <- tissue.specific.scores(T2DM_genes$entrez[1:2], 
                                        ppi_network = ppi_strdb_700, 
                                        directed_network = FALSE, 
                                        tissue_expr_data = gtexv7_zscore,
                                        dis_relevant_tissues = T2DM_rel_tissue_scores, 
                                        W = centrality_score$borda.disc, 
                                        cutoff = 4, verbose = TRUE)

The output is a data.frame object containing the TSE score for all genes-tissue pairs.

knitr::kable(T2DM_Tscores[1:5,], format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F)

This data.frame can be subsequently ordered based on the average of the TSE scores in order to prioritize putative gene targets (e.g. top 50 genes).

T2DM_top50 <- T2DM_Tscores[order(T2DM_Tscores$avg_tissue_score, 
                                  decreasing = TRUE)[1:50],]

The following plot shows the distribution of the TSE scores for the top 50 genes within each disease relevant tissue.

library(ggplot2)
library(reshape)
data_t2d50 <- reshape::melt(as.matrix(T2DM_top50), id = 0)
colnames(data_t2d50) <- c("EntrezID", "Tissue", "EfficacyScore")
ggplot(data_t2d50, aes(x = Tissue, y = EfficacyScore, fill = Tissue)) +
        geom_boxplot(alpha = 0.7) +
        ggtitle("Boxplot of the efficacy scores for tissues") +
        theme_bw() +
        theme(plot.title = element_text(size = 8, face = "bold"),
                text = element_text(size = 7),
                axis.title = element_text(face="bold"),
                axis.text.x=element_text(size = 7))

Additionally, the following barplot reports the tissue-specific scores for the top 5 genes within each tissue.

library(ggplot2)
library(reshape)
data_t2d5 <- reshape::melt(as.matrix(T2DM_top50[1:5,-ncol(T2DM_top50)]), id = 0)
colnames(data_t2d5) <- c("EntrezID", "Tissue", "EfficacyScore")
data_t2d5$EntrezID = factor(data_t2d5$EntrezID)
ggplot(data_t2d5, aes(x = EntrezID, y = EfficacyScore, fill = EntrezID)) +
        geom_bar(stat='identity', alpha = 0.7) +
        ggtitle("Barplot comparing the efficacy scores of genes acrsso significant tissues") +
        facet_wrap(~Tissue) +
        theme_bw() + 
        theme(plot.title = element_text(size = 8, face = "bold"),
                    text = element_text(size = 7),
                    axis.title = element_text(face="bold"),
                    axis.text.x=element_text(size = 7))

2. How to compile the modulation efficacy estimates of target(gene)-disease associations

In order to compile modulation efficacy estimates of drug-target disease associations the users need to collect lists of up- and down-regulated gene sets identified in disease and gene perturbations. Currentely, ThETA includes lists of up- or down-regulated gene sets retrieved from EnrichR (https://amp.pharm.mssm.edu/Enrichr/). However, the users could compile the modulation score based on a different set of up- and down-regulated gene sets.

data(geo_gene_sets)

The ThETA package provides a function to calculate the modulation score for all the genes (since it is not a computationaly intensive task).

modulation_scores <- modulation.score(geneSets = geo_gene_sets, verbose = TRUE)

knitr:::kable(modulation_scores[1:5, ], format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F)

The following section show how to deal with gene-disease repetitions

EnrichR provides either Disease Ontology (DO) ids or Concept Unique Identifiers (CUIs) to label disease perturbations. Use of different types of disease id may cause gene-disease pair repetitions in the final output of modulation.score (different ids might be associated with the same disease). To overcome this issue, a .csv file containing manually curated mapping between either DO ids or CUIs and EFO ids is available in the data folder.

The following code shows how to:

Cross-link the output of modulation.score with the .csv file in data.
Remove duplicated gene-disease pairs from the output.

First, DO ids or CUIs are replaced with EFO ids.

enrichr_to_efo <- read.csv(system.file("conversion_enrichr_efo.csv", 
                                        package = "ThETA"), row.names = 1,
                            stringsAsFactors = F)
modulation_scores$disease.id <- enrichr_to_efo[modulation_scores$disease.id,'disease.id']

Then, gene symbols need to be converted to Entrez Gene IDs in order to facilitate the integration between TSE and modulation scores.

library(org.Hs.eg.db)
modulation_scores$target.entrez <- AnnotationDbi::mapIds(org.Hs.eg.db, modulation_scores$target.id,'ENTREZID','SYMBOL')
modulation_scores <- modulation_scores[modulation_scores$disease.id != '' &
                                       !is.na(modulation_scores$target.entrez),]

Finally, for each gene-disease pair, only the perturbation giving the maximum score are selected (see Failli et al. 2019).

library(data.table)
modul_score <- data.table::as.data.table(modulation_scores)
modul_score <- as.data.frame(modul_score[, .SD[which.max(modscore)], 
                                           by=list(disease.id, target.entrez)])

The following example shows how to use the ppvpercents function to generate and visualize PPV-curves. It requires 5 inputs: a list of sets of known target-disese associations, a matrix containing efficacy estimates and character vectors indicating the columns for entrez gene IDs, disease IDs (EFO-id) and efficacy sccores.

data(GS_TTD)
data(GS_DBK)

vm <- ppvpercents(list(TTD=GS_TTD, BDK=GS_DBK), 
                  modul_score,
                  entrez.col = "target.entrez", 
                  disease.col = "disease.id", 
                  score.col = c("modscore"))
vm

Let's now select the modulation scores for T2D.

T2DM_Mscores = data.frame(modul_score[modul_score$disease.id=='EFO:0001360', 
                                      c("target.entrez", "modscore")], row.names = 1)

knitr:::kable(head(T2DM_Mscores), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>%
  kableExtra::scroll_box(width = "200px", height = "300px")

3. How to integrate TSE and modulation scores

The tissue-specifc and modulation scores can be combined together in order to provide a multi-evidence based ranking of disease-gene-targets.

common_t2d_genes <- intersect(rownames(T2DM_Mscores), rownames(T2DM_Tscores)) 
T2DM_Iscores <- data.frame("Mscore" = T2DM_Mscores[common_t2d_genes,], 
                            "TSEscore" = T2DM_Tscores[common_t2d_genes,],
                            row.names = common_t2d_genes)

knitr:::kable(T2DM_Iscores[1:5, c(1:3, ncol(T2DM_Iscores))], format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>%
  kableExtra::scroll_box(width = "700px", height = "250px")

Multi-evidence rankings of putative drug targets can be further extended by including efficacy scores obtained from other computational platform for drug target discovery such as Open Target platform.

The OT Platform REST API allows access to data available on the OT Platform. The following examples shows how to retrieve disease-gene association scores from the OT platform.

A typical access to the OT Platform REST API requires three inputs: name server, endpoint parameters and optional parameters.

server <- 'https://platform-api.opentargets.io/v3/platform'
endpoint_prmtrs <- '/public/association/filter'
optional_prmtrs <- '?size=10000&disease=EFO_0001360&fields=disease.id&fields=target.gene_info.symbol&fields=association_score.overall&fields=disease.efo_info.label'
uri <- paste(server,endpoint_prmtrs,optional_prmtrs,sep='')

Then, a GET request is made to pull raw data into our environment. Pulled data, in the JavaScript Object Notification (JSON) format, are subsequently converted into a usable format.

if("httr" %in% rownames(installed.packages()) == FALSE) {install.packages("httr")}
if("jsonlite" %in% rownames(installed.packages()) == FALSE) {install.packages("jsonlite")}
library(httr)
library(jsonlite)

get_association_json <- httr::content(httr::GET(uri),'text')
get_association_usable <- jsonlite::fromJSON(get_association_json, flatten = TRUE)

OT_score <- get_association_usable$data[,c(2:3,1,4)]
OT_score$disease.id <- gsub('_',':',OT_score$disease.id)
colnames(OT_score)[c(1,4)] <- c('target.id', 'disease.name')

# remove duplicated gene symbols
OT_score = OT_score[-which(duplicated(OT_score$target.id)),]

Gene symbols are then converted to Entrez Gene IDs in order to allign the OT scores with those provided by ThETA.

library(org.Hs.eg.db)
OT_score$target.entrez <- AnnotationDbi::mapIds(org.Hs.eg.db,OT_score$target.id,'ENTREZID','SYMBOL')
OT_score <- OT_score[!is.na(OT_score$target.entrez),]

knitr:::kable(head(OT_score), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>%
  kableExtra::scroll_box(width = "700px", height = "150px")

The scores obtained from the OT platform are first concatenated to the TSE and modulation scores.

all_scores <- base::merge(OT_score, T2DM_Iscores, by.x = "target.entrez", by.y = "row.names", all = TRUE)

Then, the function integrate.scores is used to provide meerged scores: harmonic sum or maximum score.

T2DM_allsc <- integrate.scores(all_scores, c("association_score.overall",
                                             "Mscore", 
                                             "TSEscore.avg_tissue_score"))
T2DM_allsc <- T2DM_allsc[order(T2DM_allsc$HS, decreasing = TRUE),]
rownames(T2DM_allsc) <- T2DM_allsc[,1]

# let's semplify the final table of the disease-gene association scores
tab_score <- T2DM_allsc[,c("target.id","association_score.overall", "Mscore", 
                           "TSEscore.avg_tissue_score", "HS","MAX")]
colnames(tab_score)[1:4] <- c("GeneTarget","OTScore","ModulationScore","TissueEfficacyScore")

knitr:::kable(head(tab_score), format = 'html') %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = F) %>%
  kableExtra::scroll_box(width = "700px", height = "250px")

4. How to visualize tissue-specific networks and biological annotations of selected drug(gene) targets

A shiny-based application was built for the visualization of tissue-specific gene networks highlighting connections between disease-genes and drug(gene)-targets.

library(shiny)
library(visNetwork)
library(org.Hs.eg.db)

visualize.graph(tissue_scores = T2DM_Tscores, 
                 disease_genes = T2DM_genes$entrez[1:5],
                 ppi_network = ppi_strdb_700, 
                 tissue_expr_data = gtexv7_zscore,
                 top_targets = rownames(T2DM_allsc)[1:5], 
                 db='BP')

The following example shows how to use the function build_tissue_specific_networks which returns

tissue-specific networks (igraph objects);
shortest-paths linking a set of gene targets (e.g. top 5 from the tissue-specific efficacy score) to known disease-genes;
a list of genes closely related to the set of the specified gene targets.

tsrwr = build.tissue.specific.networks(tissue_scores = T2DM_Tscores, disease_genes = T2DM_genes$entrez,
                                       ppi_network = ppi_strdb_700, tissue_expr_data = gtexv7_zscore, 
                                       top_targets = rownames(T2DM_top50)[1:5], verbose = FALSE)

Then, ThETA provides functions to compile

biological annotations which are significantly associated with a set of genes (by using over-representation analysis);
plots for interpreting the ORA analysis;
pubmed trend plots based on a set of gene targets.

library(org.Hs.eg.db)
T2D_ora_data_shp = generate.ora.data(tsrwr$shp[[1]], databases = "KEGG")
T2D_ora_data_rwr = generate.ora.data(tsrwr$rwr, databases = "KEGG")

T2D_ora_plot_rwr = generate.ora.plots(T2D_ora_data_rwr, set_plots = c("dotplot","cnetplot"), 
                                      showCategory = 5, font_size = 10)

figure <- ggpubr::ggarrange(plotlist = T2D_ora_plot_rwr[1:2], nrow = 2, ncol = 1, 
                            common.legend = TRUE, legend = "bottom", labels=names(T2D_ora_plot_rwr)[1:2])
figure

library(org.Hs.eg.db)
pmc_genes = as.character(AnnotationDbi::mapIds(org.Hs.eg.db, rownames(T2DM_allsc)[c(1:5)], 'SYMBOL', 'ENTREZID'))
print(pmc_genes)
T2D_pmd_plot_top = novelty.plots(pmc_genes, font_size = 14, pubmed = c(2010,2018))
T2D_pmd_plot_top

vittoriofortino84/ThETA documentation built on May 23, 2021, 4:24 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

vittoriofortino84/ThETA
Transcriptome-based efficacy estimates of target(gene)-disease associations

In vittoriofortino84/ThETA: Transcriptome-based efficacy estimates of target(gene)-disease associations

Compile transcriptome-driven efficacy estimates of target(gene)-disease associations

1. How to compile the tissue-specific efficacy estimates of target(gene)-disease associations

2. How to compile the modulation efficacy estimates of target(gene)-disease associations

3. How to integrate TSE and modulation scores

4. How to visualize tissue-specific networks and biological annotations of selected drug(gene) targets

R Package Documentation

Browse R Packages

We want your feedback!

vittoriofortino84/ThETA Transcriptome-based efficacy estimates of target(gene)-disease associations

In vittoriofortino84/ThETA: Transcriptome-based efficacy estimates of target(gene)-disease associations

Compile transcriptome-driven efficacy estimates of target(gene)-disease associations

1. How to compile the tissue-specific efficacy estimates of target(gene)-disease associations

2. How to compile the modulation efficacy estimates of target(gene)-disease associations

3. How to integrate TSE and modulation scores

4. How to visualize tissue-specific networks and biological annotations of selected drug(gene) targets

R Package Documentation

Browse R Packages

We want your feedback!

vittoriofortino84/ThETA
Transcriptome-based efficacy estimates of target(gene)-disease associations