Getting started with EpiGraphDB in R

This article is provided as a brief introductory guide to working with the EpiGraphDB platform through epigraphdb R package. Here we will demonstrate a few basic operations that can be carried out using the platform, but for more advanced methods please refer to the API endpoint documentation.

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library("dplyr")
library("epigraphdb")

Part 1: Using EpiGraphDB to obtain biological mappings

With EpiGraphDB you can map genetic variants to genes, genes to proteins, proteins to pathways, pathways to diseases and so on, as shown in the network diagram here.

In this part, we want to demonstrate how to do basic mappings between biological entities. We are going to map genes to proteins (i.e. their UniProtID), proteins to pathways that they are found in (using Reactome data), and then extract information on the specific pathways identified.

But first, let's talk about the basic querying syntax. query_epigraphdb is the main querying function in the package; it is used to communicate with EpiGraphDB by specifying API endpoints.

query_epigraphdb(
  route = endpoint, # supply the route / endpoint
  params = list(... = ...), # supply the query parameters
  mode = "table", # How the results are shown in R
  method = "GET" # HTTP method, "GET", "POST", etc.
)


In this guide, we are going to use the all-purpose query_epigraphdb function in all basic examples. However, many of the most common queries have been wrapped in specific functions within epigraphdb R package for the ease of use. Those are very helpful, but to help the understanding of the core principles behind using EpiGraphDB, here we present the ways to run queries in a less abstracted way.

Mapping genes to proteins

In this first section, we will take an arbitrary list of genes and query the EpiGraphDB API to find the proteins that they map to.

# Let's test this on a few genes
genes <- c("TP53", "BRCA1", "TNF")

# Select the endpoint "POST /mappings/gene-to-protein"
endpoint <- "/mappings/gene-to-protein"
method <- "POST"

# Build a query
proteins <- query_epigraphdb(
  route = endpoint,
  params = list(gene_name_list = genes),
  mode = "table",
  method = method
)

# Check the output
print(proteins)

In the above data frame, we see the output from querying EpiGraphDB for the proteins that have been mapped to the genes TP53, BRCA1, and TNF. Our query returned the UniProt and Ensembl IDs for those genes.

Mapping proteins to pathways

Next, to demonstrate the mapping of proteins to pathways, we are going to take one protein from the previous example, P04637, and query EpiGraphDB for all pathways it is known to be involved in.

# Let's see what pathways the protein product of TP53 gene is involved in

# NOTE: Argument `uniprot_id_list` requires a list of UniProt IDs in
#       POST /mappings/gene-to-protein.
#       In this case for the R package, we need to wrap `proteins_uniprot_ids`
#       with an `I()` function (AsIs) to prevent auto-unpacking by `httr`
proteins_uniprot_ids <- c("P04637") %>% I()

endpoint <- "/protein/in-pathway"

pathway_df <- query_epigraphdb(
  route = endpoint,
  params = list(uniprot_id_list = proteins_uniprot_ids),
  mode = "table",
  method = "POST"
)

# Check out how many pathways this protein is found in
print(pathway_df)

# Get pathways names (Reactome IDs)
print(pathway_df$pathway_reactome_id[[1]])

P04637 is involved in five pathways. Next, let's get pathway info for one of them.

Get pathway info

# Let's see what exactly this pathway is
reactome_id <- "R-HSA-6804754"

endpoint <- "/meta/nodes/Pathway/search"

pathway_info <- query_epigraphdb(
  route = endpoint,
  params = list(id = reactome_id),
  mode = "table"
)

# Pathway description
print(pathway_info)

If you are interested in this type of analysis, check out case studies 1 and 2 for further details on pathways analysis, PPI, mapping drugs to targets etc.


Running the above queries using the dedicated wrapped functions:

mappings_gene_to_protein(genes)
protein_in_pathway(proteins_uniprot_ids)

Part 2: Epidemiological relationships analysis

In this part we will demonstrate queries that may be relevant in epidemiology research.

Look up GWAS studies

First, we want to check what GWAS are available within EpiGraphDB for our trait of interest, e.g. Body mass index. Doing this query is equivalent doing a look-up using EpiGraphDB Web UI. The search functionality is fuzzy search and case insensitive, i.e. 'body mass index' or 'Body Mass Index' will give you the same set of results.

# Let's see what Body mass index GWAS are available in EpiGraphDB
trait <- "body mass index"

endpoint <- "/meta/nodes/Gwas/search"

results <- query_epigraphdb(
  route = endpoint,
  params = list(name = trait),
  mode = "table"
)

# show selected columns in the results
results %>%
  select(
    node.trait, node.id, node.sample_size,
    node.year, node.author
  )

Explore Mendelian randomization studies

In these examples, we show how to extract pre-computed MR results for the specified exposure, outcome, or both, traits of interest.

Specify exposure trait

# Look up all MR analyses where a trait was used as exposure
# and find all outcome traits with causal evidence from it.

trait1 <- "Body mass index"
# NB: here, trait name has to specific (not fuzzy):
# use the exact trait name wording as in GWAS `node.trait` (previous example)

endpoint <- "/mr"

mr_df <- query_epigraphdb(
  route = endpoint,
  params = list(
    exposure_trait = trait1,
    pval_threshold = 5e-08
  ),
  mode = "table"
)
print(mr_df)

The returned data frame includes all MR analysis with exposure.trait being "Body mass index". However, there are several GWAS with this names. If you are interested in a specific GWAS, you will need to filter by exposure.id.

# Show how many MR analyses were done for each Body mass index GWAS
mr_df %>% count(exposure.id)

Specify outcome trait

Next, we can check all available MR analyses for an outcome trait of interest.

# Look up all MR for a specified outcome trait
# and find all exposure traits with causal evidence on it.

trait2 <- "Waist circumference"

endpoint <- "/mr"

mr_df <- query_epigraphdb(
  route = endpoint,
  params = list(
    outcome_trait = trait2,
    pval_threshold = 5e-08
  ),
  mode = "table"
)
print(mr_df)

Specify both exposure and outcome traits

Finally, we can look up MR causal inference results for a pair of exposure and outcome.

# Look up a specific pair of exposure+outcome
trait1 <- "Body mass index"
trait2 <- "Coronary heart disease"

endpoint <- "/mr"

mr_df <- query_epigraphdb(
  route = endpoint,
  params = list(
    exposure_trait = trait1,
    outcome_trait = trait2
  ),
  mode = "table"
)

print(mr_df)

To query EpiGraphDB directly by GWAS ID, you will need to use the advanced functionality. See the end of this article.


Running the above MR query using a dedicated wrapped function:

mr(
  exposure_trait = trait1,
  outcome_trait = trait2
)

Part 3. Looking for literature evidence

Accessing information in the literature is a ubiquitous task in research, be it for novel hypothesis generation or as part of evidence triangulation. EpiGraphDB facilitates fast processing of this information by allowing access to a host of literature-mined relationships that have been structured into semantic triples. These take the general form (subject, predicate, object) and have been generated using contemporary natural language processing techniques applied to a massive amount of published biomedical research papers.

In the following section, we will query the API for the relationship between a given gene and a disease outcome.

# Identity all publications where a gene is mentioned with relation to a disease

gene <- "IL23R"
trait <- "Inflammatory bowel disease"

endpoint <- "/gene/literature"

lit_df <- query_epigraphdb(
  route = endpoint,
  params = list(
    gene_name = gene,
    object_name = trait
  ),
  mode = "table"
)

# Review the found evidence in the literature
print(lit_df)

The data frame above shows that IL23R has been mentioned in 25 publications (pubmed_id column) in relation to Inflammatory bowel disease, in four predicates.

# Get a list of all PubMed IDs
lit_df %>%
  pull(pubmed_id) %>%
  unlist() %>%
  unique()

If you are interested in literature mining analysis, and also matching MR results to literature evidence, please refer to more specific examples in case studies 3 and 2.

EpiGraphDB node search

EpiGraphDB stores data as nodes (data types) and edges (relationships between nodes). The available nodes can be viewed through the meta/nodes endpoint. Let's list the available nodes:

endpoint <- "/meta/nodes/list"
meta_node_fields <- query_epigraphdb(
  route = endpoint, params = NULL, mode = "raw"
)
meta_node_fields %>% unlist()

The list of nodes is also available in the documentation, along with the node properties, which are useful for running native Cypher queries. In this section, we want to show how to perform node search for a term of interest. Any node can be searched by specifying it in this query: GET /meta/nodes/{meta_node}/search.

To demonstrate this, we are going to search for 'breast cancer' in various nodes:

name <- "breast cancer"

endpoint <- "/meta/nodes/Gwas/search"
results <- query_epigraphdb(
  route = endpoint,
  params = list(name = name),
  mode = "table"
)
results %>%
  select(
    node.trait, node.id, node.sample_size,
    node.year, node.author
  )

endpoint <- "/meta/nodes/Disease/search"
results <- query_epigraphdb(
  route = endpoint,
  params = list(name = name),
  mode = "table"
)
results %>%
  select(node.label, node.id, node.definition)

endpoint <- "/meta/nodes/Efo/search"
results <- query_epigraphdb(
  route = endpoint,
  params = list(name = name),
  mode = "table"
)
results

The queries above can be further expanded using Cypher as will be shown in the next section. To get more information about meta endpoint please see the guidelines on meta functionalities.

Advanced examples

The functionalities of epigraphdb R package and the REST API of EpiGraphDB are currently limited to a certain number of API endpoints available via the query_epigraphdb function, which are simply a small and limited subset of what a graph database offers. If you would like to further customise your query, EpiGraphDB API supports using Neo4j Cypher to directly query the graph database.

To get you started, we want to show that the majority of API endpoint queries are simple wrappers around Cypher queries which directly request data from the graph database. For example, the simple GWAS query we've done in Part 2 using "table" mode, can be executed using "raw" mode to expose the exact Cypher query that was run against the database:

# Running a GWAS query from Part 2
# Ask for "raw" format (as a list)
trait <- "body mass index"
endpoint <- "/meta/nodes/Gwas/search"
response <- query_epigraphdb(
  route = endpoint,
  params = list(name = trait),
  mode = "raw"
)

# display the Cypher query
response$metadata$query

This is what a native Cypher query looks like:

query <- "
    MATCH (node: Gwas)
    WHERE node.trait =~ \"(?i).*body mass index.*\"
    RETURN node LIMIT 10;
"

This is how you supply a Cypher query to query_epigraphdb function:

# use POST /cypher

endpoint <- "/cypher"
method <- "POST"
params <- list(query = query)

results <- query_epigraphdb(
  route = endpoint,
  params = params,
  method = method,
  mode = "table"
)

# The result should be identical to the example in Part 2
results %>%
  select(
    node.trait, node.id, node.sample_size,
    node.year, node.author
  )

Next step: let's modify Cypher query

# Let's return Body mass index GWAS that were done by Locke AE

query <- "
    MATCH (node: Gwas)
    WHERE node.trait =~ \"(?i).*body mass index.*\"
    AND node.author = \"Locke AE\"
    RETURN node;
"

endpoint <- "/cypher"
method <- "POST"
params <- list(query = query)

results_subset <- query_epigraphdb(
  route = endpoint,
  params = params,
  method = method,
  mode = "table"
)

results_subset %>%
  select(
    node.trait, node.id, node.sample_size,
    node.year, node.author
  )

NOTE: Be mindful of the data type of each node property. Please refer to data dictionary to explore data types before writing native Cypher queries.


Now, let's try making queries by specifying a GWAS ID.

# Extract info only for GWAS 'ieu-a-2'
query <- "
    MATCH (node: Gwas)
    WHERE node.id = \"ieu-a-2\"
    RETURN node;
"
endpoint <- "/cypher"
params <- list(query = query)
results_subset <- query_epigraphdb(
  route = endpoint,
  params = params,
  method = "POST",
  mode = "table"
)

results_subset %>%
  select(
    node.trait, node.id, node.sample_size,
    node.year, node.author
  )
# Return MR results only for exposure trait 'ieu-a-2' (body mass index)

# first let's check the MR Cypher query that we run in "table" mode in Part 2
trait1 <- "Body mass index"
endpoint <- "/mr"
mr_df <- query_epigraphdb(
  route = endpoint,
  params = list(
    exposure_trait = trait1,
    pval_threshold = 5e-08
  ),
  mode = "raw"
)
mr_df$metadata$query

# modify the query to only return 'ieu-a-2' GWAS results
query <- "
  MATCH (exposure:Gwas)-[mr:MR_EVE_MR]->(outcome:Gwas)
  WHERE exposure.id = \"ieu-a-2\"
  AND mr.pval < 5e-08
  RETURN exposure {.id, .trait}, outcome {.id, .trait}, mr {.b, .se, .pval, .method, .selection, .moescore}
  ORDER BY mr.pval ;
"

endpoint <- "/cypher"
params <- list(query = query)
results_subset <- query_epigraphdb(
  route = endpoint,
  params = params,
  method = "POST",
  mode = "table"
)
results_subset

# check exposures in the results
results_subset %>% count(exposure.id)


Great! You can now use the basic functionality of the R package and make simple Cypher queries to the API. Next, we recommend to work through the case studies and check out the Web UI examples and the EpiGraphDB Gallery.



Try the epigraphdb package in your browser

Any scripts or data that you put into this service are public.

epigraphdb documentation built on March 29, 2021, 5:11 p.m.