Case study 3: Triangulating causal estimates with literature evidence

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

This RMarkdown document demonstrates how key elements from the notebook for case study 3 in the EpiGraphDB paper can be achieved using the R package. For detailed explanations of the case study please refer to the paper or the case study notebook.

Context

The biomedical literature contains a wealth of information than far exceeds our capacity for systematic manual extraction. For this reason, there are many existing literature mining methods to extract the key concepts and content. Here we use data from SemMedDB, a well established database that provides subject-predicate-object triples from all PubMed titles and abstracts. Using a subset of this data we created MELODI-presto (https://melodi-presto.mrcieu.ac.uk/), a method to assign triples to any given biomedical query via a PubMed search and some basic enrichment, and have applied this systematically to traits represented in EpiGraphDB. This allows us to identify overlapping terms connecting any set of GWAS traits, e.g. exposure and disease outcome. From here we can attempt to triangulate causal estimates, and conversely, check the mechanisms identified from the literature against the causal evidence.

This case study goes as follows:

library("magrittr")
library("dplyr")
library("purrr")
library("glue")
library("ggplot2")
library("igraph")
library("epigraphdb")

Here we set the starting trait, which we will use to explore associated disease traits.

STARTING_TRAIT <- "Sleep duration"

Get traits MR association

Given an exposure trait, find all traits with causal evidence. This method searches the causal evidence data for cases where our exposure trait has a potential casual effect on an outcome trait.

get_mr <- function(trait) {
  endpoint <- "/mr"
  params <- list(
    exposure_trait = trait,
    pval_threshold = 1e-10
  )
  mr_df <- query_epigraphdb(route = endpoint, params = params, mode = "table")
  mr_df
}

mr_df <- get_mr(STARTING_TRAIT)
mr_df %>% glimpse()

Map outcome traits to disease

For this example, we are interested in traits mapped to a disease node. To do this we utilise the mapping from GWAS trait to Disease via EFO term.

# NOTE: this block takes a bit of time, depending on the size of `mr_df` above.
#       We will not run the block here in package building,
#       but readers are encouraged to try it offline!
trait_to_disease <- function(trait) {
  endpoint <- "/ontology/gwas-efo-disease"
  params <- list(trait = trait)
  disease_df <- query_epigraphdb(route = endpoint, params = params, mode = "table")
  if (nrow(disease_df) > 0) {
    res <- disease_df %>% pull(`disease.label`)
  } else {
    res <- c()
  }
  res
}

outcome_disease_df <- mr_df %>%
  select(`outcome.trait`) %>%
  distinct() %>%
  mutate(disease = map(`outcome.trait`, trait_to_disease)) %>%
  filter(map_lgl(`disease`, function(x) !is.null(x)))
outcome_disease_df

Take one example to look for literature evidence

For the multiple exposure -> outcome relationships as reported from the table above, here we look at the literature evidence for one pair in detail:

The following looks for enriched triples of information (Subject-Predicate-Object) associated with our two traits. These have been derived via PubMed searches and corresponding SemMedDB data.

get_gwas_pair_literature <- function(gwas_id, assoc_gwas_id) {
  endpoint <- "/literature/gwas/pairwise"
  # NOTE in this example we blacklist to semmentic types
  params <- list(
    gwas_id = gwas_id,
    assoc_gwas_id = assoc_gwas_id,
    by_gwas_id = TRUE,
    pval_threshold = 1e-1,
    semmantic_types = "nusq",
    semmantic_types = "dsyn",
    blacklist = TRUE,
    limit = 1000
  )
  lit_df <- query_epigraphdb(route = endpoint, params = params, mode = "table")
  lit_df
}

GWAS_ID_X <- "ieu-a-1088"
GWAS_ID_Y <- "ieu-a-6"
lit_df <- get_gwas_pair_literature(GWAS_ID_X, GWAS_ID_Y)

glimpse(lit_df)
# Predicate counts for SemMed triples for trait X
lit_df %>%
  count(`s1.predicate`) %>%
  arrange(desc(n))
# Predicate counts for SemMed triples for trait Y
lit_df %>%
  count(`s2.predicate`) %>%
  arrange(desc(n))

Filter the data by predicates

Sometimes it is preferable to filter the SemMedDB data, e.g. to remove less informative Predicates, such as COEXISTS_WITH and ASSOCIATED_WITH.

# Filter out some predicates that are not informative
pred_filter <- c("COEXISTS_WITH", "ASSOCIATED_WITH")
lit_df_filter <- lit_df %>%
  filter(
    !`s1.predicate` %in% pred_filter,
    !`s2.predicate` %in% pred_filter
  )
lit_df_filter %>%
  count(`s1.predicate`) %>%
  arrange(desc(n))
lit_df_filter %>%
  count(`s2.predicate`) %>%
  arrange(desc(n))

Literature results

If we explore the full table in lit_df_filter, we can see lots of links between the two traits, pinned on specific overlapping terms.

We can summarise the SemMedDB semantic type and number of overlapping terms:

lit_counts <- lit_df_filter %>%
  count(`st.type`, `st.name`) %>%
  arrange(`st.type`, desc(`n`))
lit_counts %>% print(n = 30)

Note, the SemMedDB semantic types have been pre-filtered to only include a subset of possibilities.

Plotting the overlapping term frequencies

We can also visualise the above table as a bar chart. In this case we will remove Ethanol as it is an outlier.

lit_counts %>%
  filter(n < 100) %>%
  {
    ggplot(.) +
      aes(x = `st.name`, y = `n`) +
      geom_col() +
      geom_text(
        aes(label = `n`),
        position = position_dodge(0.9),
        hjust = 0
      ) +
      coord_flip()
  }

Look in detail at one overlapping term

Here we look at cases where leptin is the central overlapping term.

focus_term <- "leptin"
lit_detail <- lit_df_filter %>% filter(`st.name` == focus_term)
lit_detail %>% head()

We can create a network diagram to visualise these relationships.

lit_detail <- lit_detail %>%
  mutate_at(vars(`gwas.trait`, `assoc_gwas.trait`), stringr::str_to_upper)

# add node types: 1 - selected GWAS, 2 - traits from literature, 3 - current focus term connecting 1 and 2
nodes <- bind_rows(
  lit_detail %>% select(node = `gwas.trait`) %>% distinct() %>% mutate(node_type = 1),
  lit_detail %>% select(node = `assoc_gwas.trait`) %>% distinct() %>% mutate(node_type = 1),
  lit_detail %>% select(node = `st1.name`) %>% distinct() %>% mutate(node_type = 2),
  lit_detail %>% select(node = `st2.name`) %>% distinct() %>% mutate(node_type = 2),
  lit_detail %>% select(node = `st.name`) %>% distinct() %>% mutate(node_type = 3),
) %>% distinct()
nodes
edges <- bind_rows(
  # exposure -> s1 subject
  lit_detail %>%
    select(node = `gwas.trait`, assoc_node = `st1.name`) %>%
    distinct(),
  # s2 object -> outcome
  lit_detail %>%
    select(node = `st2.name`, assoc_node = `assoc_gwas.trait`) %>%
    distinct(),
  # s1 subject - s1 predicate -> s1 object
  lit_detail %>%
    select(
      node = `st1.name`, assoc_node = `st.name`,
      label = `s1.predicate`
    ) %>%
    distinct(),
  # s2 subject - s2 predicate -> s2 object
  lit_detail %>%
    select(
      node = `st.name`, assoc_node = `st2.name`,
      label = `s2.predicate`
    ) %>%
    distinct()
) %>%
  distinct()
edges
plot_network <- function(edges, nodes) {

  graph <- graph_from_data_frame(edges, directed = TRUE, vertices = nodes)
  graph$layout <- layout_with_kk

  # generate colors based on node type
  colors <- c("tomato", "lightblue", "gold")
  V(graph)$color <- colors[V(graph)$node_type]

  # Configure canvas
  default_mar <- par("mar")
  new_mar <- c(0, 0, 0, 0)
  par(mar = new_mar)

  plot.igraph(
    graph,
    vertex.size = 13,
    vertex.label.color = "black",
    vertex.label.family = "Helvetica",
    vertex.label.cex = 0.8,
    edge.arrow.size = 0.4,
    edge.label.color = "black",
    edge.label.family = "Helvetica",
    edge.label.cex = 0.5
  )
  par(mar = default_mar)
}
plot_network(edges, nodes)

Checking the source literature

We can refer back to the articles to check the text that was used to derive the SemMedDB data. This is important due to the imperfect nature of the SemRep annotation process (https://semrep.nlm.nih.gov/).

get_literature <- function(gwas_id, semmed_triple_id) {
  endpoint <- "/literature/gwas"
  params <- list(
    gwas_id = gwas_id,
    semmed_triple_id = semmed_triple_id,
    by_gwas_id = TRUE,
    pval_threshold = 1e-1
  )
  df <- query_epigraphdb(route = endpoint, params = params, mode = "table")
  df %>% select(`triple.id`, `triple.name`, `lit.id`)
}

pub_df <- bind_rows(
  lit_detail %>%
    select(gwas_id = `gwas.id`, semmed_triple_id = `s1.id`) %>%
    distinct(),
  lit_detail %>%
    select(gwas_id = `assoc_gwas.id`, semmed_triple_id = `s2.id`) %>%
    distinct()
) %>%
  transpose() %>%
  map_df(function(x) get_literature(x$gwas_id, x$semmed_triple_id))
pub_df

sessionInfo

sessionInfo()


Try the epigraphdb package in your browser

Any scripts or data that you put into this service are public.

epigraphdb documentation built on March 29, 2021, 5:11 p.m.