In Evatar/geneScrapeR: Scrape NCBI databases

knitr::opts_chunk$set(collapse = TRUE, comment = '')
options(tibble.print_min = 5)
library (genescraper)

The genescraper package is meant to make it easy to extract the names of genes mentioned in publications indexed in the NCBI database PubMed. Any of the NCBI databases can be queried with genescraper, but only publications that are available on the PubTator website can be mined for genes.

There are three functions in genescraper. The first function, scrapeIDs, uses the R package rentrez to search publications in NCBI databases and extract their article IDs. It stores the IDs of all publications that match your search criteria in a list.

The search term passed to scrapeIDs may not be interpreted the way you expect. There is a good example in the rentrez tutorial under the section "Searching databases: entrez_search()". To make sure you get the result you expect, you can use the PubMed Advanced Search Builder: enter your search there, then copy the text it generates into the scrapeIDs function. The search can be refined using a combination of fields, e.g. MeSH Terms, Date - Publication, etc. Information on how to use these fields is available at PubMed Help. Below is an example of how to use the advanced search builder to refine a search on prostate cancer.
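As a small illustration of the `value[Field]` syntax that the Advanced Search Builder generates, the sketch below assembles the same kind of term in base R. The helpers `fieldTerm` and `combineTerms` are hypothetical names introduced only for this example and are not part of genescraper or rentrez; the Advanced Search Builder remains the safer way to check how PubMed will interpret a term.

```r
# Hypothetical helper (not part of genescraper): wrap a value and a PubMed
# field tag in the "(value[Field])" form produced by the search builder.
fieldTerm <- function (value, field, quote = FALSE) {
  if (quote) value <- paste0 ('"', value, '"')
  paste0 ('(', value, '[', field, '])')
}

# Hypothetical helper: join individual clauses with a boolean operator.
combineTerms <- function (..., operator = 'AND') {
  paste (c (...), collapse = paste0 (' ', operator, ' '))
}

term <- combineTerms (
  fieldTerm ('prostate cancer', 'MeSH Terms'),
  fieldTerm ('2017', 'Date - Publication', quote = TRUE)
)

term
```

Building the term from parts keeps every clause wrapped in its own balanced pair of parentheses, which is easy to get wrong when typing a long query by hand.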

The second function, extractGenes, mines the PubTator website and extracts the Entrez Gene ID of each gene mentioned in the title and abstract. This function may take a very long time to run: searching one abstract and extracting the gene IDs mentioned in it takes approximately 1 second. The function can run in parallel, but if you are searching hundreds of thousands of abstracts it can still take many hours. The output of extractGenes is a list of the Entrez Gene IDs found in each abstract. If no genes are present in the title or abstract, extractGenes returns NULL for that article. Below is an example of the output of these two functions.
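To get a feel for the runtime before starting a large job, a back-of-the-envelope estimate can be computed from the figure of roughly 1 second per abstract. This is only a sketch: `estimateHours` is a hypothetical helper, and it assumes the work divides evenly across cores with no network delays or retries, so treat the result as a lower bound.

```r
# Rough estimate only: assumes about 1 second per abstract and a perfectly
# even split of the work across cores (no retries, no network stalls).
estimateHours <- function (nAbstracts, nCores = 1, secondsPerAbstract = 1) {
  (nAbstracts * secondsPerAbstract) / nCores / 3600
}

# 200,000 abstracts on 4 cores: a little under 14 hours under these assumptions.
estimateHours (nAbstracts = 200000, nCores = 4)
```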

pmids <- scrapeIDs (dataBase = 'pubmed',
                    term = '(prostate cancer[MeSH Terms]) AND ("2017"[Date - Publication])')

pmids[1:10]

The scrapeIDs function returns a list of all the IDs that match the search criteria. These IDs can then be passed to the extractGenes function, which accesses the PubTator website and extracts all of the gene IDs associated with each article. The output for an article is NULL if no gene is mentioned in its title or abstract.

genes <- extractGenes (IDs = pmids[1:100],
                       nCores = 2,
                       nTries = 5)

unlist (genes[1:10])
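Because articles without gene mentions come back as NULL, it helps to see how such a list behaves in base R. The sketch below uses a hand-written mock list rather than real extractGenes output; the element types (character IDs) are an assumption made for illustration.

```r
# Mock list in the shape described above: one element per article, NULL where
# no gene was found. Hand-written for illustration, not real output.
mockGenes <- list ('367', NULL, c ('367', '7157'), NULL, '7157', '367')

# Keep only the articles that have at least one gene hit.
withGenes <- Filter (Negate (is.null), mockGenes)
length (withGenes)

# unlist() drops NULL entries, so the IDs can be tallied directly.
geneCounts <- sort (table (unlist (mockGenes)), decreasing = TRUE)
geneCounts
```

Note that unlist discards which article each ID came from, so filter with `Filter` first if you need to keep the per-article structure.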

geneSymbols <- cleanGenes (geneList = genes)

geneSymbols

Using ggplot2, we can look at which genes are most commonly mentioned in connection with prostate cancer. To keep the plot readable, we will only look at the top 15, taken here as the first 15 rows of the human table.

library (ggplot2)
library (magrittr)
library (dplyr)

geneSymbols$human[1:15, ] %>%
  ggplot () +
  geom_bar (mapping = aes (x = reorder(geneSymbol, -n), y = n),
            stat = 'identity',
            width = 0.5) +
  xlab ('Gene') +
  ylab ('Counts') +
  theme (axis.text.x = element_text (angle = 90, hjust = 1))
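The plot above takes the first 15 rows, which gives the top 15 genes only if the table is already sorted by count. If that is not guaranteed, the rows can be sorted explicitly first. The sketch below shows one way to do this in base R on a mock table; `topN` is a hypothetical helper, and the only assumption carried over from the plot code is the pair of column names `geneSymbol` and `n`.

```r
# Mock table using the column names from the plot code; values are invented.
mockCounts <- data.frame (geneSymbol = c ('G1', 'G2', 'G3', 'G4'),
                          n = c (3, 10, 7, 1),
                          stringsAsFactors = FALSE)

# Hypothetical helper: sort by count, largest first, then keep the top rows.
topN <- function (counts, N = 15) {
  sorted <- counts[order (counts$n, decreasing = TRUE), , drop = FALSE]
  head (sorted, N)
}

topN (mockCounts, N = 2)$geneSymbol
```

head returns all rows when the table has fewer than N, so the helper also works for searches that yield fewer than 15 distinct genes.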