knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  error = TRUE
)
Often we want to save search results from organisational websites; for example, when conducting grey literature searches in a systematic review. This package allows users to conduct an advanced search of one such website, that of the Centers for Disease Control and Prevention (CDC), exporting the results as a dataframe object and producing a .txt report of the search.
The process consists of several main steps:

* Generating a functional URL for each page of 10 search results
* Saving each results page as an HTML document (for transparent reporting)
* Scraping the (10) search results for each page
* Converting the results into a table containing the titles, links and short descriptions of each result
* Saving a text report file that documents the search activities in full (for transparent reporting)
Begin by installing the development version of the package from GitHub with:
install.packages("devtools", repos = "http://cran.us.r-project.org") devtools::install_github("nealhaddaway/CDCscraper") library(CDCscraper)
The CDC website allows the following information to be searched through its main interface and through the advanced search function (the names below are the corresponding function inputs):

* and_terms
* exact_phrase
* or_terms
* not_terms
* date_from and date_to

Using the buildCDClinks() function, you can input any of these fields and generate one or more URLs for pages of search results. In addition to the above fields, the following information can be provided (empty fields will result in use of defaults or non-use of the field):

* start_page (default is 1)
* pages (default is 1)
* language (default is 'en', English)

Consider this example:
and_terms <- c('disease', 'spread')
not_terms <- c('animal', 'fish')
exact_phrase <- c('corona virus')
or_terms <- c('pandemic', 'global')
date_from <- '01/01/1980'
date_to <- '10/11/2020'
link <- buildCDClinks(and_terms = and_terms,
                      exact_phrase = exact_phrase,
                      or_terms = or_terms,
                      not_terms = not_terms,
                      date_from = date_from,
                      date_to = date_to)
link
The output "https://search.cdc.gov/search/index.html?all=disease%20spread&none=animal%20fish&exact=corona%20virus&any=pandemic%20global&date1=01%2F01%2F1980&date2=10%2F11%2F20201"
is a functioning URL constructed from the inputs we provided above.
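For intuition, the URL is simply the base search address with each field percent-encoded and appended as a query parameter; the parameter names (all, none, exact, any, date1, date2) can be read directly from the URL above. A minimal sketch of this assembly in base R (illustrative only; not the internal code of buildCDClinks()) might look like:

# Illustrative sketch: assemble a CDC search URL from the inputs above
# (not the internal code of buildCDClinks())
encode <- function(x) utils::URLencode(paste(x, collapse = " "), reserved = TRUE)
url <- paste0("https://search.cdc.gov/search/index.html",
              "?all=",   encode(and_terms),
              "&none=",  encode(not_terms),
              "&exact=", encode(exact_phrase),
              "&any=",   encode(or_terms),
              "&date1=", encode(date_from),
              "&date2=", encode(date_to))
url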
We can modify this to provide us with a full list of viable URLs for all pages of search results - in this case 562 as of October 13th 2020:
links <- buildCDClinks(and_terms = and_terms,
                       exact_phrase = exact_phrase,
                       or_terms = or_terms,
                       not_terms = not_terms,
                       date_from = date_from,
                       date_to = date_to,
                       start_page = 1,
                       pages = 6)
head(links)
We can see that the links vary here only in the page= parameter, indicating the record from which each page of results should be displayed.
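If you want to confirm this programmatically, the value of that parameter can be pulled out of each link (an illustrative one-liner, assuming each URL carries a page= query string as described above):

# Extract the value of the page= parameter from each generated link (illustrative)
sub(".*page=([0-9]+).*", "\\1", links)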
Now that we have our list of URLs, we can download each page of results as an HTML file using the save_codes() function. The function makes use of the RSelenium package to scrape the results from the CDC website. This is necessary because the results are otherwise hidden behind an 'accordion', a dynamic way to compress information in a web page.

The function opens a browser of your choice (browser = 'firefox', for example) and exports what it finds. It pauses for 1 second to allow each page to load fully before exporting, so the scraping will take at least n seconds, where n is the number of pages of search results being scraped.

The HTML files that are saved are simplified versions of the HTML code, and so are not particularly attractive. By default, the files are saved to the working directory; an alternative location can be specified.
try_links <- links[1:5]
codes <- save_codes(urls = try_links)
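Under the hood, this follows the usual RSelenium pattern of opening a browser, navigating to each URL, waiting for the dynamic content to render and then saving the page source. A simplified sketch of that pattern (illustrative only; not the actual implementation of save_codes(), and the file names here are hypothetical):

# Simplified sketch of the RSelenium pattern described above
# (not the actual implementation of save_codes())
library(RSelenium)
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client
for (i in seq_along(try_links)) {
  remDr$navigate(try_links[i])
  Sys.sleep(1)                                        # allow the page to load fully
  html <- remDr$getPageSource()[[1]]
  writeLines(html, paste0("cdc_page_", i, ".html"))   # hypothetical file name
}
remDr$close()
driver$server$stop()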
Now that we have downloaded our results, we can scrape them locally for citation information.
The get_info() function loads in all HTML files in the working directory and scrapes the following data for each record (codes in brackets correspond to the equivalent .ris file field codes):

* title (TI)
* description (AB)
* link to the article record (most often the publisher's website) (UR)

This returns a tibble as follows:
get_info()
info <- tibble::as_tibble(readRDS('info.rds'))
info
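Because the scraped fields map onto standard .ris field codes, the tibble can also be written out as a minimal .ris file for import into a reference manager. A rough sketch, assuming the columns are named title, description and link (the real column names may differ):

# Rough sketch: export the scraped records as a minimal .ris file
# (assumes columns named title, description and link)
ris <- unlist(lapply(seq_len(nrow(info)), function(i) {
  c("TY  - ELEC",
    paste0("TI  - ", info$title[i]),
    paste0("AB  - ", info$description[i]),
    paste0("UR  - ", info$link[i]),
    "ER  - ",
    "")
}))
writeLines(ris, "cdc_results.ris")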
All of the stages above can be performed together by making use of the global wrapper function save_and_scrapeCDC() as follows:
and_terms <- c('disease', 'spread')
not_terms <- c('animal', 'fish')
exact_phrase <- c('corona virus')
or_terms <- c('pandemic', 'global')
date_from <- '01/01/1980'
date_to <- '10/11/2020'
info <- save_and_scrapeCDC(and_terms, not_terms, exact_phrase, or_terms,
                           date_from, date_to, pages = 5)
This function downloads all HTML files and saves them to the working directory, scrapes out the relevant citation data and produces an info object, and saves a text file report describing the search in detail.