knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  error = TRUE
)
Often we want to save search results from organisational websites; for example, when conducting grey literature searches in a systematic review. This package allows users to conduct an advanced search of one such website, that of the Centers for Disease Control and Prevention (CDC), exporting the results as a dataframe object and producing a .txt report of the search.
The process consists of several main steps:

* Generating a functional URL for each page of 10 search results
* Saving each results page as an HTML document (for transparent reporting)
* Scraping the (10) search results for each page
* Converting the results into a table containing the titles, links and short descriptions of each result
* Saving a text report file that documents the search activities in full (for transparent reporting)
Begin by installing the development version of the package from GitHub with:
install.packages("devtools", repos = "http://cran.us.r-project.org") devtools::install_github("nealhaddaway/CDCscraper") library(CDCscraper)
The CDC website allows the following information to be searched through its main interface and through the advanced search function (the names below are the corresponding function inputs):

* and_terms
* exact_phrase
* or_terms
* not_terms
* date_from and date_to

Using the buildCDClinks() function, you can input any of these fields and generate one or more URLs for pages of search results. In addition to the above fields, the following information can be provided (empty fields will result in use of defaults or non-use of the field):

* start_page (default is 1)
* pages (default is 1)
* language (default is 'en', English)

Consider this example:
and_terms <- c('disease', 'spread')
not_terms <- c('animal', 'fish')
exact_phrase <- c('corona virus')
or_terms <- c('pandemic', 'global')
date_from <- '01/01/1980'
date_to <- '10/11/2020'
link <- buildCDClinks(and_terms = and_terms,
                      exact_phrase = exact_phrase,
                      or_terms = or_terms,
                      not_terms = not_terms,
                      date_from = date_from,
                      date_to = date_to)
link
The output "https://search.cdc.gov/search/index.html?all=disease%20spread&none=animal%20fish&exact=corona%20virus&any=pandemic%20global&date1=01%2F01%2F1980&date2=10%2F11%2F20201"
is a functioning URL constructed from the inputs we provided above.
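For intuition, the URL is simply the base search address with each field percent-encoded and appended as a query parameter; the parameter names (all, none, exact, any, date1, date2) can be read directly from the URL above. A minimal sketch of this assembly in base R (illustrative only; not the internal code of buildCDClinks()) might look like:

# Illustrative sketch: assemble a CDC search URL from the inputs above
# (not the internal code of buildCDClinks())
encode <- function(x) utils::URLencode(paste(x, collapse = " "), reserved = TRUE)
url <- paste0("https://search.cdc.gov/search/index.html",
              "?all=",   encode(and_terms),
              "&none=",  encode(not_terms),
              "&exact=", encode(exact_phrase),
              "&any=",   encode(or_terms),
              "&date1=", encode(date_from),
              "&date2=", encode(date_to))
url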
We can modify this to provide us with a full list of viable URLs for all pages of search results - in this case 562 as of October 13th 2020:
links <- buildCDClinks(and_terms = and_terms,
                       exact_phrase = exact_phrase,
                       or_terms = or_terms,
                       not_terms = not_terms,
                       date_from = date_from,
                       date_to = date_to,
                       start_page = 1,
                       pages = 6)
head(links)
We can see that the links vary here only in the page= parameter, indicating the record from which each page of results should be displayed.
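If you want to confirm this programmatically, the value of that parameter can be pulled out of each link (an illustrative one-liner, assuming each URL carries a page= query string as described above):

# Extract the value of the page= parameter from each generated link (illustrative)
sub(".*page=([0-9]+).*", "\\1", links)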
Now that we have our list of URLs, we can download each page of results as an HTML file using the save_codes() function. The function makes use of the RSelenium package to scrape the results from the CDC website. This is necessary because the results are otherwise hidden behind an 'accordion', a dynamic way to compress information in a web page.

The function opens a browser of your choice (browser = 'firefox', for example) and exports what it finds. It pauses for 1 second to allow each page to load fully before exporting, so the scraping will take at least n seconds, where n is the number of pages of search results being scraped.

The HTML files that are saved are simplified versions of the HTML code, and so are not particularly attractive. By default, the files are saved to the working directory; an alternative location can be specified.
try_links <- links[1:5]
codes <- save_codes(urls = try_links)
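Under the hood, this follows the usual RSelenium pattern of opening a browser, navigating to each URL, waiting for the dynamic content to render and then saving the page source. A simplified sketch of that pattern (illustrative only; not the actual implementation of save_codes(), and the file names here are hypothetical):

# Simplified sketch of the RSelenium pattern described above
# (not the actual implementation of save_codes())
library(RSelenium)
driver <- rsDriver(browser = "firefox", verbose = FALSE)
remDr <- driver$client
for (i in seq_along(try_links)) {
  remDr$navigate(try_links[i])
  Sys.sleep(1)                                        # allow the page to load fully
  html <- remDr$getPageSource()[[1]]
  writeLines(html, paste0("cdc_page_", i, ".html"))   # hypothetical file name
}
remDr$close()
driver$server$stop()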
Now that we have downloaded our results, we can scrape them locally for citation information.
The get_info() function loads in all HTML files in the working directory and scrapes the following data for each record (codes in brackets correspond to the equivalent .ris file field codes):

* title (TI)
* description (AB)
* link to the article record (most often the publisher's website) (UR)

This returns a tibble as follows:
get_info()
info <- tibble::as_tibble(readRDS('info.rds'))
info
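Because the scraped fields map onto standard .ris field codes, the tibble can also be written out as a minimal .ris file for import into a reference manager. A rough sketch, assuming the columns are named title, description and link (the real column names may differ):

# Rough sketch: export the scraped records as a minimal .ris file
# (assumes columns named title, description and link)
ris <- unlist(lapply(seq_len(nrow(info)), function(i) {
  c("TY  - ELEC",
    paste0("TI  - ", info$title[i]),
    paste0("AB  - ", info$description[i]),
    paste0("UR  - ", info$link[i]),
    "ER  - ",
    "")
}))
writeLines(ris, "cdc_results.ris")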
All of the stages above can be performed together by making use of the global wrapper function save_and_scrapeCDC() as follows:
and_terms <- c('disease', 'spread')
not_terms <- c('animal', 'fish')
exact_phrase <- c('corona virus')
or_terms <- c('pandemic', 'global')
date_from <- '01/01/1980'
date_to <- '10/11/2020'
info <- save_and_scrapeCDC(and_terms, not_terms, exact_phrase, or_terms,
                           date_from, date_to, pages = 5)
This function downloads all HTML files and saves them to the working directory, scrapes out the relevant citation data and produces an info object, and saves a text file report describing the search in detail.