This vignette illustrates some of the basics of web-scraping and some features of the myScrapers package - in particular simple web-scraping functions. We also show some functions in the package specifically designed to retrieve public health information for public health practitioners.

Web scraping basics in R

The basic toolkit is:

A basic knowledge of HTML is helpful, especially elements and tags. For example, paragraphs are defined with the <p> tag, links with the <a> tag, and so on.

SelectorGadget helps find the CSS selectors a page uses when its layout is more sophisticated.
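As a minimal sketch of how tags map onto extracted content, the snippet below parses a small made-up HTML fragment (not taken from any real page) with rvest and pulls out the paragraph and link nodes by tag name.

```r
library(rvest)

# A made-up HTML fragment used purely for illustration
page <- read_html('
  <html><body>
    <p>First paragraph.</p>
    <p>Second paragraph with a <a href="https://example.org/report">link</a>.</p>
  </body></html>')

# Select nodes by tag name, then convert them to plain text
paragraphs <- page %>% html_nodes("p") %>% html_text()
link_text  <- page %>% html_nodes("a") %>% html_text()
```

The same two steps, select nodes and then convert them, apply unchanged to a page read from a URL.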

Using rvest

rvest builds on the httr package, which handles communication with web servers. Given a URL (web link), rvest will read the page's HTML or XML.

url <- ""


We can then extract information based on the structure of the html using the tags.

For example we can extract links or paragraphs...

# all link (<a>) nodes on the page
read_html(url) %>%
  html_nodes("a")

# the fourth paragraph (<p>) node
read_html(url) %>%
  html_nodes("p") %>%
  .[4]

or links...

# the first six link nodes
read_html(url) %>%
  html_nodes("a") %>%
  .[1:6]

Using SelectorGadget

Sometimes web pages are laid out in a more complex manner or use style sheets, and the page developers may use individualised tags to lay out the page. SelectorGadget can help identify these tags and aid data extraction.

Let's look at PHE's web pages and try to extract information.

url <- ""

# the first ten elements with the CSS class "column-half"
read_html(url) %>%
  html_nodes(".column-half") %>%
  .[1:10]
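SelectorGadget returns CSS selectors such as .column-half, where the leading dot means "any element with this class". A minimal sketch on a made-up fragment (the class names and text are assumptions for illustration only):

```r
library(rvest)

# Made-up fragment: two blocks share a class, one does not
page <- read_html('
  <div class="column-half"><h3>Section one</h3></div>
  <div class="column-half"><h3>Section two</h3></div>
  <div class="footer"><h3>Contact</h3></div>')

# ".column-half" matches every element whose class attribute includes column-half
blocks <- page %>% html_nodes(".column-half") %>% html_text(trim = TRUE)
```

Only the two blocks carrying the class are returned; the footer is ignored.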


We have built a package based on these ideas to facilitate extracting data from webpages.

Installing the package

The package is only available on GitHub and can be installed using devtools.


Simple web-scraping

In myScrapers there are 2 primary functions built on rvest:

Both are built on top of web commands which underpin the exchange of information across the internet.

In R, given any weblink or url we can read the site with the GET function in httr.

Let's use https:\

url <- ""

html <- httr::GET(url)   # raw HTTP response

rvest <- read_html(url)  # parsed HTML document

rvest %>%
  html_nodes("a") %>%
  .[1]                   # first link node on the page
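A slightly more defensive version of the same request can be sketched as follows. Here fetch_page is a hypothetical helper (it is not part of myScrapers); it checks the HTTP status before parsing, so a failed request is reported rather than silently parsed.

```r
library(httr)
library(rvest)

# Hypothetical helper: fetch a page, fail loudly on HTTP errors, return parsed HTML
fetch_page <- function(url) {
  stopifnot(is.character(url), length(url) == 1, nzchar(url))
  response <- GET(url)
  stop_for_status(response)                 # error on 4xx / 5xx responses
  read_html(content(response, as = "text", encoding = "UTF-8"))
}
```

It can then be used in place of read_html, for example fetch_page(url) %>% html_nodes("a").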

In myScrapers there are 4 primary functions built on these:


We can use get_page_links to extract the links from the following page of PHE statistical releases.

url <- ""

get_page_links(url)
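To show the kind of work a link extractor does, here is a minimal sketch. extract_links is a hypothetical stand-in and is not the actual implementation of get_page_links; it collects href attributes and resolves relative links against an assumed base URL.

```r
library(rvest)

# Hypothetical stand-in for a link extractor (not the myScrapers source)
extract_links <- function(page, base_url) {
  hrefs <- page %>% html_nodes("a") %>% html_attr("href")
  hrefs <- hrefs[!is.na(hrefs)]             # drop <a> tags without an href
  unique(xml2::url_absolute(hrefs, base_url))
}

# Made-up fragment with a relative link, an absolute link and a bare anchor
page <- read_html('<a href="/stats/report.pdf">Report</a>
                   <a href="https://example.org/data.xlsx">Data</a>
                   <a name="top">Anchor</a>')
links <- extract_links(page, "https://example.org/")
```

Resolving relative links matters because a download step needs full URLs.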

Use cases

We'll use GP in-hours syndromic surveillance data to illustrate further uses. This report "Monitors the number of people who visit their GP during surgery hours under the syndromic surveillance system."

The system publishes weekly reports and spreadsheets. Obtaining a year's worth manually would require 104 separate downloads (52 weekly reports plus 52 spreadsheets).

Using a web-scraping approach this can be achieved in a few lines of code.

Identifying reports

The code below identifies all the pdf reports on the page.

urls <- ""

get_page_docs(urls) %>%
  head(10)

We can then use the downloader package to download the pdfs:

## not run

get_page_links(urls) %>%
  .[grepl("pdf$", .)] %>%
  head(10) %>%
  unique() %>%
  # mode = "wb" keeps binary files such as PDFs intact on Windows
  purrr::map(., ~downloader::download(.x, destfile = basename(.x), mode = "wb"))
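The filtering step relies only on a regular expression anchored to the end of each link. A small sketch on made-up links (all URLs below are illustrative, not real report locations):

```r
links <- c(
  "https://example.org/reports/week01.pdf",
  "https://example.org/reports/week01.pdf",   # duplicate link
  "https://example.org/reports/week01.xlsx",
  "https://example.org/about-pdf-reports"
)

# "\\.pdf$" keeps links that end in a literal ".pdf"; unique() drops repeats
pdf_links <- unique(links[grepl("\\.pdf$", links)])

# basename() gives the file name used as the download destination
destfiles <- basename(pdf_links)
```

Escaping the dot is slightly stricter than "pdf$", which would also keep a link ending in, say, "notapdf".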

Identifying data (spreadsheet links)

We can take a similar approach to spreadsheets.

urls <- ""

get_page_links(urls) %>%
  .[grepl("xls.?$", .)] %>%
  head(4) %>%
  unique() %>%
  purrr::map(., ~downloader::download(.x, destfile = basename(.x), mode = "wb"))


Having downloaded the reports or spreadsheets, it is now straightforward to import them for further analysis.

files <- list.files(pattern = "\\.xlsx?$")  # escape the dot so only .xls / .xlsx files match


data <- purrr::map(files, ~(readxl::read_excel(.x, sheet = "Local Authority", na = "*",
    skip = 4)))
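Because purrr::map returns a list with one data frame per file, a common next step is to stack them. A sketch with two small in-memory data frames standing in for imported sheets (the file names, column names and values are invented):

```r
library(dplyr)

# Stand-ins for the list returned by purrr::map(files, read_excel, ...)
data <- list(
  week01.xls = data.frame(area = c("A", "B"), rate = c(1.2, 3.4)),
  week02.xls = data.frame(area = c("A", "B"), rate = c(1.5, 3.1))
)

# Stack the list into one data frame, recording the source file in a column
combined <- bind_rows(data, .id = "file")
```

With an unnamed list, the .id column records each element's position instead of a name, so naming the list by file (for example with purrr::set_names) keeps the source traceable.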


Scraping text

This follows the same principle. The function get_page_text is designed to extract text from a webpage. For example, we can extract the text from an article about Matt Hancock's description of "predictive prevention".

get_page_text("") %>% .[1:4]

Analysing Duncan Selbie's Friday messages

Using simple functions it is relatively easy to scrape Duncan Selbie's blogs into a data frame for further analysis.

Starting from the base URL (url_ds below), there are 8 pages of results, so the first task is to create a list of URLs.

url_ds <- ""
url_ds1 <- paste0(url_ds, "page/", 2:8)
urls_ds <- c(url_ds, url_ds1)
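The shape of the resulting vector is easier to see with a concrete value. The sketch below uses a hypothetical base URL purely for illustration:

```r
# Hypothetical base URL, used only to show the shape of the result
base_url <- "https://example.org/blog/"

# Page 1 is the base URL itself; pages 2 to 8 live under page/<n>
page_urls <- c(base_url, paste0(base_url, "page/", 2:8))
```

paste0 is vectorised, so a single call produces all seven numbered pages.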

Then we can extract links and isolate those specific to the Friday messages:

links <- purrr::map(urls_ds, ~(get_page_links(.x))) 

friday_message <- links %>%
  purrr::flatten() %>%
  .[grepl("duncan-selbies-friday-message", .)] %>%
  .[!grepl("comments", .)] %>%
  unique()
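The include-then-exclude filter can be checked on a handful of made-up links (the example.org addresses are assumptions; only the two patterns come from the code above):

```r
links <- c(
  "https://example.org/2019/01/11/duncan-selbies-friday-message-11-january-2019/",
  "https://example.org/2019/01/11/duncan-selbies-friday-message-11-january-2019/#comments",
  "https://example.org/2019/01/11/duncan-selbies-friday-message-11-january-2019/",
  "https://example.org/about/"
)

# Keep Friday-message links, drop their comment anchors, then de-duplicate
keep <- grepl("duncan-selbies-friday-message", links) & !grepl("comments", links)
friday_links <- unique(links[keep])
```

Four candidate links reduce to the single post URL.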


and then extract blog text:


blog_text <- purrr::map(friday_message, ~(get_page_text(.x)))
blog_text <- purrr::map(blog_text, ~(stringr::str_remove(.x, "\\n")))
blog_text <- purrr::map(blog_text, ~(stringr::str_remove(.x, "    GOV.UK blogs use cookies to make the site simpler. Find out more about cookies\n  ")))
blog_text <- purrr::map(blog_text, ~(stringr::str_remove(.x, "Dear everyone")))
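Two details of stringr's matching are worth knowing when cleaning scraped text: str_remove drops only the first match in each string, and patterns are regular expressions unless wrapped in fixed(). A short sketch on invented strings:

```r
library(stringr)

text <- "Dear everyone\nLine one.\nLine two."

first_only <- str_remove(text, "\\n")       # removes the first newline only
all_gone   <- str_remove_all(text, "\\n")   # removes every newline

# fixed() treats the pattern as literal text, so "." is not a wildcard
literal <- str_remove("GOVxUK and GOV.UK", fixed("GOV.UK"))
```

If every newline in a post should go, str_remove_all is the variant to reach for.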

blog_title <- purrr::map(blog_text, 2)
names(blog_text) <- blog_title

blog_text1 <- purrr::map(blog_text, magrittr::extract, 5:11)  # keep elements 5 to 11 of each post
blog_text2 <- purrr::map(blog_text1, data.frame)
blog_text2 <- purrr::map_df(blog_text2, dplyr::bind_rows)
blog_text2 <- blog_text2 %>% dplyr::mutate(text = clean_texts(.x..i..))

We can then visualise with, for example, a wordcloud.


corp <- corpus(blog_text2$text)
dfm <- dfm(corp, ngrams = 2, remove = c("government_licence", "open_government", "public_health", "official_blog", "blog_public", "health_england", "cancel_reply", "content_available", "health_blog", "licence_v", "best_wishes", "otherwise_stated", "except_otherwise", "friday_messages"))

textplot_wordcloud(dfm, color = viridis::plasma(n = 10))
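The ngrams and remove arguments to dfm() belong to older quanteda releases. In more recent versions the same idea is, as far as we are aware, expressed on tokens first. The sketch below assumes a recent quanteda and uses made-up text in place of the scraped posts:

```r
library(quanteda)

# Made-up text standing in for the scraped blog posts
txt <- "Public health matters. Public health protects."

toks <- tokens(txt, remove_punct = TRUE)
bigrams <- tokens_ngrams(toks, n = 2)       # joins adjacent word pairs with "_"

# dfm() lower-cases by default; dfm_remove() drops unwanted bigrams
bigram_dfm <- dfm_remove(dfm(bigrams), pattern = c("matters_public"))
features <- featnames(bigram_dfm)
```

The resulting bigram_dfm plays the same role as the dfm object passed to textplot_wordcloud.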


Additional functions

I have added a few functions to the package.

get_dsph_england returns a list of local authorities and their current DsPH. It scrapes

dsph <- get_dsph_england()
dsph

get_phe_catalogue identifies all the PHE publications on GOV.UK. For this function you have to set the n argument. We recommend starting at n = 110. This produces an interactive searchable table of links.

cat <- get_phe_catalogue(n=110)


Reviewing parliamentary questions (PQs)

We have added a get_pqs function built on the hansard package to extract PQs addressed to, answered by or mentioning PHE. This takes a start date as an argument in the form yyyy-mm-dd.


pqs <- get_pqs(start_date = "2019-03-01")



We can look at the categories of questions asked.

pqs %>%
  group_by(hansard_category) %>%
  count() %>%
  arrange(-n)
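The same tally can be written with a single dplyr::count call. The sketch below runs on invented rows; only the hansard_category column name is taken from the code above, and the category values are made up:

```r
library(dplyr)

# Invented rows; only the hansard_category column name comes from the text
pqs_demo <- data.frame(
  hansard_category = c("Obesity", "Vaccination", "Obesity", "Air Pollution", "Obesity")
)

category_counts <- pqs_demo %>%
  count(hansard_category, sort = TRUE)      # most frequent category first
```

count(sort = TRUE) is equivalent to group_by, count and a descending arrange on n.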

Using myScrapers to extract NICE guidance

We can use the toolkit to extract NICE Public Health Guidance as follows:

url <- ""

links <- get_page_links(url)[13:22] ##first 10  sets of guidance
links1 <- purrr::map(links, ~(paste0("", .x, "/chapter/Recommendations")))

pander::panderOptions("", "multiline")
pander::panderOptions("table.alignment.default", "left")
recommendations <- purrr::map(links1, ~(get_page_text(.x))) %>% purrr::map(., data.frame)
head(recommendations, 1)

search <- "google analytics understanding users"

g <- googlesearchR(search, n = 10)

g[13] %>% as.character() %>% get_page_text(.)

