knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE, 
  message = FALSE, 
  cache = TRUE, 
  echo = TRUE, 
  comment = "#>"
)

library(tidyverse)
library(myScrapers)
library(Rcrawler)
library(rvest)

This vignette illustrates the basics of web scraping and some features of the myScrapers package, in particular its simple web-scraping functions. We also show functions in the package designed specifically to retrieve public health information for public health practitioners.

Web scraping basics in R

The basic toolkit is:

- The rvest package, which reads and parses a page's html.
- A basic knowledge of html - especially elements and tags. For example, paragraphs are defined with the <p> tag, links with the <a> tag and so on (https://www.w3schools.com/html/html_intro.asp). A short example follows this list.
- SelectorGadget, which helps find the tags used in style sheets when the page layout is more sophisticated.
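
To make the idea of elements and tags concrete, here is a minimal sketch that parses a small, invented HTML string with rvest and pulls out the paragraph text and link target:

library(rvest)

# a tiny, invented html string used only to illustrate elements and tags
doc <- read_html('<html><body>
  <p>First paragraph</p>
  <p>Second paragraph with a <a href="https://example.org">link</a></p>
</body></html>')

doc %>% html_nodes("p") %>% html_text()        # text inside <p> elements
doc %>% html_nodes("a") %>% html_attr("href")  # the href attribute of <a> elements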

Using rvest

rvest is built on the httr package, which encapsulates web communications. Given a url (weblink), rvest will read the page html or xml.

url <- "https://www.gov.uk/government/organisations/department-for-education"

read_html(url)  # returns an xml_document containing the <head> and <body> nodes

We can then extract information based on the structure of the html using the tags.

For example, we can extract links or paragraphs...

read_html(url) %>%
  html_nodes("a") %>%
  html_attr("href")

read_html(url) %>%
  html_nodes("p") %>%
  .[4] %>%
  html_text()

or just the first few links...

read_html(url) %>%
  html_nodes("a") %>%
  .[1:6] %>%
  html_attr("href")

Using SelectorGadget

Sometimes web pages are laid out in a more complex manner or use style sheets, and the page developers may use individualised tags to lay out the page. SelectorGadget can help identify these tags and aid data extraction.

Let's look at PHE's webpages and try to extract information.

https://www.gov.uk/government/organisations/public-health-england

url <- "https://www.gov.uk/government/organisations/public-health-england"


read_html(url) %>%
  html_nodes(".column-half") %>%
  .[1:10] %>%
  html_text()

myScrapers

We have built a package based on these ideas to facilitate extracting data from webpages.

Installing the package

The package is only available on GitHub and can be installed using devtools.

library(devtools)
devtools::install_github("julianflowers/myScrapers")

Simple web-scraping

In myScrapers the primary scraping functions are built on rvest and httr. Both are built on top of the web commands which underpin the exchange of information across the internet.

In R, given any weblink or url, we can read the site with the GET function in httr.

Let's use https://fingertips.phe.org.uk...

url <- "https://fingertips.phe.org.uk"

html <- GET(url)

html
rvest <- read_html(url)
rvest

rvest %>%
  html_nodes("a") %>%
  .[1] %>%
  html_attr("href")

In myScrapers there are four primary functions built on these, including get_page_links, get_page_docs and get_page_text, which are illustrated below.

Examples

We can use get_page_links to extract the links from the following page of PHE statistical releases: https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england

url <- "https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england"

get_page_links(url) %>%
  .[19:40]
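
Rather than subsetting by position, it can be more robust to filter the returned links by pattern. A small sketch; the regular expression is an assumption about how the statistics URLs are structured, not something the package provides:

get_page_links(url) %>%
  .[grepl("/government/statistics/", .)] %>%   # keep only statistics pages
  head(10)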

Use cases

We'll use GP in-hours syndromic surveillance data to illustrate further uses. This report "Monitors the number of people who visit their GP during surgery hours under the syndromic surveillance system."

The system publishes weekly reports and spreadsheets - to obtain a year's worth of these reports manually would require 104 separate downloads.

Using a web-scraping approach, this can be achieved in a few lines of code.

Identifying reports

The code below identifies all the pdf reports on the page.

urls <- "https://www.gov.uk/government/publications/gp-in-hours-weekly-bulletins-for-2018"

get_page_docs(urls) %>%
  head(10) %>%
  unique()
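
For context, a similar result can be approximated with rvest alone by keeping only the links that end in a document extension. This is a conceptual sketch of the idea, not the package's actual implementation:

read_html(urls) %>%
  html_nodes("a") %>%
  html_attr("href") %>%
  .[grepl("\\.(pdf|xlsx?|docx?|csv)$", .)] %>%  # keep links ending in a document extension
  unique() %>%
  head(10)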

We can then use the downloader package to download the pdfs:

## not run
library(downloader)

(get_page_links(urls) %>%
  .[grepl("pdf$", .)] %>%
  head(10) %>%
  unique() %>%
  purrr::map(., ~download(.x, destfile = basename(.x))))

Identifying data (spreadsheet links)

We can take a similar approach to spreadsheets.

urls <- "https://www.gov.uk/government/publications/gp-in-hours-weekly-bulletins-for-2018"

get_page_links(urls) %>%
  .[grepl("xls.?$", .)] %>%
  head(4) %>%
  unique() %>%
  purrr::map(., ~downloader::download(.x, destfile = basename(.x)))

Having downloaded the reports or spreadsheets it is now straightforward to import them for further analysis.

library(readxl)
files <- list.files(pattern = ".xls")

data <- purrr::map(files, ~(read_excel(.x, sheet = "Local Authority", na = "*", 
    skip = 4)))

head(data)
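
To analyse the weekly sheets together, the list can be bound into a single data frame; a minimal sketch, assuming the sheets share the same column layout across files:

data_all <- data %>%
  purrr::set_names(files) %>%       # keep track of which file each row came from
  dplyr::bind_rows(.id = "file")

head(data_all)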

Scraping text

This follows the same principle. The function get_page_text is designed to extract text from a webpage. For example, we can extract the text from an article about Matt Hancock's description of "predictive prevention".

get_page_text("https://www.thetimes.co.uk/article/nhs-will-use-phone-data-to-predict-threats-to-your-health-r7085zqfq") %>%
  .[1:4]

Analysing Duncan Selbie's Friday messages

Using simple functions it is relatively easy to scrape Duncan Selbie's blogs into a data frame for further analysis.

The base URL is https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/, and there are 8 pages of results, so the first task is to create a list of urls.

url_ds <- "https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/"
url_ds1 <- paste0(url_ds, "page/", 2:8)
urls_ds <- c(url_ds, url_ds1)

Then we can extract links and isolate those specific to the Friday messages:

links <- purrr::map(urls_ds, ~(get_page_links(.x))) 

friday_message <- links %>% 
  purrr::flatten() %>% 
  .[grepl("duncan-selbies-friday-message", .)] %>% 
  .[!grepl("comments", .)] %>% 
  unique()

head(friday_message)

and then extract blog text:

library(tm)
library(magrittr)

blog_text <- purrr::map(friday_message, ~(get_page_text(.x)))
blog_text <- purrr::map(blog_text, ~(str_remove(.x, "\\n")))
blog_text <- purrr::map(blog_text, ~(str_remove(.x, "    GOV.UK blogs use cookies to make the site simpler. Find out more about cookies\n  ")))
blog_text <- purrr::map(blog_text, ~(str_remove(.x, "Dear everyone")))

blog_title <- purrr::map(blog_text, 2)
names(blog_text) <- blog_title

blog_text1 <- purrr::map(blog_text, extract, 5:11)
blog_text2 <- purrr::map(blog_text1, data.frame)
blog_text2 <- purrr::map_df(blog_text2, bind_rows)
blog_text2 <- blog_text2 %>% mutate(text = clean_texts(.x..i..))  # .x..i.. is the column name created by the data.frame() step above

We can then visualise with, for example, a wordcloud.

library(quanteda)

corp <- corpus(blog_text2$text)
dfm <- dfm(corp, ngrams = 2, remove = c("government_licence", "open_government", "public_health", "official_blog", "blog_public", "health_england", "cancel_reply", "content_available", "health_blog", "licence_v", "best_wishes", "otherwise_stated", "except_otherwise", "friday_messages", "available_open"))

textplot_wordcloud(dfm, color = viridis::plasma(n = 10))

Additional functions

I have added a few functions to the package.

get_dsph_england returns a list of local authorities and their current DsPH. It scrapes https://www.gov.uk/government/publications/directors-of-public-health-in-england--2/directors-of-public-health-in-england

dsph <- get_dsph_england()
dsph %>%
  knitr::kable()

get_phe_catalogue identifies all the PHE publications on GOV.UK. For this function you have to set the n argument; we recommend starting at n = 110. It produces an interactive, searchable table of links.

cat <- get_phe_catalogue(n=110)

cat

Reviewing parliamentary questions (PQs)

We have added a get_pqs function built on the hansard package to extract PQs addressed to, answered by or mentioning PHE. This takes a start date as an argument in the form yyyy-mm-dd.

pqs <- get_pqs(start_date = "2019-03-01")

pqs

We can look at the categories of questions asked.

pqs %>%
  group_by(hansard_category) %>%
  count() %>%
  arrange(-n) %>%
  top_n(10)
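
A quick plot is often easier to scan than a table; a minimal ggplot2 sketch, assuming pqs has the hansard_category column used above:

pqs %>%
  count(hansard_category, sort = TRUE) %>%     # number of PQs per category
  top_n(10, n) %>%
  ggplot(aes(reorder(hansard_category, n), n)) +
  geom_col() +
  coord_flip() +
  labs(x = "Hansard category", y = "Number of PQs")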

Using myScrapers to extract NICE guidance

We can use the toolkit to extract NICE Public Health Guidance as follows:

url <- "https://www.nice.org.uk/guidance/published?type=ph"

links <- get_page_links(url)[13:22]  ## first 10 sets of guidance
links1 <- purrr::map(links, ~(paste0("https://www.nice.org.uk", .x, "/chapter/Recommendations")))

pander::panderOptions("table.style", "multiline")
pander::panderOptions("table.alignment.default", "left")
recommendations <- purrr::map(links1, ~(get_page_text(.x))) %>% purrr::map(., data.frame) 
head(recommendations, 1) %>% 
  knitr::kable()
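
To work with the recommendations as one table rather than a list, the pages can be bound together with an identifier; a sketch assuming each element of recommendations is a single-column data frame of text, as produced above:

recommendations_df <- recommendations %>%
  purrr::set_names(basename(links)) %>%                           # label each page by its guidance id
  purrr::map(~tibble::tibble(text = as.character(.x[[1]]))) %>%
  dplyr::bind_rows(.id = "guidance")

head(recommendations_df)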
search <- "google analytics understanding users"

g <- googlesearchR(search, n = 10)

g[13] %>% as.character() %>% get_page_text(.) %>%

