```r
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  cache = TRUE,
  echo = TRUE,
  comment = "#>"
)
library(tidyverse)
library(myScrapers)
library(Rcrawler)
library(rvest)
```
This vignette illustrates some of the basics of web-scraping and some features of the myScrapers package, in particular its simple web-scraping functions. We also show some functions in the package specifically designed to retrieve public health information for public health practitioners.
The basic toolkit is:

- the `rvest` package in R, or
- `beautiful soup` in Python.

A basic knowledge of HTML is helpful, especially elements and tags. For example, paragraphs are defined with the `<p>` tag, links with the `<a>` tag, and so on. See https://www.w3schools.com/html/html_intro.asp for an introduction.
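To make these tags concrete, here is a minimal sketch: `rvest` can parse an HTML string directly, which is a handy way to experiment with tags before scraping a live site (the HTML fragment below is illustrative, not from a real page):

```r
library(rvest)

# A toy HTML fragment (illustrative only)
page <- read_html('
  <html><body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <a href="https://www.gov.uk">A link</a>
  </body></html>')

page %>% html_nodes("p") %>% html_text()       # text of each <p> element
page %>% html_nodes("a") %>% html_attr("href") # target of each <a> link
```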
SelectorGadget enables finding the tags used in style sheets when a page's layout is more sophisticated.
rvest

`rvest` is built on the `httr` package, which encapsulates web communications. Given a url (weblink), `rvest` will read the page html or xml.
```r
url <- "https://www.gov.uk/government/organisations/department-for-education"
read_html(url)[2]
```
We can then extract information based on the structure of the html, using the tags. For example, we can extract all the links on the page, or the text of a given paragraph:
```r
read_html(url) %>% html_nodes("a") %>% html_attr("href")

read_html(url) %>% html_nodes("p") %>% .[4] %>% html_text()
```
or just the first few links:

```r
read_html(url) %>% html_nodes("a") %>% .[1:6] %>% html_attr("href")
```
Sometimes web pages are laid out in a more complex manner or use style sheets, and the page developers may use individualised tags to lay out the page. SelectorGadget can help identify these tags and aid data extraction.

Let's look at PHE's webpages and try to extract information:
https://www.gov.uk/government/organisations/public-health-england
```r
url <- "https://www.gov.uk/government/organisations/public-health-england"
read_html(url) %>% html_nodes(".column-half") %>% .[1:10] %>% html_text()
```
We have built a package based on these ideas to facilitate extracting data from webpages.
The package is only available on GitHub and can be installed using `devtools`.
```r
library(devtools)
devtools::install_github("julianflowers/myScrapers")
```
In Python, `beautiful soup` is the main package for web scraping. Both toolkits are built on top of the web commands which underpin the exchange of information across the internet.

In R, given any weblink or url, we can read the site with the `GET` function in `httr`.

Let's use https://fingertips.phe.org.uk as an example:

```r
url <- "https://fingertips.phe.org.uk"
html <- GET(url)
html
```

```r
rvest <- read_html(url)
rvest
rvest %>% html_nodes("a") %>% .[1] %>% html_attr("href")
```

In `myScrapers` there are 4 primary functions built on these:

- `get_page_links`, which identifies the links on a webpage
- `get_page_text`, which extracts text from a webpage
- `get_page_docs`, which identifies pdf or .doc* files on a page
- `get_page_csv`, which finds csv files and spreadsheets on a page
We can use `get_page_links` to extract information from the following page of PHE statistical releases: https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england
```r
url <- "https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england"
get_page_links(url) %>% .[19:40]
```
We'll use GP in hours syndromic surveillance data to illustrate further uses. This report "Monitors the number of people who visit their GP during surgery hours under the syndromic surveillance system."
The system publishes weekly reports and spreadsheets - to obtain a year's worth of these reports manually would require 104 separate downloads.
Using a webscraping approach this can be achieved in a few lines of code.
The code below identifies all the pdf reports on the page.
```r
urls <- "https://www.gov.uk/government/publications/gp-in-hours-weekly-bulletins-for-2018"
get_page_docs(urls) %>% head(10) %>% unique()
```
We can then use the `downloader` package to download the pdfs:
```r
## not run
library(downloader)
get_page_links(urls) %>%
  .[grepl("pdf$", .)] %>%
  head(10) %>%
  unique() %>%
  purrr::map(., ~download(.x, destfile = basename(.x)))
```
We can take a similar approach to spreadsheets.
```r
urls <- "https://www.gov.uk/government/publications/gp-in-hours-weekly-bulletins-for-2018"
get_page_links(urls) %>%
  .[grepl("xls.?$", .)] %>%
  head(4) %>%
  unique() %>%
  purrr::map(., ~downloader::download(.x, destfile = basename(.x)))
```
Having downloaded the reports or spreadsheets it is now straightforward to import them for further analysis.
```r
library(readxl)
files <- list.files(pattern = ".xls")
data <- purrr::map(files, ~(read_excel(.x, sheet = "Local Authority", na = "*",
                                       skip = 4)))
head(data)
```
This follows the same principle. The function `get_page_text` is designed to extract text from a webpage. For example, we can extract the text from an article about Matt Hancock's description of "predictive prevention".
```r
get_page_text("https://www.thetimes.co.uk/article/nhs-will-use-phone-data-to-predict-threats-to-your-health-r7085zqfq") %>%
  .[1:4]
```
Using simple functions it is relatively easy to scrape Duncan Selbie's blogs into a data frame for further analysis.
The base URL is https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/, and there are 8 pages of results so the first task is to create a list of urls.
```r
url_ds <- "https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/"
url_ds1 <- paste0(url_ds, "page/", 2:8)
urls_ds <- c(url_ds, url_ds1)
```
Then we can extract links and isolate those specific to the Friday messages:
```r
links <- purrr::map(urls_ds, ~(get_page_links(.x)))
friday_message <- links %>%
  purrr::flatten() %>%
  .[grepl("duncan-selbies-friday-message", .)] %>%
  .[!grepl("comments", .)] %>%
  unique()
head(friday_message)
```
and then extract blog text:
```r
library(tm)
library(magrittr)
blog_text <- purrr::map(friday_message, ~(get_page_text(.x)))
blog_text <- purrr::map(blog_text, ~(str_remove(.x, "\\n")))
blog_text <- purrr::map(blog_text, ~(str_remove(.x, " GOV.UK blogs use cookies to make the site simpler. Find out more about cookies\n ")))
blog_text <- purrr::map(blog_text, ~(str_remove(.x, "Dear everyone")))
blog_title <- purrr::map(blog_text, 2)
names(blog_text) <- blog_title
blog_text1 <- purrr::map(blog_text, extract, 5:11)
blog_text2 <- purrr::map(blog_text1, data.frame)
blog_text2 <- purrr::map_df(blog_text2, bind_rows)
blog_text2 <- blog_text2 %>% mutate(text = clean_texts(.x..i..))
```
We can then visualise with, for example, a wordcloud.
```r
library(quanteda)
corp <- corpus(blog_text2$text)
dfm <- dfm(corp, ngrams = 2,
           remove = c("government_licence", "open_government", "public_health",
                      "official_blog", "blog_public", "health_england",
                      "cancel_reply", "content available", "health_blog",
                      "licence_v", "best_wishes", "otherwise_stated",
                      "except_otherwise", "friday_messages", "available_open"))
textplot_wordcloud(dfm, color = viridis::plasma(n = 10))
```
I have added a few functions to the package.

`get_dsph_england` returns a list of local authorities and their current DsPH. It scrapes https://www.gov.uk/government/publications/directors-of-public-health-in-england--2/directors-of-public-health-in-england
```r
dsph <- get_dsph_england()
dsph %>% knitr::kable()
```
`get_phe_catalogue` identifies all the PHE publications on GOV.UK. For this function you have to set the `n` argument; we recommend starting at n = 110. This produces an interactive, searchable table of links.
```r
cat <- get_phe_catalogue(n = 110)
cat
```
We have added a `get_pqs` function, built on the `hansard` package, to extract PQs addressed to, answered by, or mentioning PHE. This takes a start date as an argument in the form yyyy-mm-dd.
```r
pqs <- get_pqs(start_date = "2019-03-01")
pqs
```
We can look at the categories of questions asked.
```r
pqs %>%
  group_by(hansard_category) %>%
  count() %>%
  arrange(-n) %>%
  top_n(10)
```
Using myScrapers to extract NICE guidance

We can use the toolkit to extract NICE Public Health Guidance as follows:

- First, identify the URLs for NICE PH guidance; they relate to https://www.nice.org.uk/guidance/published?type=ph
- Then create a full URL for the recommendations pages
- Then extract the text
```r
url <- "https://www.nice.org.uk/guidance/published?type=ph"
links <- get_page_links(url)[13:22]  ## first 10 sets of guidance
links1 <- purrr::map(links, ~(paste0("https://www.nice.org.uk", .x, "/chapter/Recommendations")))
pander::panderOptions("table.style", "multiline")
pander::panderOptions("table.alignment.default", "left")
recommendations <- purrr::map(links1, ~(get_page_text(.x))) %>%
  purrr::map(., data.frame)
head(recommendations, 1) %>% knitr::kable()
```
We can also combine `googlesearchR` with `get_page_text` to scrape the text of a search result:

```r
search <- "google analytics understanding users"
g <- googlesearchR(search, n = 10)
g[13] %>%
  as.character() %>%
  get_page_text(.)
```