knitr::opts_chunk$set(collapse = TRUE, warning = FALSE, message = FALSE, cache = TRUE, echo = TRUE, comment = "#>")
library(tidyverse)
library(myScrapers)
library(Rcrawler)
library(rvest)
This vignette illustrates the basics of web scraping and some features of the myScrapers package, in particular its simple web-scraping functions. We also show functions in the package specifically designed to retrieve public health information for public health practitioners.
The basic toolkit is:
- The rvest package in R, or Beautiful Soup in Python.
- A basic knowledge of html is helpful - especially elements and tags. For example, paragraphs are defined with the <p> tag, links with the <a> tag and so on (see https://www.w3schools.com/html/html_intro.asp); a minimal example follows below.
- Selectorgadget, which enables finding the tags used in style sheets when page layout is more sophisticated.
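To make the tags concrete, here is a minimal sketch that uses rvest (introduced in the next section) to parse a small, made-up HTML fragment; the fragment and its "intro" class are purely illustrative:

## a minimal illustrative sketch: parse a made-up HTML fragment and extract content by tag
library(rvest)
example_html <- '<html><body><p class="intro">A paragraph with a <a href="https://example.com">link</a>.</p><p>Another paragraph.</p></body></html>'
page <- read_html(example_html)
page %>% html_nodes("p") %>% html_text()        ## text inside <p> tags
page %>% html_nodes("a") %>% html_attr("href")  ## link targets inside <a> tags
page %>% html_nodes(".intro") %>% html_text()   ## a CSS class selector - the kind of tag Selectorgadget identifies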
rvest
rvest is built on the httr package which encapsulates web communications. Given a url (weblink) rvest will read the page html or xml.
url <- "https://www.gov.uk/government/organisations/department-for-education" read_html(url)[2]
We can then extract information based on the structure of the html using the tags.
For example we can extract links or paragraphs...
read_html(url) %>% html_nodes("a") %>% html_attr("href") read_html(url) %>% html_nodes("p") %>% .[4] %>% html_text()
or just the first few links...
read_html(url) %>% html_nodes("a") %>% .[1:6] %>% html_attr("href")
Sometimes web pages are laid out in a more complex manner or use style sheets, and the page developers may use individualised tags to lay out the page. Selectorgadget can help identify these tags and aid data extraction.
Let's look at PHE's webpages and try to extract information.
https://www.gov.uk/government/organisations/public-health-england
url <- "https://www.gov.uk/government/organisations/public-health-england" read_html(url) %>% html_nodes(".column-half") %>% .[1:10] %>% html_text()
We have built a package based on these ideas to facilitate extracting data from webpages.
The package is only available on GitHub and can be installed using devtools.
library(devtools)
devtools::install_github("julianflowers/myScrapers")
In Python, Beautiful Soup is the main package for web scraping. Both rvest and Beautiful Soup are built on top of the web commands which underpin the exchange of information across the internet.
In R, given any weblink or url we can read the site with the GET function in httr.
Let's use https://fingertips.phe.org.uk...
url <- "https://fingertips.phe.org.uk" html <- GET(url) html
page <- read_html(url)
page
page %>% html_nodes("a") %>% .[1] %>% html_attr("href")
In myScrapers there are 4 primary functions built on these:

- get_page_links, which identifies the links on a webpage
- get_page_text, which extracts text from a webpage
- get_page_docs, which identifies pdf or doc files on a page
- get_page_csv, which finds csv files and spreadsheets on a page
We can use get_page_links to extract information from the following page of PHE statistical releases: https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england
url <- "https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england" get_page_links(url) %>% .[19:40]
We'll use GP in-hours syndromic surveillance data to illustrate further uses. This report "Monitors the number of people who visit their GP during surgery hours under the syndromic surveillance system."
The system publishes weekly reports and spreadsheets - to obtain a year's worth of these reports manually would require 104 separate downloads.
Using a webscraping approach this can be achieved in a few lines of code.
The code below identifies all the pdf reports on the page.
urls <- "https://www.gov.uk/government/publications/gp-in-hours-weekly-bulletins-for-2018" get_page_docs(urls) %>% head(10) %>% unique()
We can then use the downloader package to download the pdfs:
## not run
library(downloader)
get_page_links(urls) %>%
  .[grepl("pdf$", .)] %>%
  head(10) %>%
  unique() %>%
  purrr::map(., ~download(.x, destfile = basename(.x)))
We can take a similar approach to spreadsheets.
urls <- "https://www.gov.uk/government/publications/gp-in-hours-weekly-bulletins-for-2018" get_page_links(urls) %>% .[grepl("xls.?$", .)] %>% head(4) %>% unique() %>% <<<<<<< HEAD map(., ~downloader::download(.x, destfile = basename(.x))) ======= purrr::map(., ~downloader::download(.x, destfile = basename(.x))) >>>>>>> e158c7a37faff105fa38197f6b669520774b20af
Having downloaded the reports or spreadsheets it is now straightforward to import them for further analysis.
library(readxl)
files <- list.files(pattern = ".xls")
data <- purrr::map(files, ~(read_excel(.x, sheet = "Local Authority", na = "*", skip = 4)))
head(data)
This follows the same principle. The function get_page_text is designed to extract text from a webpage. For example, we can extract the text from an article about Matt Hancock's description of "predictive prevention".
get_page_text("https://www.thetimes.co.uk/article/nhs-will-use-phone-data-to-predict-threats-to-your-health-r7085zqfq") %>% .[1:4]
Using simple functions it is relatively easy to scrape Duncan Selbie's blogs into a data frame for further analysis.
The base URL is https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/, and there are 8 pages of results so the first task is to create a list of urls.
url_ds <- "https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/" url_ds1 <- paste0(url_ds, "page/", 2:8) urls_ds <- c(url_ds, url_ds1)
Then we can extract links and isolate those specific to the Friday messages:
links <- purrr::map(urls_ds, ~(get_page_links(.x)))
friday_message <- links %>%
  purrr::flatten() %>%
  .[grepl("duncan-selbies-friday-message", .)] %>%
  .[!grepl("comments", .)] %>%
  unique()
head(friday_message)
and then extract blog text:
library(tm)
library(magrittr)
## scrape the text of each blog post
blog_text <- purrr::map(friday_message, ~(get_page_text(.x)))
## remove newlines, the cookie banner and the salutation
blog_text <- purrr::map(blog_text, ~(str_remove(.x, "\\n")))
blog_text <- purrr::map(blog_text, ~(str_remove(.x, " GOV.UK blogs use cookies to make the site simpler. Find out more about cookies\n ")))
blog_text <- purrr::map(blog_text, ~(str_remove(.x, "Dear everyone")))
## use the second element of each page as the blog title
blog_title <- purrr::map(blog_text, 2)
names(blog_text) <- blog_title
## keep the body of each post and combine into a single data frame
blog_text1 <- purrr::map(blog_text, extract, 5:11)
blog_text2 <- purrr::map(blog_text1, data.frame)
blog_text2 <- purrr::map_df(blog_text2, bind_rows)
blog_text2 <- blog_text2 %>% mutate(text = clean_texts(.x..i..))
We can then visualise with, for example, a wordcloud.
library(quanteda)
corp <- corpus(blog_text2$text)
dfm <- dfm(corp, ngrams = 2,
           remove = c("government_licence", "open_government", "public_health", "official_blog",
                      "blog_public", "health_england", "cancel_reply", "content available",
                      "health_blog", "licence_v", "best_wishes", "otherwise_stated",
                      "except_otherwise", "friday_messages", "available_open"))
textplot_wordcloud(dfm, color = viridis::plasma(n = 10))
We have added a few more functions to the package.
get_dsph_england returns a list of local authorities and their current Directors of Public Health (DsPH). It scrapes https://www.gov.uk/government/publications/directors-of-public-health-in-england--2/directors-of-public-health-in-england
dsph <- get_dsph_england()
dsph %>% knitr::kable()
get_phe_catalogue identifies all the PHE publications on GOV.UK. For this function you have to set the n = argument. We recommend starting at n = 110. This produces an interactive searchable table of links.
cat <- get_phe_catalogue(n = 110)
cat
We have added a get_pq function built on the hansard package to extract PQs addressed to, answered by or mentioning PHE. This takes a start date as an argument in the form yyyy-mm-dd.
pqs <- get_pqs(start_date = "2019-03-01")
pqs
We can look at the categories of questions asked.
pqs %>%
  group_by(hansard_category) %>%
  count() %>%
  ungroup() %>%
  arrange(-n) %>%
  top_n(10)
Using myScrapers to extract NICE guidance
We can use the toolkit to extract NICE Public Health Guidance as follows:
Firstly we'll identify the URLs for NICE PH guidance - they are listed at https://www.nice.org.uk/guidance/published?type=ph
Then create a full URL for recommendations
Then extract the text
url <- "https://www.nice.org.uk/guidance/published?type=ph" links <- get_page_links(url)[13:22] ##first 10 sets of guidance links1 <- purrr::map(links, ~(paste0("https://www.nice.org.uk", .x, "/chapter/Recommendations"))) pander::panderOptions("table.style", "multiline") pander::panderOptions("table.alignment.default", "left") recommendations <- purrr::map(links1, ~(get_page_text(.x))) %>% purrr::map(., data.frame) head(recommendations, 1) %>% knitr::kable()
search <- "google analytics understanding users" g <- googlesearchR(search, n = 10) g[13] %>% as.character() %>% get_page_text(.) %>%