```r
knitr::opts_chunk$set(
  collapse = TRUE,
  warning = FALSE,
  message = FALSE,
  cache = TRUE,
  echo = TRUE,
  comment = "#>"
)
library(tidyverse)
library(myScrapers)
library(Rcrawler)
library(rvest)
```
This vignette illustrates some of the basics of web-scraping and some features of the myScrapers package, in particular its simple web-scraping functions. We also show some functions in the package specifically designed to retrieve public health information for public health practitioners.
The basic toolkit is:

- the `rvest` package in R, or
- `beautiful soup` in Python.

A basic knowledge of HTML is helpful, especially elements and tags. For example, paragraphs are defined with the `<p>` tag, links with the `<a>` tag, and so on. See https://www.w3schools.com/html/html_intro.asp for an introduction.
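To make these tags concrete, here is a minimal sketch: `rvest` can parse an HTML string directly, which is a handy way to experiment with tags before scraping a live site (the HTML fragment below is illustrative, not from a real page):

```r
library(rvest)

# A toy HTML fragment (illustrative only)
page <- read_html('
  <html><body>
    <p>First paragraph</p>
    <p>Second paragraph</p>
    <a href="https://www.gov.uk">A link</a>
  </body></html>')

page %>% html_nodes("p") %>% html_text()       # text of each <p> element
page %>% html_nodes("a") %>% html_attr("href") # target of each <a> link
```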
SelectorGadget enables finding the tags used in style sheets when a page's layout is more sophisticated.
rvest

`rvest` is built on the `httr` package, which encapsulates web communications. Given a url (weblink), `rvest` will read the page html or xml.
```r
url <- "https://www.gov.uk/government/organisations/department-for-education"
read_html(url)[2]
```
We can then extract information based on the structure of the html, using the tags. For example, we can extract all the links on the page, or the text of a given paragraph:
```r
read_html(url) %>% html_nodes("a") %>% html_attr("href")

read_html(url) %>% html_nodes("p") %>% .[4] %>% html_text()
```
or just the first few links:

```r
read_html(url) %>% html_nodes("a") %>% .[1:6] %>% html_attr("href")
```
Sometimes web pages are laid out in a more complex manner or use style sheets, and the page developers may use individualised tags to lay out the page. SelectorGadget can help identify these tags and aid data extraction.

Let's look at PHE's webpages and try to extract information:
https://www.gov.uk/government/organisations/public-health-england
```r
url <- "https://www.gov.uk/government/organisations/public-health-england"
read_html(url) %>% html_nodes(".column-half") %>% .[1:10] %>% html_text()
```
We have built a package based on these ideas to facilitate extracting data from webpages.
The package is only available on GitHub and can be installed using `devtools`.
```r
library(devtools)
devtools::install_github("julianflowers/myScrapers")
```
In Python, `beautiful soup` is the main package for web scraping. Both toolkits are built on top of the web commands which underpin the exchange of information across the internet.

In R, given any weblink or url, we can read the site with the `GET` function in `httr`.

Let's use https://fingertips.phe.org.uk as an example:

```r
url <- "https://fingertips.phe.org.uk"
html <- GET(url)
html
```

```r
rvest <- read_html(url)
rvest
rvest %>% html_nodes("a") %>% .[1] %>% html_attr("href")
```

In `myScrapers` there are 4 primary functions built on these:

- `get_page_links`, which identifies the links on a webpage
- `get_page_text`, which extracts text from a webpage
- `get_page_docs`, which identifies pdf or .doc* files on a page
- `get_page_csv`, which finds csv files and spreadsheets on a page
We can use `get_page_links` to extract information from the following page of PHE statistical releases: https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england
```r
url <- "https://www.gov.uk/government/statistics?departments%5B%5D=public-health-england"
get_page_links(url) %>% .[19:40]
```
We'll use GP in hours syndromic surveillance data to illustrate further uses. This report "Monitors the number of people who visit their GP during surgery hours under the syndromic surveillance system."
The system publishes weekly reports and spreadsheets - to obtain a year's worth of these reports manually would require 104 separate downloads.
Using a webscraping approach this can be achieved in a few lines of code.
The code below identifies all the pdf reports on the page.
```r
urls <- "https://www.gov.uk/government/publications/gp-in-hours-weekly-bulletins-for-2018"
get_page_docs(urls) %>% head(10) %>% unique()
```
We can then use the `downloader` package to download the pdfs:
```r
## not run
library(downloader)
get_page_links(urls) %>%
  .[grepl("pdf$", .)] %>%
  head(10) %>%
  unique() %>%
  purrr::map(., ~download(.x, destfile = basename(.x)))
```
We can take a similar approach to spreadsheets.
```r
urls <- "https://www.gov.uk/government/publications/gp-in-hours-weekly-bulletins-for-2018"
get_page_links(urls) %>%
  .[grepl("xls.?$", .)] %>%
  head(4) %>%
  unique() %>%
  purrr::map(., ~downloader::download(.x, destfile = basename(.x)))
```
Having downloaded the reports or spreadsheets it is now straightforward to import them for further analysis.
```r
library(readxl)
files <- list.files(pattern = ".xls")
data <- purrr::map(files, ~(read_excel(.x, sheet = "Local Authority", na = "*",
                                       skip = 4)))
head(data)
```
This follows the same principle. The function `get_page_text` is designed to extract text from a webpage. For example, we can extract the text from an article about Matt Hancock's description of "predictive prevention".
```r
get_page_text("https://www.thetimes.co.uk/article/nhs-will-use-phone-data-to-predict-threats-to-your-health-r7085zqfq") %>%
  .[1:4]
```
Using simple functions it is relatively easy to scrape Duncan Selbie's blogs into a data frame for further analysis.
The base URL is https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/, and there are 8 pages of results so the first task is to create a list of urls.
```r
url_ds <- "https://publichealthmatters.blog.gov.uk/category/duncan-selbie-friday-message/"
url_ds1 <- paste0(url_ds, "page/", 2:8)
urls_ds <- c(url_ds, url_ds1)
```
Then we can extract links and isolate those specific to the Friday messages:
```r
links <- purrr::map(urls_ds, ~(get_page_links(.x)))
friday_message <- links %>%
  purrr::flatten() %>%
  .[grepl("duncan-selbies-friday-message", .)] %>%
  .[!grepl("comments", .)] %>%
  unique()
head(friday_message)
```
and then extract blog text:
```r
library(tm)
library(magrittr)
blog_text <- purrr::map(friday_message, ~(get_page_text(.x)))
blog_text <- purrr::map(blog_text, ~(str_remove(.x, "\\n")))
blog_text <- purrr::map(blog_text, ~(str_remove(.x, " GOV.UK blogs use cookies to make the site simpler. Find out more about cookies\n ")))
blog_text <- purrr::map(blog_text, ~(str_remove(.x, "Dear everyone")))
blog_title <- purrr::map(blog_text, 2)
names(blog_text) <- blog_title
blog_text1 <- purrr::map(blog_text, extract, 5:11)
blog_text2 <- purrr::map(blog_text1, data.frame)
blog_text2 <- purrr::map_df(blog_text2, bind_rows)
blog_text2 <- blog_text2 %>% mutate(text = clean_texts(.x..i..))
```
We can then visualise with, for example, a wordcloud.
```r
library(quanteda)
corp <- corpus(blog_text2$text)
dfm <- dfm(corp, ngrams = 2,
           remove = c("government_licence", "open_government", "public_health",
                      "official_blog", "blog_public", "health_england",
                      "cancel_reply", "content available", "health_blog",
                      "licence_v", "best_wishes", "otherwise_stated",
                      "except_otherwise", "friday_messages", "available_open"))
textplot_wordcloud(dfm, color = viridis::plasma(n = 10))
```
I have added a few functions to the package.

`get_dsph_england` returns a list of local authorities and their current DsPH. It scrapes https://www.gov.uk/government/publications/directors-of-public-health-in-england--2/directors-of-public-health-in-england
```r
dsph <- get_dsph_england()
dsph %>% knitr::kable()
```
`get_phe_catalogue` identifies all the PHE publications on GOV.UK. For this function you have to set the `n` argument; we recommend starting at n = 110. This produces an interactive, searchable table of links.
```r
cat <- get_phe_catalogue(n = 110)
cat
```
We have added a `get_pqs` function, built on the `hansard` package, to extract PQs addressed to, answered by, or mentioning PHE. This takes a start date as an argument in the form yyyy-mm-dd.
```r
pqs <- get_pqs(start_date = "2019-03-01")
pqs
```
We can look at the categories of questions asked.
```r
pqs %>%
  group_by(hansard_category) %>%
  count() %>%
  arrange(-n) %>%
  top_n(10)
```
Using myScrapers to extract NICE guidance

We can use the toolkit to extract NICE Public Health Guidance as follows:

- First, identify the URLs for NICE PH guidance; they relate to https://www.nice.org.uk/guidance/published?type=ph
- Then create a full URL for the recommendations pages
- Then extract the text
```r
url <- "https://www.nice.org.uk/guidance/published?type=ph"
links <- get_page_links(url)[13:22]  ## first 10 sets of guidance
links1 <- purrr::map(links, ~(paste0("https://www.nice.org.uk", .x, "/chapter/Recommendations")))
pander::panderOptions("table.style", "multiline")
pander::panderOptions("table.alignment.default", "left")
recommendations <- purrr::map(links1, ~(get_page_text(.x))) %>%
  purrr::map(., data.frame)
head(recommendations, 1) %>% knitr::kable()
```
We can also combine `googlesearchR` with `get_page_text` to scrape the text of a search result:

```r
search <- "google analytics understanding users"
g <- googlesearchR(search, n = 10)
g[13] %>%
  as.character() %>%
  get_page_text(.)
```