```r
knitr::opts_chunk$set(
  # code chunk options
  echo     = TRUE,
  eval     = TRUE,
  warning  = FALSE,
  message  = FALSE,
  cache    = FALSE,
  exercise = TRUE
)
```
```r
library(learnr)
library(learn2scrape)
library(rvest)
library(urltools)

quotepage <- read_html(system.file("extdata", "quotepage.html", package = "learn2scrape"))
```
In tutorial "201-rvest-introduction", you have learned how to use some of the basic functions of the rvest
package: read_html()
, html_elements()
and html_text()
.
But most of the time, we are not interested in just a single page but in multiple pages from the same domain, e.g., all newspaper reports by a specific newspaper or all speeches by a politician.
So we usually need another step in our data extraction pipeline. Accordingly, you will learn two things in this tutorial: how to extract hyperlinks from a web page, and how to iterate over multiple pages using loops and functions.
You will see that effective web scraping relies on some fundamental R programming techniques. So the more you get used to writing loops and functions, the easier it will be to solve real-world web scraping problems. Once you master these tasks, web scraping will become an easy-to-handle step of your data collection projects.
We will use the following R packages in this tutorial:
```r
library(rvest)
library(urltools)
```
Our first goal is to extract a number of links from a web page.
In your browser, you typically recognize links by their formatting (colored or underlined text) and by the cursor changing when you hover over them.
Now we've already seen that what you see in your browser is determined by underlying HTML code. So we first need to understand how links are defined in HTML.
We have previously discussed HTML elements and their properties. You will remember that each HTML element has a certain type that is described by its "tag" and that individual elements can have attributes associated with them. For example, a paragraph HTML element may belong to class "quote", which would be written in HTML as
```html
<p class="quote">...</p>
```
This knowledge helps us to understand how links can appear on a webpage. An example of HTML code is shown below:
```html
<p>
  This is some text <a href="http://quotes.toscrape.com/">with a link</a>.
</p>
```
The first part of this sentence ("This is some text") would appear as ordinary text. However, the second part ("with a link") would appear differently and clicking on it would direct you to http://quotes.toscrape.com/:
This is some text with a link.
This is because the text "with a link" is wrapped in an anchor tag. So links are basically anchor elements containing an 'href' attribute. 'href' stands for hypertext reference and specifies the webpage the link leads to.
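To see these pieces in action, here is a minimal sketch (not part of the original exercise; the functions used here are explained in the next section) that parses the snippet above directly from a character string and pulls out the link target:

```r
library(rvest)

# parse the example snippet from a character string
snippet <- read_html('<p>This is some text <a href="http://quotes.toscrape.com/">with a link</a>.</p>')

# select the anchor element and extract its 'href' attribute
link <- html_element(snippet, "a")
html_attr(link, "href")
#> [1] "http://quotes.toscrape.com/"
```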
### Extracting links with rvest

So to extract hyperlinks from a webpage, we need functionality to

1. select all anchor elements of a page, and
2. extract the value of their 'href' attributes.

In the rvest package, this can be achieved with the `html_elements()` and `html_attr()` functions, respectively.
How to use `html_elements()` to select elements by their tag name has already been covered in a previous tutorial: you simply pass the tag name to the `css` argument of `html_elements()`.

`html_attr()`, in turn, can be used to extract any type of attribute from HTML elements, and below we'll use it to extract hyperlink information from anchor elements. To do so, we pass the value 'href' to the `name` argument of the `html_attr()` function. For example:

```r
html_attr(parsed_page, name = "href")
```
Note, however, that attributes like 'href' are associated with individual web elements. So calling `html_attr()` will only work on individual web elements, not on an entire page or a list of elements.

So to extract hyperlinks from an entire page, we proceed in two steps:

1. select all anchor elements of the page, and
2. extract the 'href' attribute from each of these elements.

We can implement the first step using `html_elements()` and the second step by applying `html_attr()` to each extracted element.
Try it yourself: Please try two things based on [http://quotes.toscrape.com/](http://quotes.toscrape.com/){target="_blank"}: first, extract all anchor elements of the page; second, extract the 'href' values of these elements.
url <- "http://quotes.toscrape.com/" page <- read_html(url) ## ToDo: extract all anchor elements (i.e., web elements with an 'a' tag) anchor_elements <- ... ## ToDo: extract all 'href' values of these elements hyperlinks <- ...
Do you notice something about the links? Some of them are missing parts!
This is because they are relative links, that is, they specify pages relative to the root of the folder structure of the webpage.
To "repair" these links, we need to add the base URL of the webpage.
This is typically just the URL of the webpage we originally scraped from; in our case, "http://quotes.toscrape.com/".
To add the base URL in front of the relative links, we can use the `paste()` function. `paste()` combines/glues/concatenates two or more character values. Note that `paste()` accepts an argument `sep` that specifies how the individual values should be separated. By default, `sep = " "`, so a white space is added between individual values. To avoid this, you can either set `sep = ""` or directly use `paste0()`, which overwrites this default with `sep = ""` so that characters are combined without inserting a white space in between.
Try it yourself if you have never used paste:
paste("a", "b") paste("a", "b", sep = "") paste0("a", "b")
Now, completing the paths of the URLs we scraped should not be a problem for you. Re-use the code you used to extract the relative links and add the base URL (http://quotes.toscrape.com/) in front of them.
url <- "http://quotes.toscrape.com/" page <- read_html(url) anchor_elements <- html_elements(page, "a") rel_links <- html_attr(anchor_elements, "href") # ToDo: add the base URL in front hyperlinks <- ...
*Caution: Watch out for the slashes between the base URL and the address of your page - having none or too many slashes is a typical problem!*
To make 100% sure that you are adding the right information to relative links, you can use the `url_parse()` function from the 'urltools' package to get at the base URL of a page.
```r
parsed_url <- urltools::url_parse("http://quotes.toscrape.com/")
str(parsed_url, 1)

# combine scheme (e.g., 'https') and domain info to get the base URL
base_url <- paste0(parsed_url$scheme, "://", parsed_url$domain)
```
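As an aside (not used in the original tutorial), the xml2 package that rvest builds on provides `url_absolute()`, which resolves relative links against a base URL and takes care of the slashes for you:

```r
rel_links <- c("/", "/login", "/author/Albert-Einstein")

# resolve relative links against the base URL
xml2::url_absolute(rel_links, "http://quotes.toscrape.com/")
#> [1] "http://quotes.toscrape.com/"
#> [2] "http://quotes.toscrape.com/login"
#> [3] "http://quotes.toscrape.com/author/Albert-Einstein"
```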
Another thing you'll have noticed in the previous exercise is that by extracting all anchor elements and their corresponding 'href' values, we get a lot of (relative) links we might not be interested in. In the above exercise, the first five extracted (relative) links were:
[1] "/" [2] "/login" [3] "/author/Albert-Einstein" [4] "/tag/change/page/1/" [5] "/tag/deep-thoughts/page/1/"
The first one points to the page itself, the second one to the login page, the third one to the author of the first quote, and the last two to tags associated with the first quote.
Depending on your use case, this can be a problem! As a running example, suppose you want to build a list of all the tags that are used on http://quotes.toscrape.com/. In this case you wouldn't want to also extract the links to author pages or the login page.
This is a typical challenge in web scraping: we usually want to extract information from only a subset of the web elements that make up a web page.
We tackle this challenge by combining the different functions we can execute on the parsed HTML code of a page in a pipeline. In the case of extracting the links behind individual tags, for example, we would

1. parse the page,
2. select only those web elements that represent tags,
3. extract the 'href' value from each of these elements, and
4. complete the relative links by adding the base URL.

Let's try this in our example of extracting the links behind individual tags.
First, we need to parse the page. We already know how to do this:
```r
# parse URL
url <- "http://quotes.toscrape.com/"
page <- read_html(url)
```
Next, we need to figure out what HTML code makes web elements appear as "tag". Again, you have already learned ways to answer this question. For example, you can use SelectorGadget or simply inspect the page's HTML source code. Go to http://quotes.toscrape.com/ to answer the question below (multiple correct answers)!
```r
quiz(
  caption = NULL,
  question("What piece of HTML code makes web elements appear as 'tag'?",
    answer("an 'a' HTML tag", message = "Almost correct! All 'tag' elements are anchor elements, but not all anchor elements are 'tag' elements!"),
    answer("class='tag'", correct = TRUE),
    answer("The relative path starts with 'tag/'", correct = TRUE),
    answer("a 'meta' HTML tag", message = "Not quite! All 'tag' elements are nested in a 'meta' element, but individual 'tag' elements are not 'meta' elements"),
    allow_retry = TRUE,
    random_answer_order = TRUE
  )
)
```
So we now know that there are two ways to identify 'tag' elements on this page, one of which we already know how to implement from a previous exercise: extracting web elements based on their class name.

Hint: In case you don't remember, you can extract web elements based on their class name with `html_elements()` and its `css` argument by passing the class name with a dot in front. For example, elements of class 'text' can be addressed with `css = '.text'`.
With this information, we can extract all these elements and their 'href' values. Finally, we add the base URL:
```r
# parse URL
url <- "http://quotes.toscrape.com/"
page <- read_html(url)

# extract 'tag' elements based on their class name
tags <- html_elements(page, ".tag")
tag_hrefs <- html_attr(tags, "href")
tag_urls <- paste0(url, tag_hrefs)
```
The previous example illustrates that a crucial skill in web scraping is to first develop a clear understanding of the individual steps you need to complete in order to reliably extract the data you are interested in. You can think of this as a divide-and-conquer approach: you divide a big task into several smaller tasks and then solve these tasks one after another until everything is done. In this way, you can concentrate on solving one problem at a time without getting overwhelmed by the bigger picture.
Take another toy example: Suppose you want to extract only the hyperlinks referring to the pages of quoted authors. Next, you want to extract where and when each author was born (if available) from the individual author pages (e.g., http://quotes.toscrape.com/author/Albert-Einstein/).
Once we have collected all relevant hyperlinks, there are multiple ways to achieve this:

- write a `for`-loop that loops over the vector of links, loads and parses the HTML code of each page, and scrapes the relevant information from each of them, or
- write a function that does this for a single link and then `lapply()` the function to the vector of links.

For now, we will start with the easiest variant and just create a `for`-loop. Later, we will also use `lapply()`, but there are good reasons why you will often return to simple loops.
Let's brainstorm what we need to accomplish to scrape the relevant information from a single link:

1. load and parse the HTML code of the page,
2. select the web elements containing the author's name, birth date, and birth place, and
3. extract the text from these elements.
We already know how to accomplish the first step with `read_html()`. We also know that we can accomplish steps 2 and 3 using `html_element()`. What we still need to figure out is which CSS selectors to pass to `html_element()` to extract an author's name, birth date, and birth place.
Try it yourself! View the HTML source code or use the SelectorGadget to answer the following questions.
```r
quiz(
  caption = NULL, # "CSS selectors identifying author information"
  question("What CSS selector allows you to unambiguously extract an author's **name**?",
    answer("the 'name' tag"),
    answer("the 'author-title' class", correct = TRUE),
    answer("the 'author-details' class", message = "Almost! What you're looking for is nested in this web element!"),
    allow_retry = TRUE
  ),
  question("What CSS selector allows you to unambiguously extract an author's **birth date**?",
    answer("the 'born' tag"),
    answer("the 'author-born-date' class", correct = TRUE),
    answer("the 'author-details' class", message = "Almost! What you're looking for is nested in this web element!"),
    allow_retry = TRUE
  ),
  question("What CSS selector allows you to unambiguously extract an author's **birth place**?",
    answer("the 'born' tag"),
    answer("the 'author-born-location' class", correct = TRUE),
    answer("the 'author-details' class", message = "Almost! What you're looking for is nested in this web element!"),
    allow_retry = TRUE
  )
)
```
If you have figured this out, it is time to write some R code that extracts this information.
Try it yourself: Write code that extracts an author's name, birth date, and birth place. Use http://quotes.toscrape.com/author/Jane-Austen/ as an example. Write the results to a data frame with columns `author_name`, `author_born_on`, and `author_born_at`.
url <- "http://quotes.toscrape.com/author/Jane-Austen/" # parse HTML page <- read_html(url) # To Do: extract the relevant information author_name <- ... author_born_on <- .... author_born_at <- ... # cobmine (column-wise) in a data frame out <- data.frame(...)
Example Solution
```r
# parse HTML
url <- "http://quotes.toscrape.com/author/Jane-Austen/"
page <- read_html(url)

# extract the relevant information
author_name <- html_text(html_element(page, ".author-title"), trim = TRUE)
author_born_on <- html_text(html_element(page, ".author-born-date"), trim = TRUE)
author_born_at <- html_text(html_element(page, ".author-born-location"), trim = TRUE)

# combine (column-wise) in a data frame
out <- data.frame(author_name, author_born_on, author_born_at)
```
### `for`-loops

Now, try to put the code from above into a loop. Remember how `for`-loops work? We take a vector and iterate over its elements. In our case, this vector is called `author_urls` and it records the URLs of the first 4 authors whose quotes are listed on http://quotes.toscrape.com/:
url <- "http://quotes.toscrape.com" page <- read_html(url) author_urls <- paste0(url, html_attr(html_elements(page, xpath = "//a[text()='(about)']"), "href"))[1:4]
author_urls
Next, we want to extract the same information from each URL. We can recycle the code from above.

Try it yourself! Scrape the author name, birth date, and birth place from each URL in `author_urls` using a `for`-loop.

Caution: Remember to pause for a few seconds between iterations to avoid overloading the server you are sending your requests to! In case you don't remember from the API tutorials: you can use `Sys.sleep()` for this.
```r
# create a list collecting the extracted data
results <- list()

# To Do: complete the code to make the for-loop work
for (...) {

  # parse HTML (To Do: pass the object name that is returned by the for-loop)
  page <- read_html(...)

  # To Do: extract the relevant information
  author_name <- ...
  author_born_on <- ...
  author_born_at <- ...

  # To Do: combine (column-wise) in a data frame
  out <- data.frame(...)

  # To Do: add `out` as an element to `results`
  results[[...]] <- out

  # pause
  Sys.sleep(3)
}

# row-bind data frames
do.call(rbind, results)
```
Example Solution
```r
# create a list collecting the extracted data
results <- list()

# iterate over URLs
for (url in author_urls) {

  # parse HTML
  page <- read_html(url)

  # extract the relevant information
  author_name <- html_text(html_element(page, ".author-title"), trim = TRUE)
  author_born_on <- html_text(html_element(page, ".author-born-date"), trim = TRUE)
  author_born_at <- html_text(html_element(page, ".author-born-location"), trim = TRUE)

  # combine (column-wise) in a data frame
  out <- data.frame(author_name, author_born_on, author_born_at)

  # add `out` as an element to results
  results[[url]] <- out

  # pause to avoid overloading the server
  Sys.sleep(3)
}

# row-bind data frames
do.call(rbind, results)
```
You'll have noticed that you have executed the code inside the `for`-loop with a different URL in each iteration. In this example, this works just fine. Depending on what you do inside the `for`-loop, however, it can be better to wrap the code into a custom function.
Some arguments in favor of this are that a function keeps the loop body short and readable, that it can be tested and debugged on a single URL before you run the full loop, and that it makes it easier to handle errors for each page individually (as we will see below).
When analyzing the code you have written above to extract author information from individual pages, you'll notice that the only thing that changes between iterations is the URL you are extracting data from. Hence, our data extraction function should have a parameter that expects this information: `url`. We can then basically copy-paste the rest of the code into the function body:
```r
scrape_author_page <- function(url) {

  page <- read_html(url)

  # To Do: extract the relevant information
  author_name <- ...
  author_born_on <- ...
  author_born_at <- ...

  # To Do: combine (column-wise) in a data frame
  out <- data.frame(...)

  # return data
  return(out)
}
```
That's already it!
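To illustrate how the function would be used (a sketch, assuming you have completed the To Dos as in the example solution further above), the loop then shrinks to a few lines:

```r
results <- list()

for (url in author_urls) {
  # all the scraping logic now lives inside the function
  results[[url]] <- scrape_author_page(url)

  # pause between requests
  Sys.sleep(3)
}

do.call(rbind, results)
```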
There are several things we could do to make this function more robust:

1. check that `url` is a valid URL,
2. check that requesting `url` returns a valid HTTP response, and
3. define what to return if requesting `url` does not return a valid HTTP response.

To check that `url` is a valid URL, we could for example verify the following logical tests:
url <- "http://quotes.toscrape.com/author/Marilyn-Monroe/" # `url` is a character vector? is.character(url) # only one URL is passed? length(url) == 1L # `url` starts with 'http://' or 'https://'? grepl("^https?://", url)
We could add these checks using `stopifnot()` before calling `page <- read_html(url)` in the function body.
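For example, the first lines of the function body could then look like this (a sketch; the error messages are merely suggestions):

```r
scrape_author_page <- function(url) {

  # validate the input before sending any request
  stopifnot(
    "`url` must be a character value" = is.character(url),
    "`url` must have only one element" = length(url) == 1L,
    "`url` must start with 'http://' or 'https://'" = grepl("^https?://", url)
  )

  page <- read_html(url)

  # ... (rest of the function body as before)
}
```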
To check that requesting `url` returns a valid HTTP response, we could wrap `page <- read_html(url)` in the following code:
url <- "http://quotes.toscrape.com/author/Marilyn-Monroe/" # try request URL; catch error (if any) page <- tryCatch(read_html(url), error = function(err) err) # if an error was catched if (inherits(page, "error")) { stop("could not read page") }
Note: An alternative is to call `resp <- httr::GET(url)` and check `httr::status_code(resp) == 200`. However, this makes another HTTP request to the URL in addition to the one executed by `read_html()`, which is inefficient and duplicates the load we put on the server. A better alternative is to use `rvest::session()` instead of `read_html()`, which is discussed in tutorial "204-rvest-advanced".
Caveat: It turns out that http://quotes.toscrape.com/author/ always returns a valid response (try opening https://quotes.toscrape.com/author/xc%3Cgrgh/ in your browser), so `read_html(url)` never raises an error in this particular example. Our code example should be useful in many other scenarios, however.
We could do a similar thing with the code extracting the relevant data. Why? Because, for example, the code `author_name <- html_text(html_element(page, ".author-title"))` implicitly

- assumes that there is an element of class 'author-title' on the page, and
- assumes that applying `html_text()` to this element returns a single character value.

It turns out that in this particular case, these assumptions are well grounded:
page <- read_html("http://quotes.toscrape.com/author/Marilyn-Monroe/") # element that exists tmp <- html_text(html_element(page, ".author-title")) is.character(tmp) & length(tmp) == 1L tmp # element that _does not_ exist tmp <- html_text(html_element(page, "cwergfbaxcyeer")) is.character(tmp) & length(tmp) == 1L tmp
But in other cases it might be safer to guard against such implicit assumptions by implementing error handling. For example:
```r
author_name <- tryCatch(
  html_text(html_element(page, ".author-title")),
  error = function(err) NA_character_
)
```
In our case, an example where this might be important is when no birth place/date information is reported on an author's page. If no such data is reported, the corresponding HTML elements we expect to be there will, in fact, not be there! In this case, `NA` (not available) is the correct return value and character is the expected return type.
Finally, you could also define beforehand what data you want to return and in what format you want to return it. Above, we have returned a data frame with one row and the three character columns `author_name`, `author_born_on`, and `author_born_at`.
We can make this explicit by defining an output object before we execute any other code inside the function body:
```r
out <- data.frame(
  author_name = NA_character_,
  author_born_on = NA_character_,
  author_born_at = NA_character_
)

# verify:
# has 1 row?
nrow(out)

# are all columns of type character?
all(purrr::map_lgl(out, is.character))
```
If we then fail to get a valid response from `read_html(url)`, we can return the default object `out` instead of raising an error:
```r
page <- tryCatch(read_html(url), error = function(err) err)

if (inherits(page, "error")) {
  warning("could not read page: ", url)
  return(out)
}
```
In this way, we can iterate over many URLs without running the risk that a single failed request stops our loop from continuing to iterate. Similarly, we change the data extraction part as follows:
```r
# extract the relevant information and assign it to the columns of `out`
out$author_name <- tryCatch(
  html_text(html_element(page, ".author-title")),
  error = function(err) NA_character_
)
# ... and so on
```

In this way, *if* there is a matching HTML element and *if* it has text, the `NA` value is overwritten. Otherwise, the default `NA_character_` is kept. We can then simply return `out` at the end of the function body.

### The improved function at a glance

```r
scrape_author_page <- function(url) {

  # check inputs
  stopifnot(
    "`url` must be a character value" = is.character(url),
    "`url` must have only one element" = length(url) == 1L,
    "`url` must start with 'http://' or 'https://'" = grepl("^https?://", url)
  )

  # define the default return object
  out <- data.frame(
    author_name = NA_character_,
    author_born_on = NA_character_,
    author_born_at = NA_character_
  )

  # try to read the page
  page <- tryCatch(read_html(url), error = function(err) err)
  if (inherits(page, "error")) {
    warning("could not read page: ", url)
    return(out)
  }

  # try to extract the relevant information
  out$author_name <- tryCatch(
    html_text(html_element(page, ".author-title")),
    error = function(err) NA_character_
  )
  out$author_born_on <- tryCatch(
    html_text(html_element(page, ".author-born-date")),
    error = function(err) NA_character_
  )
  out$author_born_at <- tryCatch(
    html_text(html_element(page, ".author-born-location")),
    error = function(err) NA_character_
  )

  # return data
  return(out)
}
```
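With this function in place, the `lapply()` variant mentioned earlier becomes a short sketch as well (the anonymous wrapper only adds the pause between requests):

```r
# apply the scraper to every author URL, pausing between requests
results <- lapply(author_urls, function(u) {
  Sys.sleep(3)
  scrape_author_page(u)
})

# row-bind the individual data frames
author_data <- do.call(rbind, results)
```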
Fantastic, you're done with this lesson! We will return to similar tasks in the next days, also using `apply()` and other ways of looping. Still, `for`-loops are super practical for many simple scraping tasks!
The more you learn to use loops, functions and apply commands, the easier the scraping will be. In the end, scraping is just a small step in the whole process of getting data, so if you improve your programming skills in R - which is rewarding anyway - you will also get better at scraping.