library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(rvest)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, 
        tutorial.storage = "local")

html_1 <- minimal_html("test")

html_2 <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

html_3 <- minimal_html("
  <ul>
    <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
    <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
    <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
    <li><b>R4-P17</b> is a <i>droid</i></li>
  </ul>
")

html_4 <- minimal_html("
  <table class='mytable'>
    <tr><th>x</th>   <th>y</th></tr>
    <tr><td>1.5</td> <td>2.7</td></tr>
    <tr><td>4.9</td> <td>1.3</td></tr>
    <tr><td>7.2</td> <td>8.1</td></tr>
  </table>
")

# DK: This code is bad because it requires an internet connection to work. Clean
# it up so that it works regardless.

section <- "https://rvest.tidyverse.org/articles/starwars.html" |> 
  read_html() |> 
  html_elements("section")
This tutorial covers Chapter 24: Web scraping from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We will learn how to use the rvest package, along with associated functions like read_html() and html_table(), to download and organize data from the web.
Web scraping is a method of extracting data from websites using the R programming language. It can be a valuable skill for students in data-related fields, allowing them to collect and analyze large amounts of data quickly and efficiently.
This section covers the basics of web scraping.
The key package for web scraping with R is rvest. Load the rvest library.
library(rvest)
library(rvest)
The "Web Scraping 101" article from the rvest homepage provides useful background.
Look up the help page for the rvest package using the help()
function. Don't forget to use the package
argument.
help(package = "...")
help(package = "rvest")
The help page for a package provides a listing of all the functions in a package. Note that there are both html_text()
and html_text2()
functions. You should always use html_text2()
unless you have a good reason not to.
Click on the link to the DESCRIPTION file which appeared when you looked up the help page for rvest. Copy the first two lines (Package and Title) below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
As the DESCRIPTION file mentions, rvest mostly supplies wrappers for the lower-level (but harder to use) xml2 and httr packages. (As always, note how we, and everyone else, use the words "package" and "library" interchangeably.)
minimal_html()
is an rvest function used for teaching. It turns text into an HTML document. Run minimal_html()
with "test" as its first argument.
minimal_html("test")
minimal_html("test")
HTML stands for HyperText Markup Language, the underlying text of the web. The main difference between HTML and plain text is the use of various "elements" and "attributes" which tell your browser how to display the text.
Assign the output of minimal_html()
to an object called html_1
. Then, on the next line, pass html_1
as the argument to the class()
function.
html_1 <- minimal_html("test")
class(html_1)
class(html_1)
HTML documents are an example from the broader class of XML objects. Recall that rvest is mostly a wrapper for the xml2 package, which allows us to work with XML objects.
Use minimal_html()
to create a more complex HTML document by running this code. Then, below this code, add html_2
so that we can examine the object.
html_2 <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

html_2 <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")
html_2
html_2
Just typing/returning html_2
produces the same result as print(html_2)
. By design, the print()
function should provide the key information about the object being printed. The first line tells us that html_2
is an HTML document. The two listed elements ("head" and "body") are the two key parts of every HTML document. The "body" is the more important part since it contains all the information.
We have added html_2
to your workspace so that you can use it to start pipes. Pipe html_2
to html_elements("p")
.
html_2 |> html_elements("...")
html_2 |> html_elements("p")
html_elements() pulls out all the elements that match the selector, which is provided to the css argument. "Elements" consist of a start tag (e.g. <p>), optional attributes (e.g. id='first'), and an end tag (like </p>). The "contents" of an element are everything in between the start and end tags.

Since there are two elements with the "p" tag, where "p" is for "paragraph," the result from html_elements() is those two elements. The element with the "h1" tag is not included.
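Elements can also carry useful data in their attributes. As a quick sketch (not one of the original exercises), rvest's html_attr() pulls out attribute values:

# Pull the id attribute from each <p> element. The second paragraph
# has a class but no id, so html_attr() returns NA for it.
html_2 |> 
  html_elements("p") |> 
  html_attr("id")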
Pipe html_2
to html_elements(".important")
. Note that the css
argument is ".important" --- with a leading dot --- even though the attribute is "important" without a dot.
html_2 |> html_elements(".important")
html_2 |> html_elements(".important")
When using html_elements() to select nodes on the basis of the class attribute rather than a tag, we need a leading dot. Tags, on the other hand, are provided as-is. Notice that the result of a call to html_elements() is an "xml_nodeset" rather than a full HTML document.
Pipe html_2
to html_elements("#first")
. Note that the css
argument is "#first" --- with a leading hash mark --- even though the attribute, an id in this case, is "first" without a hash mark
html_2 |> html_elements("...")
html_2 |> html_elements("#first")
Id attributes must be unique within a document, so this will only ever select a single element.
CSS is short for cascading style sheets, and is a tool for defining the visual styling of HTML documents. CSS includes a miniature language for selecting elements on a page called CSS selectors. CSS selectors define patterns for locating HTML elements, and are useful for scraping because they provide a concise way of describing which elements you want to extract.
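As a small illustration of this mini-language, selectors can be combined. For example, "p.important" matches only <p> elements whose class is "important" (a sketch, not one of the exercises):

# The tag and the class must both match, so the <h1> and the
# <p id='first'> elements are excluded.
html_2 |> html_elements("p.important")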
Let's consider another piece of html. Run this code and then, below it, add print(html_3)
.
html_3 <- minimal_html(" <ul> <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li> <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li> <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li> <li><b>R4-P17</b> is a <i>droid</i></li> </ul> ")
...
print(html_3)
print(html_3)
Notice how the "body" portion of the html_document includes the raw html text, especially the nested list. The <ul>
tag starts an unordered list. The <li>
tag starts an ordered list.
Pipe html_3
to html_elements("li")
in order to make a vector where each element corresponds to a different character.
html_3 |> html_elements("..")
html_3 |> html_elements("li")
There’s an important difference between html_element()
and html_elements()
when you use a selector that doesn’t match any elements. html_elements()
returns a vector of length 0, whereas html_element()
returns a missing value.
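The Star Wars HTML above makes this concrete. As a sketch: R4-P17's list item contains no element with class "weight", so html_element() returns a missing value for it, keeping the results aligned with the characters:

# One result per <li>; the last is NA because R4-P17 has no
# <span class='weight'> element.
html_3 |> 
  html_elements("li") |> 
  html_element(".weight") |> 
  html_text2()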
Continue the pipe to html_element("b")
, which will pull out the elements from the list which are tagged with "b", for bold. Note that this is html_element()
, not html_elements()
.
... |> html_element("b")
html_3 |> html_elements("li") |> html_element("b")
We are using clues in the HTML formatting --- bolding, in this case --- to determine something substantive: the names of the characters. Nothing guarantees that these two things are connected. They don't have to be! And, if the HTML document itself is sloppy or inconsistent, they won't be.
You need to look closely at your results as you build your pipe, step-by-step.
Continue the pipe to html_text2()
.
... |> html_text2()
html_3 |> html_elements("li") |> html_element("b") |> html_text2()
The last step of a web scrape is often html_text2()
, since we (almost always) want to get rid of any weird formatting.
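A natural next step, sketched here, is to combine such pipes into a tibble with one row per character:

# Both columns come from the same html_elements("li") backbone, so
# the names and weights line up row by row.
tibble(
  name = html_3 |> 
    html_elements("li") |> 
    html_element("b") |> 
    html_text2(),
  weight = html_3 |> 
    html_elements("li") |> 
    html_element(".weight") |> 
    html_text2()
)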
The last web scraping trick we cover involves how to handle html tables. HTML tables are built up from four main elements: <table>
, <tr>
(table row), <th>
(table heading), and <td>
(table data).
Run this code to create a simple HTML table with two columns and three rows. Then, add print(html_4)
to the end.
html_4 <- minimal_html("
  <table class='mytable'>
    <tr><th>x</th>   <th>y</th></tr>
    <tr><td>1.5</td> <td>2.7</td></tr>
    <tr><td>4.9</td> <td>1.3</td></tr>
    <tr><td>7.2</td> <td>8.1</td></tr>
  </table>
")

...
print(html_4)
print(html_4)
Note how the <body>
section of the html_document begins with the <table>
element. Also, note how the class of the table object is "mytable".
Pipe html_4
to html_element(".mytable").
html_4 |> html_element("..")
html_4 |> html_element(".mytable")
We are using the fact that the object has class "mytable." Recall that, when accessing objects using their class, you must place a .
in front of the name of the class.
Continue the pipe to html_table()
.
... |> html_table()
html_4 |> html_element(".mytable") |> html_table()
You won't always be lucky enough to have the data you want in a table, but, if you are, html_table()
is magic.
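By default, html_table() guesses the column types. If that guessing ever misbehaves, its convert argument turns it off; a sketch:

# With convert = FALSE, the x and y columns stay as character
# instead of being parsed into doubles.
html_4 |> 
  html_element(".mytable") |> 
  html_table(convert = FALSE)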
With tools like html_element(s)()
, html_text2()
and html_table()
, you are ready to deal with HTML documents from the wild.
Consider two examples of material we might want to scrape. The first is information about 7 Star Wars films from a vignette in the rvest package. The second involves the 250 highest rated films from IMDB. In this section, we will scrape and organize information from both.
Examine the webpage for the Star Wars films. The URL is https://rvest.tidyverse.org/articles/starwars.html
. Pipe this URL to read_html()
. (Be sure to surround the URL with quotation marks.)
"https://rvest.tidyverse.org/articles/starwars.html" |> read_..()
Before we go much further, we should discuss whether it is legal and ethical to scrape the web. Overall, the situation is complicated. Legalities depend a lot on where you live. However, as a general rule of thumb, if the data is public, non-personal, and factual, you're most likely OK. These three factors matter because they connect to the site's terms and conditions, personally identifiable information, and copyright. If those conditions do not hold, or if you're scraping the web to make money, it's a good idea to talk to a lawyer. In any case, be respectful of the resources of the server hosting the page(s). This means that if you're scraping many pages, you should wait a bit between each request.
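One simple way to honor that rule is to pause between requests with Sys.sleep(). A sketch, using hypothetical URLs:

# Hypothetical URLs, for illustration only. Sys.sleep(1) pauses for
# one second between requests so we don't overload the server.
urls <- c("https://example.com/page1", "https://example.com/page2")
pages <- map(urls, \(url) {
  Sys.sleep(1)
  read_html(url)
})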
The structure of the underlying HTML looks like this:
<section>
  <h2 data-id="1">The Phantom Menace</h2>
  <p>Released: 1999-05-19</p>
  <p>Director: <span class="director">George Lucas</span></p>
  <div class="crawl">
    <p>...</p>
    <p>...</p>
    <p>...</p>
  </div>
</section>
Continue the pipe with html_elements("section")
.
... |> html_elements("..")
We almost always begin scraping by creating an HTML document which can then be manipulated with functions like html_elements()
. Note that html_elements()
is not the same thing as html_element()
, without the "s".
Continue the pipe with html_element("h2")
. We want to pull out just the title of each movie. We can see that the titles are tagged with "h2".
... |> html_element("h2")
Note that we used html_element()
in this example, as opposed to html_elements()
(with the "s") last time. In this case, both functions produce the same answer.
Continue the pipe with html_text2()
. We almost always want to clean up the underlying text before making use of it. Never use html_text()
(without the "2") unless you have a good reason to.
... |> html_...()
html_text2() differs from html_text() slightly. While html_text() collapses all the text into a single string, html_text2() preserves the white space between elements, allowing you to retain the structure and formatting of the text as it appears on the web page.
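A small sketch makes the difference visible. The <br> tag is an HTML line break, which html_text() drops but html_text2() turns into a newline (html_br is just a throwaway example object):

# html_text() collapses the text; html_text2() keeps the break.
html_br <- minimal_html("<p>one<br>two</p>")
html_br |> html_element("p") |> html_text()   # "onetwo"
html_br |> html_element("p") |> html_text2()  # "one\ntwo"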
Place the above chunk of code into the middle of a call to tibble()
in which you are creating a tibble with one variable, title
, which is the title of each movie. The resulting tibble should have one column and seven rows.
...(
  title = "https://rvest.tidyverse.org/articles/starwars.html" |> 
    read_html() |> 
    html_elements("section") |> 
    html_elements("h2") |> 
    html_text2()
)
A tibble is a data frame, but it tweaks some older behaviors to make life easier. R is an old language, and some features useful then now get in your way. The tibble package allows you to make data frames in a more streamlined and cleaner way compared to R's traditional data frames.
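For instance, this sketch builds a small tibble by hand; note the compact printing, with dimensions and column types in the header:

# A tibble reports its size and column types when printed, and never
# converts character columns to factors.
tibble(x = 1:3, y = c("a", "b", "c"))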
1) "https://rvest.tidyverse.org/articles/starwars.html"
2) ... |> read_html()
3) ... |> html_elements("section")
4) Behind the scenes, we have created an object named section
which is the result of this pipe. Type section
in the exercise block and hit "Run Code."
Now we're going to learn how to streamline this process. Copy "https://rvest.tidyverse.org/articles/starwars.html" into the code box and hit "Run Code."
"https://rvest.tidyverse.org/articles/starwars.html"
"https://rvest.tidyverse.org/articles/starwars.html"
Rather than re-running the whole pipe from the web page every time we want to scrape something, we can make an object to hold the result of the pipe, which we can then call whenever needed.
Pipe this to read_html()
.
"https://rvest.tidyverse.org/articles/starwars.html" |> read_html()
"https://rvest.tidyverse.org/articles/starwars.html" |> read_html()
Notice that the steps so far are the same as before, but fewer steps will be needed once we store the result in an object.
Continue that pipe to html_elements()
with "section" being its argument.
"https://rvest.tidyverse.org/articles/starwars.html" |> read_html() |> html_elements(...)
"https://rvest.tidyverse.org/articles/starwars.html" |> read_html() |> html_elements("section")
Behind the scenes, we have created an object named section
which is the result of this pipe. Type section
in the exercise block and hit "Run Code."
section
section
Creating an object like this is helpful when you need to web scrape the same page multiple times. It allows you to save on typing and makes the code easier to read and process.
Now that we have an object, we can use it for our tibble. Copy the code from above which created a tibble with one column, title
. Change the code so that everything from "https://rvest.tidyverse.org/articles/starwars.html" through html_elements("section") is replaced with just section.
tibble(title = s... |> html_elements("h2") |> html_text2())
tibble(title = section |> html_elements("h2") |> html_text2())
We've now created a tibble that's identical to the one above, but it was easier to write, and the section object will save us time if we need to scrape the same page again.
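The same section object can start other pipes. As a sketch, the directors are tagged with class "director" in the HTML structure shown earlier, so we can build a two-column tibble:

# One row per film: titles from the <h2> tags, directors from the
# <span class='director'> elements.
tibble(
  title = section |> html_element("h2") |> html_text2(),
  director = section |> html_element(".director") |> html_text2()
)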
Consider the highest rated movies on IMDb, the Internet Movie Database. The URL is https://web.archive.org/web/20200101054123/https://www.imdb.com/chart/top/
. Pipe that URL (with quotation marks) to read_html()
.
"https://web.archive.org/web/20200101054123/https://www.imdb.com/chart/top/" |> ...
Note that read_html()
takes longer to run than the previous example because the underlying webpage is much more complex.
The key element is a table, which is often the case with web scraping exercises. Continue the pipe to html_element(".chart")
.
... |> html_element("...")
The result is no longer an "html_document". Instead, it is an "html_node".
Data is normally stored in an HTML table, and it's just a matter of reading it from that table. Recall that HTML tables are built up from four main elements: <table>, <tr> (table row), <th> (table heading), and <td> (table data). rvest provides a function to read this data: html_table(), which returns a list holding one tibble for each table found on the page. You can use html_element() to identify the table you want to extract.
Continue the pipe to html_table().

... |> html_table()

The result is a tibble with 250 rows and 5 columns. But, as you can see, the column names are a mess, as are the values in many of the cells. We need to clean it up.

Continue the pipe with a call to select(), renaming `Rank & Title` to rank_title_year and `IMDb Rating` to rating.

... |> 
  select(
    rank_title_year = `Rank & Title`, 
    rating = `IMDb Rating`
  )
Continue the pipe with a call to mutate(), using str_squish() to remove extra white space within rank_title_year.

... |> 
  mutate(
    rank_title_year = str_squish(rank_title_year)
  )
Continue the pipe with another call to mutate(), creating a new variable rank with parse_number().

... |> mutate(rank = parse_number(rank_title_year))

This approach depends on the fact that parse_number() pulls out the first number in a string, which here is the rank at the start of each entry.

Continue the pipe, using str_extract() within mutate() to pull the title out of rank_title_year.

... |> 
  mutate(title = str_extract(rank_title_year, 
                             pattern = "(?<=\\d\\.\\s)(.*)(?=\\s\\(\\d{4}\\))"))

We asked ChatGPT for a regular expression to do this, and it gave back the expression above: a lookbehind for the rank digits, a period, and a space, followed by a lookahead for the parenthesized four-digit year, with the title matched in between.
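Putting the steps together, here is a sketch of the complete pipeline assembled from the exercises above:

# Scrape the archived IMDb top-250 page, pull out the ratings table,
# and clean up the rank, title, and rating columns.
"https://web.archive.org/web/20200101054123/https://www.imdb.com/chart/top/" |> 
  read_html() |> 
  html_element(".chart") |> 
  html_table() |> 
  select(
    rank_title_year = `Rank & Title`, 
    rating = `IMDb Rating`
  ) |> 
  mutate(rank_title_year = str_squish(rank_title_year)) |> 
  mutate(rank = parse_number(rank_title_year)) |> 
  mutate(title = str_extract(rank_title_year, 
                             pattern = "(?<=\\d\\.\\s)(.*)(?=\\s\\(\\d{4}\\))"))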
Use AI to solve any regular expression problem that takes longer than 30 seconds.

Complete the pipe.

You now understand web scraping.

Summary

This tutorial covered Chapter 24: Web scraping from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. MDN Web Docs describe just about every aspect of web programming.