library(learnr)
library(tutorial.helpers)
library(tidyverse)
library(rvest)

knitr::opts_chunk$set(echo = FALSE)
options(tutorial.exercise.timelimit = 60, 
        tutorial.storage = "local")

html_1 <- minimal_html("test")

html_2 <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

html_3 <- minimal_html("
  <ul>
    <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li>
    <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li>
    <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li>
    <li><b>R4-P17</b> is a <i>droid</i></li>
  </ul>
")

html_4 <- minimal_html("
  <table class='mytable'>
    <tr><th>x</th>   <th>y</th></tr>
    <tr><td>1.5</td> <td>2.7</td></tr>
    <tr><td>4.9</td> <td>1.3</td></tr>
    <tr><td>7.2</td> <td>8.1</td></tr>
  </table>
")

# DK: This code is bad because it requires an internet connection to work. Clean
# it up so that it works regardless.

section <- "https://rvest.tidyverse.org/articles/starwars.html" |> 
  read_html() |> 
  html_elements("section")
This tutorial covers Chapter 24: Web scraping from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. We will learn how to use the rvest package, along with associated functions like read_html() and html_table(), to download and organize data from the web.
Web scraping is a method of extracting data from websites using the R programming language. It can be a valuable skill for students in data-related fields, allowing them to collect and analyze large amounts of data quickly and efficiently.
This section covers the basics of web scraping.
The key package for web scraping with R is rvest. Load the rvest library.
library(rvest)
library(rvest)
The "Web Scraping 101" article from the rvest homepage provides useful background.
Look up the help page for the rvest package using the help()
function. Don't forget to use the package
argument.
help(package = "...")
help(package = "rvest")
The help page for a package provides a listing of all the functions in a package. Note that there are both html_text()
and html_text2()
functions. You should always use html_text2()
unless you have a good reason not to.
Click on the link to the DESCRIPTION file which appeared when you looked up the help page for rvest. Copy the first two lines (Package and Title) below.
question_text(NULL, answer(NULL, correct = TRUE), allow_retry = TRUE, try_again_button = "Edit Answer", incorrect = NULL, rows = 3)
As the DESCRIPTION file mentions, rvest mostly supplies wrappers for the lower-level (but harder to use) xml2 and httr packages. (As always, note how we, and everyone else, use the words "package" and "library" interchangeably.)
minimal_html()
is an rvest function used for teaching. It turns text into an HTML document. Run minimal_html()
with "test" as its first argument.
minimal_html("test")
minimal_html("test")
HTML stands for HyperText Markup Language, the underlying text of the web. The main difference between HTML and plain text is the use of various "elements" and "attributes" which tell your browser how to display the text.
Assign the output of minimal_html()
to an object called html_1
. Then, on the next line, pass html_1
as the argument to the class()
function.
html_1 <- minimal_html("test")
class(html_1)
class(html_1)
HTML documents are an example from the broader class of XML objects. Recall that rvest is mostly a wrapper for the xml2 package, which allows us to work with XML objects.
Use minimal_html()
to create a more complex HTML document by running this code. Then, below this code, add html_2
so that we can examine the object.
html_2 <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")

html_2 <- minimal_html("
  <h1>This is a heading</h1>
  <p id='first'>This is a paragraph</p>
  <p class='important'>This is an important paragraph</p>
")
html_2
html_2
Just typing/returning html_2
produces the same result as print(html_2)
. By design, the print()
function should provide the key information about the object being printed. The first line tells us that html_2
is an HTML document. The two listed elements ("head" and "body") are the two key parts of every HTML document. The "body" is the more important part since it contains all the information.
We have added html_2
to your workspace so that you can use it to start pipes. Pipe html_2
to html_elements("p")
.
html_2 |> html_elements("...")
html_2 |> html_elements("p")
html_elements() pulls out all the elements that match the selector, which is provided to the css argument. "Elements" consist of a start tag (e.g. <p>), optional attributes (e.g. id='first'), and an end tag (like </p>). The "contents" of an element are everything in between the start and end tags.

Since there are two elements with the "p" tag, where "p" is for "paragraph," the result from html_elements() is those two elements. The element with the "h1" tag is not included.
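Elements can also carry useful data in their attributes. As a quick sketch (not one of the original exercises), rvest's html_attr() pulls out attribute values:

# Pull the id attribute from each <p> element. The second paragraph
# has a class but no id, so html_attr() returns NA for it.
html_2 |> 
  html_elements("p") |> 
  html_attr("id")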
Pipe html_2
to html_elements(".important")
. Note that the css
argument is ".important" --- with a leading dot --- even though the attribute is "important" without a dot.
html_2 |> html_elements(".important")
html_2 |> html_elements(".important")
When using html_elements() to select nodes on the basis of the class attribute rather than a tag, we need a leading dot. Tags, on the other hand, are provided as-is. Notice that the result of a call to html_elements() is an "xml_nodeset" rather than a full HTML document.
Pipe html_2
to html_elements("#first")
. Note that the css
argument is "#first" --- with a leading hash mark --- even though the attribute, an id in this case, is "first" without a hash mark
html_2 |> html_elements("...")
html_2 |> html_elements("#first")
Id attributes must be unique within a document, so this will only ever select a single element.
CSS is short for cascading style sheets, and is a tool for defining the visual styling of HTML documents. CSS includes a miniature language for selecting elements on a page called CSS selectors. CSS selectors define patterns for locating HTML elements, and are useful for scraping because they provide a concise way of describing which elements you want to extract.
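As a small illustration of this mini-language, selectors can be combined. For example, "p.important" matches only <p> elements whose class is "important" (a sketch, not one of the exercises):

# The tag and the class must both match, so the <h1> and the
# <p id='first'> elements are excluded.
html_2 |> html_elements("p.important")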
Let's consider another piece of html. Run this code and then, below it, add print(html_3)
.
html_3 <- minimal_html(" <ul> <li><b>C-3PO</b> is a <i>droid</i> that weighs <span class='weight'>167 kg</span></li> <li><b>R2-D2</b> is a <i>droid</i> that weighs <span class='weight'>96 kg</span></li> <li><b>Yoda</b> weighs <span class='weight'>66 kg</span></li> <li><b>R4-P17</b> is a <i>droid</i></li> </ul> ")
...
print(html_3)
print(html_3)
Notice how the "body" portion of the html_document includes the raw html text, especially the nested list. The <ul>
tag starts an unordered list. The <li>
tag starts an ordered list.
Pipe html_3
to html_elements("li")
in order to make a vector where each element corresponds to a different character.
html_3 |> html_elements("..")
html_3 |> html_elements("li")
There’s an important difference between html_element()
and html_elements()
when you use a selector that doesn’t match any elements. html_elements()
returns a vector of length 0, whereas html_element()
returns a missing value.
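The Star Wars HTML above makes this concrete. As a sketch: R4-P17's list item contains no element with class "weight", so html_element() returns a missing value for it, keeping the results aligned with the characters:

# One result per <li>; the last is NA because R4-P17 has no
# <span class='weight'> element.
html_3 |> 
  html_elements("li") |> 
  html_element(".weight") |> 
  html_text2()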
Continue the pipe to html_element("b")
, which will pull out the elements from the list which are tagged with "b", for bold. Note that this is html_element()
, not html_elements()
.
... |> html_element("b")
html_3 |> html_elements("li") |> html_element("b")
We are using clues in the HTML formatting --- bolding, in this case --- to determine something substantive: the names of the characters. Nothing guarantees that these two things are connected. They don't have to be! And, if the HTML document itself is sloppy or inconsistent, they won't be.
You need to look closely at your results as you build your pipe, step-by-step.
Continue the pipe to html_text2()
.
... |> html_text2()
html_3 |> html_elements("li") |> html_element("b") |> html_text2()
The last step of a web scrape is often html_text2()
, since we (almost always) want to get rid of any weird formatting.
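A natural next step, sketched here, is to combine such pipes into a tibble with one row per character:

# Both columns come from the same html_elements("li") backbone, so
# the names and weights line up row by row.
tibble(
  name = html_3 |> 
    html_elements("li") |> 
    html_element("b") |> 
    html_text2(),
  weight = html_3 |> 
    html_elements("li") |> 
    html_element(".weight") |> 
    html_text2()
)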
The last web scraping trick we cover involves how to handle html tables. HTML tables are built up from four main elements: <table>
, <tr>
(table row), <th>
(table heading), and <td>
(table data).
Run this code to create a simple HTML table with two columns and three rows. Then, add print(html_4)
to the end.
html_4 <- minimal_html("
  <table class='mytable'>
    <tr><th>x</th>   <th>y</th></tr>
    <tr><td>1.5</td> <td>2.7</td></tr>
    <tr><td>4.9</td> <td>1.3</td></tr>
    <tr><td>7.2</td> <td>8.1</td></tr>
  </table>
")

...
print(html_4)
print(html_4)
Note how the <body>
section of the html_document begins with the <table>
element. Also, note how the class of the table object is "mytable".
Pipe html_4
to html_element(".mytable").
html_4 |> html_element("..")
html_4 |> html_element(".mytable")
We are using the fact that the object has class "mytable." Recall that, when accessing objects using their class, you must place a .
in front of the name of the class.
Continue the pipe to html_table()
.
... |> html_table()
html_4 |> html_element(".mytable") |> html_table()
You won't always be lucky enough to have the data you want in a table, but, if you are, html_table()
is magic.
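By default, html_table() guesses the column types. If that guessing ever misbehaves, its convert argument turns it off; a sketch:

# With convert = FALSE, the x and y columns stay as character
# instead of being parsed into doubles.
html_4 |> 
  html_element(".mytable") |> 
  html_table(convert = FALSE)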
With tools like html_element(s)()
, html_text2()
and html_table()
, you are ready to deal with HTML documents from the wild.
Consider two examples of material we might want to scrape. The first is information about 7 Star Wars films from a vignette in the rvest package. The second involves the 250 highest rated films from IMDB. In this section, we will scrape and organize information from both.
Examine the webpage for the Star Wars films. The URL is https://rvest.tidyverse.org/articles/starwars.html
. Pipe this URL to read_html()
. (Be sure to surround the URL with quotation marks.)
"https://rvest.tidyverse.org/articles/starwars.html" |> read_..()
Before we go much further, we should discuss whether it is legal and ethical to scrape the web. Overall, the situation is complicated. Legalities depend a lot on where you live. However, as a general rule of thumb, if the data is public, non-personal, and factual, you're most likely OK. These three factors matter because they connect to the site's terms and conditions, personally identifiable information, and copyright. If those conditions do not hold, or if you're scraping the web to make money, it's a good idea to talk to a lawyer. In any case, be respectful of the resources of the server hosting the page(s). This means that if you're scraping many pages, you should wait a bit between each request.
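One simple way to honor that rule is to pause between requests with Sys.sleep(). A sketch, using hypothetical URLs:

# Hypothetical URLs, for illustration only. Sys.sleep(1) pauses for
# one second between requests so we don't overload the server.
urls <- c("https://example.com/page1", "https://example.com/page2")
pages <- map(urls, \(url) {
  Sys.sleep(1)
  read_html(url)
})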
The structure of the underlying HTML looks like this:
<section>
  <h2 data-id="1">The Phantom Menace</h2>
  <p>Released: 1999-05-19</p>
  <p>Director: <span class="director">George Lucas</span></p>
  <div class="crawl">
    <p>...</p>
    <p>...</p>
    <p>...</p>
  </div>
</section>
Continue the pipe with html_elements("section")
.
... |> html_elements("..")
We almost always begin scraping by creating an HTML document which can then be manipulated with functions like html_elements()
. Note that html_elements()
is not the same thing as html_element()
, without the "s".
Continue the pipe with html_element("h2")
. We want to pull out just the title of each movie. We can see that the titles are tagged with "h2".
... |> html_element("h2")
Note that we used html_element()
in this example, as opposed to html_elements()
(with the "s") last time. In this case, both functions produce the same answer.
Continue the pipe with html_text2()
. We almost always want to clean up the underlying text before making use of it. Never use html_text()
(without the "2") unless you have a good reason to.
... |> html_...()
html_text2() differs from html_text() slightly. While html_text() collapses all the text into a single string, html_text2() preserves the white space between elements, allowing you to retain the structure and formatting of the text as it appears on the web page.
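A small sketch makes the difference visible. The <br> tag is an HTML line break, which html_text() drops but html_text2() turns into a newline (html_br is just a throwaway example object):

# html_text() collapses the text; html_text2() keeps the break.
html_br <- minimal_html("<p>one<br>two</p>")
html_br |> html_element("p") |> html_text()   # "onetwo"
html_br |> html_element("p") |> html_text2()  # "one\ntwo"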
Place the above chunk of code into the middle of a call to tibble()
in which you are creating a tibble with one variable, title
, which is the title of each movie. The resulting tibble should have one column and seven rows.
...(
  title = "https://rvest.tidyverse.org/articles/starwars.html" |> 
    read_html() |> 
    html_elements("section") |> 
    html_elements("h2") |> 
    html_text2()
)
A tibble is a data frame, but it tweaks some older behaviors to make life easier. R is an old language, and some features useful then now get in your way. The tibble package allows you to make data frames in a more streamlined and cleaner way compared to R's traditional data frames.
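For instance, this sketch builds a small tibble by hand; note the compact printing, with dimensions and column types in the header:

# A tibble reports its size and column types when printed, and never
# converts character columns to factors.
tibble(x = 1:3, y = c("a", "b", "c"))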
1) "https://rvest.tidyverse.org/articles/starwars.html"
2) ... |> read_html()
3) ... |> html_elements("section")
4) Behind the scenes, we have created an object named section
which is the result of this pipe. Type section
in the exercise block and hit "Run Code."
Now we're going to learn how to streamline this process. Copy "https://rvest.tidyverse.org/articles/starwars.html" into the code box and hit "Run Code."
"https://rvest.tidyverse.org/articles/starwars.html"
"https://rvest.tidyverse.org/articles/starwars.html"
Rather than re-running the whole pipe from the web page every time we want to scrape something, we can make an object to hold the result of the pipe, which we can then call whenever needed.
Pipe this to read_html()
.
"https://rvest.tidyverse.org/articles/starwars.html" |> read_html()
"https://rvest.tidyverse.org/articles/starwars.html" |> read_html()
Notice that the steps so far are the same as before, but fewer steps will be needed once we store the result in an object.
Continue that pipe to html_elements()
with "section" being its argument.
"https://rvest.tidyverse.org/articles/starwars.html" |> read_html() |> html_elements(...)
"https://rvest.tidyverse.org/articles/starwars.html" |> read_html() |> html_elements("section")
Behind the scenes, we have created an object named section
which is the result of this pipe. Type section
in the exercise block and hit "Run Code."
section
section
Creating an object like this is helpful when you need to web scrape the same page multiple times. It allows you to save on typing and makes the code easier to read and process.
Now that we have an object, we can use it for our tibble. Copy the code from above which created a tibble with one column, title
. Change the code so that everything from "https://rvest.tidyverse.org/articles/starwars.html" through html_elements("section") is replaced with just section.
tibble(title = s... |> html_elements("h2") |> html_text2())
tibble(title = section |> html_elements("h2") |> html_text2())
We've now created a tibble that's identical to the one above, but it was easier to write, and the section object will save us time if we need to scrape the same page again.
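The same section object can start other pipes. As a sketch, the directors are tagged with class "director" in the HTML structure shown earlier, so we can build a two-column tibble:

# One row per film: titles from the <h2> tags, directors from the
# <span class='director'> elements.
tibble(
  title = section |> html_element("h2") |> html_text2(),
  director = section |> html_element(".director") |> html_text2()
)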
Consider the highest rated movies on IMDb, the Internet Movie Database. The URL is https://web.archive.org/web/20200101054123/https://www.imdb.com/chart/top/
. Pipe that URL (with quotation marks) to read_html()
.
"https://web.archive.org/web/20200101054123/https://www.imdb.com/chart/top/" |> ...
Note that read_html()
takes longer to run than the previous example because the underlying webpage is much more complex.
The key element is a table, which is often the case with web scraping exercises. Continue the pipe to html_element(".chart")
.
... |> html_element("...")
The result is no longer an "html_document". Instead, it is an "html_node".
Data is normally stored in an HTML table, and it's just a matter of reading it from that table. Recall that HTML tables are built up from four main elements: <table>, <tr> (table row), <th> (table heading), and <td> (table data). rvest provides a function to read this data: html_table(), which returns a list holding one tibble for each table found on the page. You can use html_element() to identify the table you want to extract.
Continue the pipe to html_table().

... |> html_table()

The result is a tibble with 250 rows and 5 columns. But, as you can see, the column names are a mess, as are the values in many of the cells. We need to clean it up.

Continue the pipe with a call to select(), renaming `Rank & Title` to rank_title_year and `IMDb Rating` to rating.

... |> 
  select(
    rank_title_year = `Rank & Title`, 
    rating = `IMDb Rating`
  )
Continue the pipe with a call to mutate(), using str_squish() to remove extra white space within rank_title_year.

... |> 
  mutate(
    rank_title_year = str_squish(rank_title_year)
  )
Continue the pipe with another call to mutate(), creating a new variable rank with parse_number().

... |> mutate(rank = parse_number(rank_title_year))

This approach depends on the fact that parse_number() pulls out the first number in a string, which here is the rank at the start of each entry.

Continue the pipe, using str_extract() within mutate() to pull the title out of rank_title_year.

... |> 
  mutate(title = str_extract(rank_title_year, 
                             pattern = "(?<=\\d\\.\\s)(.*)(?=\\s\\(\\d{4}\\))"))

We asked ChatGPT for a regular expression to do this, and it gave back the expression above: a lookbehind for the rank digits, a period, and a space, followed by a lookahead for the parenthesized four-digit year, with the title matched in between.
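Putting the steps together, here is a sketch of the complete pipeline assembled from the exercises above:

# Scrape the archived IMDb top-250 page, pull out the ratings table,
# and clean up the rank, title, and rating columns.
"https://web.archive.org/web/20200101054123/https://www.imdb.com/chart/top/" |> 
  read_html() |> 
  html_element(".chart") |> 
  html_table() |> 
  select(
    rank_title_year = `Rank & Title`, 
    rating = `IMDb Rating`
  ) |> 
  mutate(rank_title_year = str_squish(rank_title_year)) |> 
  mutate(rank = parse_number(rank_title_year)) |> 
  mutate(title = str_extract(rank_title_year, 
                             pattern = "(?<=\\d\\.\\s)(.*)(?=\\s\\(\\d{4}\\))"))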
Use AI to solve any regular expression problem that takes longer than 30 seconds.

Complete the pipe.

You now understand web scraping.

Summary

This tutorial covered Chapter 24: Web scraping from R for Data Science (2e) by Hadley Wickham, Mine Çetinkaya-Rundel, and Garrett Grolemund. MDN Web Docs describe just about every aspect of web programming.