knitr::opts_chunk$set(
  # code chunk options
  echo = FALSE
  , eval = TRUE
  , warning = FALSE
  , message = FALSE
  , cached = FALSE 
  , exercise = TRUE
  , context = "data"
)
library(learnr)
library(dplyr)
# library(learn2scrape)
library(rvest)

Introduction

In this tutorial, we show some important advanced web scraping techniques.

  1. Using sessions
  2. Interacting with HTML forms
  3. Logging in
  4. User agents

Using session

Thus far, we have relied on xml2::read_html() to read and parse websites HTML code. This is fine for many purposes. But a more flexible way to interact with websites is through a Browser session. This has many advantages

Sessions an be created in rvest with the session() function (html_session() for rvest < 2.0.0). The only required parameter is url --- the address of the webpage to request.

Additional (optional) parameters can be passed to ..., however, and they are forwarded to httr::GET(). This makes configuring rvest sessions particularly easy. Some of the most important features are

More on this later.

First, have a look at this page which we will scrape in the example.

url <- "https://scrapethissite.com/pages/simple/"
sess <- session(url)
url <- "https://scrapethissite.com/pages/simple/"
sess <- session(url)

Extra: A peek under the hood of rvest::session()

What does rvest::session()actually do? The documentation (?rvest::session) is not particularly informative. But we can learn something by looking at the function's source code:

It first creates a 'session' object calling the base R structure() function:

session <- structure(
  list(
    # passed to argument 'handle' when calling httr::GET
    handle = httr::handle(url)
    # passed to argument 'config' when calling httr::GET
    , config = c(..., httr::config(autoreferer = 1L))
    # placeholder for URL that will be queried by rvest:::session_get
    , url = NULL
    , back = character()
    , forward = character()
    # placeholder for response objected that will be returned by httr::GET
    , response = NULL
    # ignore this
    , html = new.env(parent = emptyenv(), hash = FALSE)
  )
  , class = "session"
)

Next, it passes the 'session' object and the input URL to an internal function:

rvest:::session_get(session, url)

This function makes a GET request using the httr package (see ?httr::GET). Specifically,

  1. it passes the URL to httr::GET()'s url argument
  2. it passes the 'config' element of 'session' to httr::GET()'s config argument
  3. it passes the 'handle' element of 'session' to httr::GET()'s handle argument
  4. it makes the get request
  5. it parses the response and adds information to the 'session' input object, which is returned enhanced with information like GET request response status

So rvest::session() is basically a wrapper around httr::GET() that eases interaction with request and reponse objects.


If we inspect the return object assigned to 'sess', we see that it is a 'rvest_session' object. Objects of this class containing httr 'handle', 'request' and 'response' objects. In addition, 'rvest_session' objects record the current URL and the current pages HTML code (in the 'cache' element).

url <- "https://scrapethissite.com/pages/simple/"
sess <- session(url)

sess
str(sess, 1)
sess$response

because the (parsed) HTML code of the requested page is associated with the 'rvest_session' object, we can call any rvest function on it as we would do on the html_document returned by xml2::read_html(). For example:

url <- "https://scrapethissite.com/pages/simple/"
sess <- session(url)

sess %>% 
  html_elements(".country-name") %>% 
  html_text(trim = TRUE) %>% 
  tibble(country_name = .)

Navigating

But session objects also allow navigating. To navigate to a new page, simply call session_jump_to() function on the session object and the target URL.

The session object then will keep track of the current URL and all previously visited URLs. They can be reported by calling session_history().

And because the session object 'remembers' the previous URL, we can go not only forward, but also backward.

url <- "https://scrapethissite.com/pages/simple/"
sess <- session(url)

# move forward
sess <- session_jump_to(sess, "https://www.scrapethissite.com/pages/forms")

# inspect history
sess$url
sess$back
session_history(sess)

# back to first page
sess <- session_back(sess)
sess$url

# and again back to second page
sess <- session_forward(sess)
sess$url

Interacting with HTML forms

HTML forms allow collecting user input. The page https://scrapethissite.com/pages/forms/ shows a simple example of a form --- in this case a search bar. The search bar allows users entering key words to filter the data by that is shown in the table below the search bar.

Let's use this example to see how to interact with forms using rvest.

url <- "https://scrapethissite.com/pages/forms/"
sess <- session(url)
url <- "https://scrapethissite.com/pages/forms/"
sess <- session(url)

Extracting forms

Forms included in a page can be extracted with the html_form() function. This function returns a list of 'rvest_form' objects.

url <- "https://scrapethissite.com/pages/forms/"
sess <- session(url)

forms <- html_form(sess)
str(forms, 1)

Note: if you apply html_form() directly to a 'form' web element (instead of a list of web elements as returned by html_element()), then this will return the 'rvest_form' object directly instead of in a list.

To interact with a specific form, we extract it from the list. In this example, there is only one form that corresponds to the search bar displayed in the top of the page.

url <- "https://scrapethissite.com/pages/forms/"
sess <- session(url)

forms <- html_form(sess)
a_form <- forms[[1]]
str(a_form, 1)
url <- "https://scrapethissite.com/pages/forms/"
sess <- session(url)
a_form <- html_form(sess)[[1]]

Structure of form objects

If the form is named, its name can be accessed in the 'name' element. The 'method' element notes the HTTP method. Most important is the 'fields' element. It records the fields of the form.

a_form$fields

Each element is a 'form_field' object with four elements:

a_field <- a_form$fields[[1]]
str(a_field, 1)

Note: Other field types may have additional elements. It's always helpful to first inspect the fields of a form before interacting with it.

Interacting with a form

Once we have extracted the form and understood its structure, we can interact with it. The two most elemental interactions are

  1. setting values
  2. submitting the form

Setting values

To set a value in a form, we use the html_form_set() function (previously rvest::set_values()). For example, we can set a the query to a search term: "New York" We pass this information as parameter--value pair to the function:

a_form_filled <- html_form_set(a_form, q = "New York")

Note: If you want to set more than one form input parameter, simply add additional parameter--value pairs to the function call

a_form_filled <- html_form_set(a_form, q = "New York")

We can look at the value of the 'q' text field to verify that our query has been set:

a_form_filled$fields$q$value

Submitting a form

To submit a form, we require the current session object and the filled form. We then pass these objects to the session_submit() function:

sess <- session_submit(sess, a_form_filled)
sess <- session_submit(sess, a_form_filled)

We can verify that the submission has worked by extracting the results shown in the searchable data table:

sess %>% 
  html_element("table") %>% 
  html_table() %>% 
  pull(1) %>% 
  table()

Great! The names of all listed Hockey team contain the term "New York".

Login

Simple sing-on login is easy to handle once you know how to handle HTTP forms with rvest We simply

  1. locate the Login form,
  2. pass our user information (email/user name and password), and
  3. submit the filled Login form

You can try this on https://www.stealmylogin.com/demo.html

# create session
url <- "https://www.stealmylogin.com/demo.html"
sess <- session(url)

# 1. locate the login form
login_form <- html_form(sess)[[1]]

# 2. pass user info

# what fields and types ?
purrr::map_chr(login_form$fields, "type")

# set values
login_form <- html_form_set(
  login_form, 
  # pass values to form fields
  "username" = "test.user@gmail.com", 
  "password" = "123456"
)

# check
purrr::map_chr(login_form$fields, "value")

# 3. submit 
sess <- session_submit(sess, login_form)


# Inspect result:

# was login successful?
httr::status_code(sess) # should be 200

# what info was posted?
URLdecode(rawToChar(sess$response$request$options$postfields))

# where have we been redirected?
sess$url

User agents

When making HTTP requests, we send information about who is making the the request. That is, we tell the server what user agent we are.

Because to scrape content from websites, rvest relies on httr for making HTTP requests, which, in turn, makes these request using the curl program, by default we send this as user agent information when we execute httr::GET or rvest::session().

sess$response$request$options$useragent

The same applies when we read and parse the HTML code of a website with xml2::read_html() because, to read data from a connection, under the hood xml2::read_html() uses curl as well.

There are many reasons why we want to overwrite this default user agent information:

Specifying the user agent

Remember how we said that rvest::session() simply wraps httr::GET() and this makes configuring interactions with websites really easy. Handling user agents is a perfect point in case. When creating a session object, we simply pass a user agent object created with httr::user_agent() to the ... argument of rvest::session():

ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
sess <- session(url, httr::user_agent(ua))

# check that this was successfull
is.null(sess$config$options$useragent) # should be FALSE
sess$config$options$useragent == ua # should be TRUE


theresagessler/learn2scrape documentation built on Dec. 23, 2021, 9:55 a.m.