knitr::opts_chunk$set(
  # code chunk options
  echo = FALSE
  , eval = TRUE
  , warning = FALSE
  , message = FALSE
  , cache = FALSE
  , exercise = TRUE
  , context = "data"
)
library(learnr)
library(dplyr)
# library(learn2scrape)
library(rvest)
In this tutorial, we show some important advanced web scraping techniques.
Sessions
Thus far, we have relied on xml2::read_html() to read and parse a website's HTML code.
This is fine for many purposes.
But a more flexible way to interact with websites is through a browser session.
This has many advantages.
Sessions can be created in rvest with the session() function (html_session() for rvest < 1.0.0).
The only required parameter is url, the address of the webpage to request.
Additional (optional) parameters can be passed to ..., however, and they are forwarded to httr::GET().
This makes configuring rvest sessions particularly easy.
Some of the most important features are httr::add_headers(), httr::authenticate(), httr::set_cookies(), and httr::use_proxy().
First, have a look at this page, which we will scrape in the example.
url <- "https://scrapethissite.com/pages/simple/"
sess <- session(url)
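As a rough sketch of how the optional configuration mentioned above might be put together, the following builds a combined httr request configuration without sending anything over the network. The header value, user agent string, and timeout are illustrative placeholders, not values from this tutorial.

```r
library(httr)

# Build request options offline; all values below are placeholders.
cfg <- c(
  add_headers(`Accept-Language` = "en-GB"),
  user_agent("my-scraper (contact: me@example.org)"),
  timeout(10)
)

# Inspect what would be sent along with a request
cfg$headers
cfg$options$useragent

# The combined configuration could then be forwarded through `...`, e.g.:
# sess <- rvest::session("https://scrapethissite.com/pages/simple/", cfg)
```

Building the configuration first makes it easy to check the options before any request is made.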
Extra: A peek under the hood of rvest::session()
What does rvest::session() actually do?
The documentation (?rvest::session) is not particularly informative.
But we can learn something by looking at the function's source code.
It first creates a 'session' object by calling the base R structure() function:
session <- structure(
  list(
    # passed to argument 'handle' when calling httr::GET
    handle = httr::handle(url)
    # passed to argument 'config' when calling httr::GET
    , config = c(..., httr::config(autoreferer = 1L))
    # placeholder for URL that will be queried by rvest:::session_get
    , url = NULL
    , back = character()
    , forward = character()
    # placeholder for response object that will be returned by httr::GET
    , response = NULL
    # ignore this
    , html = new.env(parent = emptyenv(), hash = FALSE)
  )
  , class = "session"
)
Next, it passes the 'session' object and the input URL to an internal function:
rvest:::session_get(session, url)
This function makes a GET request using the httr package (see ?httr::GET).
Specifically, the input URL is passed to httr::GET()'s url argument, the session's 'config' element to httr::GET()'s config argument, and the session's 'handle' element to httr::GET()'s handle argument.
So rvest::session() is basically a wrapper around httr::GET() that eases interaction with request and response objects.
If we inspect the return object assigned to 'sess', we see that it is an 'rvest_session' object.
Objects of this class contain httr 'handle', 'request' and 'response' objects.
In addition, 'rvest_session' objects record the current URL and the current page's HTML code (in the 'cache' element).
url <- "https://scrapethissite.com/pages/simple/"
sess <- session(url)
sess
str(sess, 1)
sess$response
Because the (parsed) HTML code of the requested page is associated with the 'rvest_session' object, we can call any rvest function on it as we would on the html_document returned by xml2::read_html().
For example:
url <- "https://scrapethissite.com/pages/simple/"
sess <- session(url)
sess %>%
  html_elements(".country-name") %>%
  html_text(trim = TRUE) %>%
  tibble(country_name = .)
But session objects also allow navigating.
To navigate to a new page, simply call the session_jump_to() function with the session object and the target URL.
The session object will then keep track of the current URL and all previously visited URLs.
They can be reported by calling session_history().
And because the session object 'remembers' the previous URL, we can go not only forward, but also backward.
url <- "https://scrapethissite.com/pages/simple/"
sess <- session(url)
# move forward
sess <- session_jump_to(sess, "https://www.scrapethissite.com/pages/forms")
# inspect history
sess$url
sess$back
session_history(sess)
# back to first page
sess <- session_back(sess)
sess$url
# and again back to second page
sess <- session_forward(sess)
sess$url
HTML forms allow collecting user input. The page https://scrapethissite.com/pages/forms/ shows a simple example of a form, in this case a search bar. The search bar allows users to enter keywords that filter the data shown in the table below it.
Let's use this example to see how to interact with forms using rvest.
url <- "https://scrapethissite.com/pages/forms/"
sess <- session(url)
Forms included in a page can be extracted with the html_form() function.
This function returns a list of 'rvest_form' objects.
url <- "https://scrapethissite.com/pages/forms/"
sess <- session(url)
forms <- html_form(sess)
str(forms, 1)
Note: If you apply html_form() directly to a single 'form' web element, as returned by html_element(), instead of to a list of web elements, it returns the 'rvest_form' object itself rather than a list.
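To make the note above concrete, here is a minimal offline sketch. It assumes a small hand-written page built with rvest::minimal_html() rather than the tutorial's site; the markup and the base_url value are made up for illustration.

```r
library(rvest)

# A tiny hand-written page (hypothetical markup, not the tutorial's site)
page <- minimal_html('
  <form method="get" action="/search">
    <input type="text" name="q">
    <input type="submit" value="Search">
  </form>
')

# Applied to the whole document: a list of forms.
# base_url is a placeholder used to resolve the form's relative action.
forms <- html_form(page, base_url = "https://example.org")
class(forms)
length(forms)

# Applied to a single <form> element: the form object itself
a_form <- html_form(html_element(page, "form"), base_url = "https://example.org")
class(a_form)
```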
To interact with a specific form, we extract it from the list. In this example, there is only one form, which corresponds to the search bar displayed at the top of the page.
url <- "https://scrapethissite.com/pages/forms/"
sess <- session(url)
forms <- html_form(sess)
a_form <- forms[[1]]
str(a_form, 1)
url <- "https://scrapethissite.com/pages/forms/"
sess <- session(url)
a_form <- html_form(sess)[[1]]
If the form is named, its name can be accessed in the 'name' element. The 'method' element notes the HTTP method. Most important is the 'fields' element. It records the fields of the form.
a_form$fields
Each element is an 'rvest_field' object with four elements:
a_field <- a_form$fields[[1]]
str(a_field, 1)
Note: Other field types may have additional elements. It's always helpful to first inspect the fields of a form before interacting with it.
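One possible way to get a quick overview of a form's fields before interacting with it is to tabulate their names and types. The sketch below uses a hand-written form (hypothetical markup and field names) so that it runs offline.

```r
library(rvest)

# Hypothetical form with three different input types
page <- minimal_html('
  <form method="get" action="/search">
    <input type="text" name="q" value="">
    <input type="hidden" name="token" value="abc123">
    <input type="submit" name="go" value="Search">
  </form>
')
form <- html_form(page, base_url = "https://example.org")[[1]]

# One row per field: its name and its type
field_overview <- data.frame(
  name = names(form$fields),
  type = vapply(form$fields, function(f) f$type, character(1)),
  row.names = NULL
)
field_overview
```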
Once we have extracted the form and understood its structure, we can interact with it. The two most elemental interactions are setting values in the form's fields and submitting the form.
To set a value in a form, we use the html_form_set() function (previously rvest::set_values()).
For example, we can set the query to the search term "New York".
We pass this information as a parameter--value pair to the function:
a_form_filled <- html_form_set(a_form, q = "New York")
Note: If you want to set more than one form input parameter, simply add additional parameter--value pairs to the function call.
We can look at the value of the 'q' text field to verify that our query has been set:
a_form_filled$fields$q$value
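Setting several fields in one call, as mentioned in the note above, might look like the following sketch. It assumes a hand-written form with two text inputs; the field name per_page is hypothetical and does not exist on the tutorial's example page.

```r
library(rvest)

# Hypothetical form with two text inputs
page <- minimal_html('
  <form method="get" action="/search">
    <input type="text" name="q">
    <input type="text" name="per_page">
  </form>
')
form <- html_form(page, base_url = "https://example.org")[[1]]

# Each parameter--value pair sets the field of the same name
form_filled <- html_form_set(form, q = "New York", per_page = "25")

# Verify that both values were recorded
form_filled$fields$q$value
form_filled$fields$per_page$value
```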
To submit a form, we require the current session object and the filled form.
We then pass these objects to the session_submit() function:
sess <- session_submit(sess, a_form_filled)
We can verify that the submission has worked by extracting the results shown in the searchable data table:
sess %>% html_element("table") %>% html_table() %>% pull(1) %>% table()
Great! The names of all listed hockey teams contain the term "New York".
Simple sign-on login is easy to handle once you know how to handle HTTP forms with rvest.
We simply locate the login form, fill in the user information, and submit the form.
You can try this on https://www.stealmylogin.com/demo.html
# create session
url <- "https://www.stealmylogin.com/demo.html"
sess <- session(url)

# 1. locate the login form
login_form <- html_form(sess)[[1]]

# 2. pass user info
# what fields and types?
purrr::map_chr(login_form$fields, "type")
# set values
login_form <- html_form_set(
  login_form,
  # pass values to form fields
  "username" = "test.user@gmail.com",
  "password" = "123456"
)
# check
purrr::map_chr(login_form$fields, "value")

# 3. submit
sess <- session_submit(sess, login_form)

# Inspect result:
# was login successful?
httr::status_code(sess) # should be 200
# what info was posted?
URLdecode(rawToChar(sess$response$request$options$postfields))
# where have we been redirected?
sess$url
When making HTTP requests, we send information about who is making the request. That is, we tell the server what user agent we are.
To scrape content from websites, rvest relies on httr for making HTTP requests, and httr in turn makes these requests using curl.
By default, curl's user agent information is therefore what we send when we execute httr::GET() or rvest::session().
sess$response$request$options$useragent
The same applies when we read and parse the HTML code of a website with xml2::read_html() because, to read data from a connection, xml2::read_html() uses curl under the hood as well.
There are many reasons why we might want to overwrite this default user agent information.
Remember how we said that rvest::session() simply wraps httr::GET(), and that this makes configuring interactions with websites really easy?
Handling user agents is a perfect case in point.
When creating a session object, we simply pass a user agent object created with httr::user_agent() to the ... argument of rvest::session():
ua <- "Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.114 Safari/537.36"
sess <- session(url, httr::user_agent(ua))
# check that this was successful
is.null(sess$config$options$useragent) # should be FALSE
sess$config$options$useragent == ua # should be TRUE