read_html_live: Live web scraping (with chromote)
In rvest: Easily Harvest (Scrape) Web Pages

Description

read_html() operates on the HTML source code downloaded from the server. This works for most websites but can fail if the site uses JavaScript to generate the HTML. read_html_live() provides an alternative interface that runs a live web browser (Chrome) in the background. This gives you access to elements of the page that are generated dynamically by JavaScript, and lets you interact with the live page by clicking buttons or typing into forms.
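
The limitation of static parsing can be reproduced offline. The following is a minimal sketch, assuming rvest is installed; the inline page and its "row" class are made up for illustration. The paragraph element only exists once a browser has run the script, so the static parser finds no match for the selector.

```r
library(rvest)

# Hypothetical page: the only <p class="row"> is created by JavaScript.
page <- minimal_html('
  <div id="target"></div>
  <script>
    var p = document.createElement("p");
    p.className = "row";
    p.textContent = "generated";
    document.getElementById("target").appendChild(p);
  </script>
')

# The HTML is parsed as-is and the script is never executed,
# so the generated element is absent from the parsed document.
static_rows <- html_elements(page, ".row")
length(static_rows)
```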

Behind the scenes, this function uses the chromote package, which requires that you have a copy of Google Chrome installed on your machine.
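
Before calling read_html_live(), it can help to check that chromote is able to locate a browser. A minimal sketch, assuming the chromote package is installed and using its find_chrome() helper; the messages are illustrative only:

```r
# Ask chromote where it would look for Chrome; NULL means none was found.
chrome_path <- chromote::find_chrome()

if (is.null(chrome_path) || !nzchar(chrome_path)) {
  message("Chrome was not found; read_html_live() will not be able to start.")
} else {
  message("chromote will use: ", chrome_path)
}
```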

Usage

read_html_live(url)

Arguments

url

URL of the website to read.

Value

read_html_live() returns an R6 LiveHTML object. You can interact with this object using the usual rvest functions, or call its methods, such as $click(), $scroll_to(), and $type(), to interact with the live page as a human would.
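
As a rough sketch of how these methods might be combined, the hypothetical helper below (search_live() and all of its arguments are assumptions, not part of rvest) types into a field, clicks a button, scrolls down, and then extracts text with the usual rvest functions. Every CSS selector is site-specific and must be supplied by the caller.

```r
library(rvest)

# Hypothetical helper, not part of rvest: drive a live page and
# return the text of the elements matching `result_css`.
search_live <- function(url, input_css, query, button_css, result_css) {
  sess <- read_html_live(url)

  sess$type(input_css, query)   # type text into the matched element
  sess$click(button_css)        # click the matched element
  sess$scroll_to(top = 10000)   # scroll down in case content loads lazily

  # The LiveHTML object works with the usual rvest selectors.
  sess %>% html_elements(result_css) %>% html_text()
}
```

Defining the function does not start a browser; Chrome is only launched when it is called. Pages that update asynchronously may need a short pause (for example Sys.sleep()) between the interaction and the extraction step.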

Examples

## Not run: 
library(rvest)

# When we retrieve the raw HTML for this site, it doesn't contain the
# data we're interested in:
static <- read_html("https://www.forbes.com/top-colleges/")
static %>% html_elements(".TopColleges2023_tableRow__BYOSU")

# Instead, we need to run the site in a real web browser, causing it to
# download a JSON file and then dynamically generate the HTML:
sess <- read_html_live("https://www.forbes.com/top-colleges/")

# Show a live view of the page to check what the browser has rendered
sess$view()

# Select the table rows, then pull one field per row
rows <- sess %>% html_elements(".TopColleges2023_tableRow__BYOSU")
rows %>% html_element(".TopColleges2023_organizationName__J1lEV") %>% html_text()
rows %>% html_element(".grant-aid") %>% html_text()

## End(Not run)

rvest documentation built on June 22, 2024, 10:47 a.m.