The goal of chewie is to easily scrape pages without having to call multiple extraction methods over and over again. It simplifies the process by feeding a scheme to a single method called chew.
A scheme is like a recipe that tells chewie where elements are in a page and how to extract them. Each page you choose to "chew" should have a single scheme composed of a list of instruction objects.
An instruction is formed by the following 6 fields (more fields could be added in future releases):
title: an arbitrary name for the scraped object
selector: whether path/alternative_path is a css or xpath selector; defaults to NULL
path: a css or xpath path to the object to be scraped
alternative_path: an alternative css or xpath path to the object to be scraped
parse_as: indicates which extractor should be applied to the resulting scraped item. To return raw objects, leave it as NULL. Currently available extractors are: text, numeric, table, date, datetime, difftime, price
pattern: a RegEx pattern to be applied before parsing

You can install the released version of chewie with:
remotes::install_github("leonardodiegues/chewie")
Schemes can be loaded either by instantiating a scheme or from a data.frame object. The following chunk illustrates both approaches using the Rio 2016 100 meters butterfly results:
library(chewie)

swimming_100m_butterfly <- "http://www.olympedia.org/results/357088"

# Directly load scheme from a `data.frame` containing columns corresponding to
# available fields.
page_scheme <- tibble::tribble(
  ~title,           ~path,                                           ~selector, ~parse_as, ~pattern,
  "event_name",     "h1:nth-of-type(1)",                             NULL,      "text",    NULL,
  "event_location", "//table[1]/tr[3]/td[1]",                        "xpath",   "text",    NULL,
  "n_participants", "table:nth-of-type(1) > tr:nth-of-type(4) > td", NULL,      "numeric", "^(\\d+) ",
  "event_results",  "//table[2]",                                    "xpath",   "table",   NULL
)

# Or manually add all fields
page_scheme <- scheme(
  list(
    instruction(
      title = "event_name",
      path = "h1:nth-of-type(1)",
      parse_as = "text"
    ),
    instruction(
      title = "event_location",
      path = "//table[1]/tr[3]/td[1]",
      selector = "xpath",
      parse_as = "text"
    ),
    instruction(
      title = "n_participants",
      path = "table:nth-of-type(1) > tr:nth-of-type(4) > td",
      parse_as = "numeric",
      pattern = "^(\\d+) "
    ),
    instruction(
      title = "event_results",
      path = "//table[2]",
      selector = "xpath",
      parse_as = "table"
    )
  )
)

# Chew page based on scheme
results <- chew(scheme = page_scheme, url = swimming_100m_butterfly)
print(results)
Generally, extraction methods are wrappers around rvest::html_text2 and stringr::str_extract. In the case of extract_table, it is useful not only to pull the table as a data.frame (using rvest::html_table) but also to add new URL columns based on columns that have a (anchor) tags attached to them. This is the default behavior of extract_table and can't be changed yet (work in progress).
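To make the idea concrete, here is a minimal sketch of what such wrappers might look like, written directly against rvest and stringr. It is not chewie's actual implementation: the helper names `sketch_extract_text` and `sketch_extract_table`, the `_url` column suffix, and the toy HTML are all assumptions made for illustration. The text helper also uses `stringr::str_match` instead of `str_extract` so that only the capture group is kept.

```r
library(rvest)
library(stringr)

# Hypothetical helper (not chewie's API): read an element's text and, if a
# pattern is given, keep only its first capture group.
sketch_extract_text <- function(page, css, pattern = NULL) {
  text <- html_text2(html_element(page, css))
  if (!is.null(pattern)) {
    text <- str_match(text, pattern)[, 2]
  }
  text
}

# Hypothetical helper (not chewie's API): parse a table and append a
# "<column>_url" column for every column whose body cells contain links.
# Assumes header cells are <th> and that no cell spans rows or columns.
sketch_extract_table <- function(page, css) {
  node <- html_element(page, css)
  tbl <- html_table(node)
  for (i in seq_along(tbl)) {
    # Body cells of the i-th column, one per data row
    cells <- html_elements(node, paste0("tr > td:nth-child(", i, ")"))
    # First link in each cell; cells without a link yield NA
    hrefs <- html_attr(html_element(cells, "a"), "href")
    if (length(hrefs) == nrow(tbl) && any(!is.na(hrefs))) {
      tbl[[paste0(names(tbl)[i], "_url")]] <- hrefs
    }
  }
  tbl
}

# Toy page with placeholder content, only to exercise the helpers
page <- minimal_html('
  <p id="n">8 swimmers</p>
  <table>
    <tr><th>Athlete</th><th>Time</th></tr>
    <tr><td><a href="/athletes/1">A</a></td><td>50.39</td></tr>
    <tr><td><a href="/athletes/2">B</a></td><td>51.14</td></tr>
  </table>')

n_swimmers <- sketch_extract_text(page, "#n", "^(\\d+) ")
results_tbl <- sketch_extract_table(page, "table")
print(results_tbl)
```

In this sketch only the Athlete column gains a companion URL column, because the Time cells contain no links.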
Let's check the resulting table from the fourth instruction:
tbl <- results[[4]]$result
print(tbl)
Parsed HTML pages can also be chewed:
swimming_100m_butterfly_page <- swimming_100m_butterfly |>
  httr::GET() |>
  httr::content(as = "text") |>
  rvest::read_html()

chew(scheme = page_scheme, page = swimming_100m_butterfly_page)