

chewie

chewie makes it easy to scrape pages without calling multiple extraction methods over and over again. You describe the page in a scheme and pass that scheme to a single function called chew.

A scheme is like a recipe that tells chewie where elements are on a page and how to extract them. Each page you choose to "chew" should have a single scheme composed of a list of instruction objects.

An instruction has the following six fields (more fields may be added in future releases):

title: an arbitrary name for the scraped object
selector: whether path and alternative_path are CSS or XPath selectors, defaults to NULL
path: a CSS or XPath path to the object to be scraped
alternative_path: an alternative CSS or XPath path to the object to be scraped
parse_as: indicates if an extractor should be applied to the resulting scraped item. To return raw objects, just leave it as NULL. Currently available extractors are:
- text
- numeric
- table
- date
- datetime
- difftime
- price
pattern: a RegEx pattern to be applied before parsing
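
To make the interaction between pattern and parse_as concrete, here is a minimal sketch of the "apply a pattern, then parse" idea. It is not chewie's implementation: the helper parse_numeric_with_pattern and the sample text are hypothetical, and base R regex functions are used only to keep the example free of dependencies.

```r
# Hypothetical helper, NOT part of chewie: keep the first regex match
# in each string (if a pattern is given), then parse the result as numeric.
parse_numeric_with_pattern <- function(text, pattern = NULL) {
  if (!is.null(pattern)) {
    m <- regexpr(pattern, text, perl = TRUE)
    matched <- rep(NA_character_, length(text))
    # regmatches() returns only the elements that actually matched
    matched[m != -1] <- regmatches(text, m)
    text <- matched
  }
  # Strings without a match stay NA; whitespace is trimmed before parsing
  as.numeric(trimws(text))
}

raw <- "43 from 33 countries"  # made-up cell text, for illustration only
parse_numeric_with_pattern(raw, pattern = "^(\\d+) ")
#> [1] 43
```

Without the pattern, the same string could not be converted to a number, which is why the pattern is applied before parsing.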

Quickstart

Installation

You can install chewie from GitHub with:

remotes::install_github("leonardodiegues/chewie")

Usage

A scheme can be created either by instantiating a scheme object or from a data.frame. The following chunk shows both approaches using the Rio 2016 100 meters butterfly results:

library(chewie)

swimming_100m_butterfly <- "http://www.olympedia.org/results/357088"

# Load the scheme directly from a `data.frame` whose columns correspond to the available fields.
page_scheme <- tibble::tribble(
  ~title          , ~path,                                           ~selector, ~parse_as, ~pattern,
  "event_name"    , "h1:nth-of-type(1)"                            , NULL     , "text"   , NULL,
  "event_location", "//table[1]/tr[3]/td[1]"                       , "xpath"  , "text"   , NULL,
  "n_participants", "table:nth-of-type(1) > tr:nth-of-type(4) > td", NULL     , "numeric", "^(\\d+) ",
  "event_results" , "//table[2]"                                   , "xpath"  , "table"  , NULL
)

# Or manually add all fields
page_scheme <- scheme(
  list(
    instruction(
      title = "event_name",
      path = "h1:nth-of-type(1)",
      parse_as = "text"
    ),
    instruction(
      title = "event_location",
      path = "//table[1]/tr[3]/td[1]",
      selector = "xpath",
      parse_as = "text"
    ),
    instruction(
      title = "n_participants",
      path = "table:nth-of-type(1) > tr:nth-of-type(4) > td",
      parse_as = "numeric",
      pattern = "^(\\d+) "
    ),
    instruction(
      title = "event_results",
      path = "//table[2]",
      selector = "xpath",
      parse_as = "table"
    )
  )
)

# Chew page based on scheme
results <- chew(scheme = page_scheme, url = swimming_100m_butterfly)

print(results)

Most extraction methods are wrappers around rvest::html_text2 and stringr::str_extract. extract_table goes a step further: besides pulling the table as a data.frame (using rvest::html_table), it adds new URL columns for columns whose cells contain <a> tags. This is the default behavior of extract_table and can't be changed yet (work in progress).
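
The idea of adding a URL column next to a linked column can be sketched with rvest alone. This is an illustration under assumptions, not the code behind extract_table: the inline HTML, the placeholder names and the Athlete_url column name are all made up for the example.

```r
library(rvest)

# Made-up table for illustration; the first column holds links
page <- minimal_html('
  <table>
    <tr><th>Athlete</th><th>Time</th></tr>
    <tr><td><a href="/athletes/1">Swimmer A</a></td><td>50.39</td></tr>
    <tr><td><a href="/athletes/2">Swimmer B</a></td><td>51.14</td></tr>
  </table>')

tbl_node <- html_element(page, "table")
tbl <- html_table(tbl_node)  # plain cell text, links are lost here

# Recover the links: take the first cell of every body row and read the
# href of its <a> tag (NA when a cell has no link)
rows <- html_elements(tbl_node, "tr")[-1]  # drop the header row
first_cells <- html_element(rows, "td:nth-child(1)")
tbl$Athlete_url <- html_attr(html_element(first_cells, "a"), "href")

print(tbl)
```

Because html_element returns one entry per input node, the new column stays aligned with the table rows even when some cells have no link.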

Let's check the resulting table from the fourth instruction:

tbl <- results[[4]]$result

print(tbl)

Parsed HTML pages can also be chewed:

swimming_100m_butterfly_page <- swimming_100m_butterfly |>
  httr::GET() |> 
  httr::content(as = "text") |> 
  rvest::read_html()

chew(scheme = page_scheme, page = swimming_100m_butterfly_page)