knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(PRArulebook)

Get the structure:

# scrape the sector-level structure of the Handbook (as at 22-03-2006),
# then the part-level structure within each sector
sectors <- scrape_sector_structure("http://www.prarulebook.co.uk/rulebook/Home/Handbook/22-03-2006")
parts <- scrape_part_structure(sectors)

See what was scraped:

library(dplyr)

dplyr::glimpse(parts)

Visualise the structure:

library(collapsibleTree)

# copy under a new name so the tree's root node is labelled "Handbook"
Handbook <- parts

collapsibleTree(
  Handbook,
  hierarchy = c("sector_name", "part_name"),
  width = 900,
  height = 1100,
  zoomable = FALSE,
  collapsed = FALSE
)

Obtaining rule-level content (including rule URLs) takes a bit longer. First, the chapters need to be scraped; the resulting data frame of chapters can then be used to obtain the rules:

chapters <- scrape_chapter_structure(parts)
# scrape only the first three chapters here to keep the example quick
rules <-
  scrape_rule_structure(chapters[1:3, ],
                        rulebook_date = "22-03-2006")

This will generate a data frame with rule-level structure. The next step (if you require the lowest level of the data) is obtaining the rule IDs and text. This is very slow because individual rule IDs are not easy to extract and the scraper needs to visit every single rule URL.

# `get_content` has to be called separately for every rule to be scraped;
# here only the first rule is fetched
rule_text <- get_content(rules$rule_url[1], "text", "yes")
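
To scrape the text of every rule rather than just the first one, the same call can be mapped over the whole `rule_url` column. Below is a minimal sketch using purrr (the approach the package relies on, as described in the next section); it assumes `get_content(url, "text", "yes")` returns the text of a single rule, as in the call above, and simply visits each URL in turn.

library(purrr)

# call get_content() on every rule URL; each URL is visited sequentially,
# so expect this to be slow for a full rulebook
all_rule_text <-
  purrr::map(rules$rule_url,
             get_content,
             "text", "yes")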

Faster scraping

The future and furrr packages were tested to speed up the scraping, but this approach often resulted in errors, so it was ultimately not used in this package; purrr was used instead. A sequential purrr equivalent is sketched after the furrr example below.

Here is an example of using furrr to acquire chapter-level data:

# scrape part-level data
df <-
  get_structure("01-01-2010",
                layer = "part")

# start parallel processing
# (note: `multiprocess` is deprecated in newer versions of future;
#  `multisession` is the modern equivalent)
library(future)
plan(multiprocess)

# get all chapters and row-bind them into a single data frame
chapters <-
  furrr::future_map_dfr(df$part_url,
                        scrape_menu,
                        selector = ".Chapter a",
                        .progress = TRUE)
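
For comparison, here is a sketch of the sequential purrr equivalent of the chapter-level step, which is the approach the package actually uses. It assumes the same `df` and `scrape_menu` call as the furrr example, just without parallelisation, so it is slower but avoids the errors seen with parallel scraping.

# get all chapters sequentially and row-bind them into a single data frame
chapters <-
  purrr::map_dfr(df$part_url,
                 scrape_menu,
                 selector = ".Chapter a")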

