`PRArulebook` is an R package to scrape the PRA (Prudential Regulation Authority) Rulebook, the website containing the rules made and enforced by the PRA under powers conferred by the Financial Services and Markets Act 2000 (FSMA).
The input to this package is the PRA Rulebook website. Outputs from this package are the rules published on the PRA Rulebook website in a format more amenable to text and network analysis.
`PRArulebook` was developed while preparing:

Amadxarif, Z., Brookes, J., Garbarino, N., Patel, R., Walczak, E. (2019) The Language of Rules: Textual Complexity in Banking Reforms. Staff Working Paper No. 834. Bank of England.
Any use of this package with the PRA Rulebook must comply with the PRA Rulebook's Terms of Use. These include, but are not limited to, restrictions on using content from the PRA Rulebook for commercial purposes without obtaining a licence from the PRA.
You can install the development version of `PRArulebook` from GitHub with:

```r
install.packages("devtools")
devtools::install_github("erzk/PRArulebook")
```
`PRArulebook` scrapes two types of data: structure and content.

- Structure: the hierarchy of the PRA Rulebook, including URLs and names.
- Content: the text of the rules, and the links between them.

The next section shows how to extract these types of data.
Load the package:

```r
library(PRArulebook)
```
The simplest way to extract the rulebook structure is to use the `get_structure` function:

```r
# get the structure of the rulebook down to the part-level
parts <- get_structure("16-11-2007", layer = "part")
# or chapter-level
# warnings (410) are displayed for inactive sites
chapters <- get_structure("18-06-2019", layer = "chapter")
```
This will start scraping the PRA Rulebook. Warnings (code 410) will be displayed when a page is no longer active. Pulling more granular data will take longer. The rulebook has several layers, each of which can be passed to the `layer` argument of `get_structure` (in descending order):

- sector
- part
- chapter

The output will be a data frame with information about the structure (i.e. URLs and names).
Scraping individual rules is much slower, so a different function is used:

```r
# extract all rules from the first three chapters
rules <- scrape_rule_structure(chapters[1:3,], "18-06-2019")
```
Once the structure URLs are scraped, they can be used to extract content. To get the content of the rulebook (text or links), use the `get_content` function with the URL of a given chapter:

```r
# scrape text from a single chapter
chapter_text <- get_content(chapters$chapter_url[1])
# or a single rule
rule_text <- get_content(rules$rule_url[2], "text", "yes")
```
This function can be applied to the entire rulebook in the following way:

```r
library(purrr)
# exception handling might be needed
chapters_text <- map_df(chapters$chapter_url[1:5], get_content)
```
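One way to handle the exceptions mentioned above is to wrap `get_content` with `purrr::possibly`, which returns a fallback value instead of stopping on an error (a sketch; with `otherwise = NULL`, failed pages are simply dropped when the rows are bound together):

```r
library(purrr)

# wrap get_content so that a failing URL yields NULL instead of an error
safe_get_content <- possibly(get_content, otherwise = NULL)

# NULL results are silently dropped by map_df when binding rows
chapters_text <- map_df(chapters$chapter_url[1:5], safe_get_content)
```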
The output can then be joined to the information about the rulebook structure and aggregated at a higher level.
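A minimal sketch of such a join, assuming both data frames carry the chapter URL (the join-key column names below are assumptions; inspect the actual output first):

```r
library(dplyr)

# attach the scraped text back onto the structure data frame,
# matching on the chapter URL (column names are assumptions)
chapters_full <- left_join(chapters, chapters_text, by = "chapter_url")
```

From there the text can be aggregated, e.g. counted or concatenated per part or per sector, using the structure columns.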
To scrape the links and create a data set for network analysis, the `get_content` function can be used with the `type` argument set to `"links"`. As in the previous example, this call can be mapped over many URLs, and parallelised if needed.
```r
chapter_link <- get_content(chapters$chapter_url[1], "links")
# sequential
parts_links <- purrr::map_df(parts$part_url[1:5], get_content, "links")
```
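A parallel variant of the sequential call above can be sketched with the `furrr` package, which provides a drop-in parallel replacement for `purrr`'s map functions (the worker count is an arbitrary example):

```r
library(future)
library(furrr)

# run the scraping calls across 4 background R sessions
plan(multisession, workers = 4)
parts_links <- future_map_dfr(parts$part_url[1:5], get_content, "links")
```

Note that parallel scraping sends requests faster; keep the worker count modest to stay polite to the website.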
The code above will return a data frame with the from/to URLs, the text used in each link, and the link type.

The scraped link data can be used for network analysis (warning: further cleaning might be required).
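For example, the from/to columns can be turned into a directed graph with `igraph` (a sketch; the column names `from` and `to` are assumptions and may differ in the actual output):

```r
library(igraph)

# build a directed graph from the from/to URL columns
# (column names are an assumption; inspect the data frame first)
g <- graph_from_data_frame(parts_links[, c("from", "to")], directed = TRUE)

# basic summaries of the link network
vcount(g)  # number of nodes (pages)
ecount(g)  # number of edges (links)
```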
This package is an outcome of a research project. All errors are mine. All views expressed are personal views, not those of any employer.