
PRArulebook

Project Status: Active – The project has reached a stable, usable state and is being actively developed.

PRArulebook is a package to scrape the PRA (Prudential Regulation Authority) Rulebook, the website containing the rules made and enforced by the PRA under powers conferred by the Financial Services and Markets Act 2000 (FSMA).

The input to this package is the PRA Rulebook website. Outputs from this package are the rules published on the PRA Rulebook website in a format more amenable to text and network analysis.

PRArulebook was developed while preparing:

Amadxarif, Z., Brookes, J., Garbarino, N., Patel, R., Walczak, E. (2019) The Language of Rules: Textual Complexity in Banking Reforms. Staff Working Paper No. 834. Bank of England.

Any use of this package with the PRA Rulebook must comply with the PRA Rulebook’s Terms of Use. These include, but are not limited to, restrictions on using content from the PRA Rulebook for commercial purposes without obtaining a licence from the PRA.

Installation

You can install the development version of PRArulebook from GitHub with:

install.packages("devtools")
devtools::install_github("erzk/PRArulebook")

Data

PRArulebook scrapes two types of data: structure and content.

The next section shows how to extract these types of data.

Examples

Load the package

library(PRArulebook)

Structure

The simplest way to extract the rulebook structure is to use the get_structure function:

# get the structure of the rulebook down to the part-level
parts <-
  get_structure("16-11-2007",
                layer = "part")
# or chapter-level
# warnings (410) are displayed for inactive sites
chapters <-
  get_structure("18-06-2019",
                layer = "chapter")

This will start scraping the PRA Rulebook. Warnings (code 410) will be displayed when a page is no longer active. Pulling data will take longer if you decide to pull more granular data. The rulebook has several layers, each of which can be passed to the layer argument of get_structure; the ones used in this README are, in descending order, part, chapter, and rule.

The output will be a data frame with information about the structure (i.e. URLs and names).

Scraping individual rules is much slower, so a separate function, scrape_rule_structure, should be used:

# extract all rules from the first three chapters
rules <- scrape_rule_structure(chapters[1:3,], "18-06-2019")

Content

Once the structure URLs are scraped, they can be used to extract content.

Text

To get the content of the rulebook (text or links), use the get_content function with the URL of a given chapter.

# scrape text from a single chapter
chapter_text <- get_content(chapters$chapter_url[1])
# or single rule
rule_text <- get_content(rules$rule_url[2], "text", "yes")

This function can be applied to the entire rulebook in the following way:

library(purrr)

chapters_text <-
  map_df(chapters$chapter_url[1:5],
         get_content)
# exception handling might be needed

The output can then be joined to the information about the rulebook structure and aggregated at a higher level.
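For example, the scraped text can be attached back to the structure data frame with a join. This is a minimal sketch using dplyr; it assumes (hypothetically) that the get_content output carries the source URL in a column named url, so check the actual column names in your output and adjust the by argument accordingly:

```r
library(dplyr)

# join scraped text back onto the chapter-level structure
# NOTE: "url" is an assumed column name in chapters_text
chapters_with_text <-
  chapters %>%
  left_join(chapters_text, by = c("chapter_url" = "url"))
```

A left join keeps every chapter from the structure, including those for which no text was scraped.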

Network

To scrape the links and create a data set for network analysis, the get_content function can be used again, but with the type argument set to "links". As in the previous example, this call can also be parallelised.

chapter_link <- get_content(chapters$chapter_url[1], "links")

# sequential
parts_links <-
  purrr::map_df(parts$part_url[1:5],
                get_content,
                "links")

The code above will return a data frame with the from/to URLs, the text used in each link, and the link type.
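The sequential map above can be parallelised. One way to do this is with the furrr package (an assumption for illustration; furrr is not part of PRArulebook):

```r
library(future)
library(furrr)

# run the scraping calls in parallel background R sessions
plan(multisession)

parts_links <-
  future_map_dfr(parts$part_url[1:5],
                 get_content,
                 "links")
```

future_map_dfr is the drop-in parallel counterpart of purrr::map_df; be mindful of the website's Terms of Use when increasing the request rate.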

Scraped data containing information about the links can be used for network analysis (warning: further cleaning might be required).
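As a starting point, the from/to columns can be turned into a graph object with igraph. This sketch assumes the columns are named from and to (hypothetical names; use the actual column names in your link data frame):

```r
library(igraph)

# build a directed graph; the first two columns are treated as the edge list
g <- graph_from_data_frame(parts_links[, c("from", "to")])

# basic summaries of the rulebook link network
vcount(g)  # number of pages (nodes)
ecount(g)  # number of links (edges)
```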

Disclaimer

This package is an outcome of a research project. All errors are mine. All views expressed are personal views, not those of any employer.


