knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(stortingscrape)
A wide variety of parliamentary data have been made available to the public in several countries over the last decade. Be it through frontend websites or back-end APIs, researchers on parliaments have never had easier access to large amounts of data than they do now. However, both frontend and API scraped data often come in formats (.html, .xml, .json, etc) that require substantial structuring and pre-processing before they are ready for subsequent analyses.
In this vignette, I present the stortingscrape
package for R.
stortingscrape
makes it easier to retrieve data from the Norwegian parliament
(Stortinget) through its easily accessible back-end API. The data
requested using the package require little to no further structuring. The scope
of the package, discussed further below, ranges from general data on the
parliament itself (rules, session info, committees, etc.) to data on the parties,
biographies of the MPs, questions, hearings, debates, votes, and more.
Although this is the first attempt to make data on Stortinget more easily
accessible, stortingscrape
does not live in a vacuum. A variety of
parliamentary data for different countries are available for researchers to use
freely. For parliamentary debates, @Mat:Pan:Lee:06 were among the first
to gather and make available data. Their data cover the proceedings of the 2005
House debates. @Eggers2014 structured the UK Hansard speech data, which
spans from 1802 to 2010. @Bee:Thi:Alb:17 provided continuously updated data
for the Canadian parliament, @Rauh2020 made available a collection of
speech data from 9 countries, and @Turner-Zwinkels2021 developed a
day-by-day dataset of MPs in Germany, Switzerland, and the Netherlands, in the
period between 1947 and 2017. These examples are, however, different from
stortingscrape
in that they are finished datasets ready for download and
have limited scope.
The main goal of stortingscrape
is to allow researchers to access any data
from the Norwegian parliament easily, but also still be able to structure the
data according to one's needs. Most importantly, the package is designed to
facilitate weaving together different parts of the data.stortinget.no API.
I will start this vignette by briefly discussing the openly accessible
data.stortinget.no API. Next, I will describe the philosophy, scope and general
usage of the stortingscrape
package. Finally, I will present some minimal
examples of possible workflows for working with the package, before I summarize.
The Norwegian parliament was comparatively early in granting open access to
their data through an API when they launched data.stortinget.no in 2012.
The general purpose of the API is to provide transparency in the form of raw
data, mirroring the frontend web-page information from stortinget.no. The
format of the API has been fairly consistent over the time of its existence, but
there have been some small style changes over different versions.^[See
stortingscrape::get_publication()
for instance] stortingscrape
was
built under version $1.6$ of the API.
Except for content that is blocked for the public (e.g. debates behind closed
doors), the API contains all recorded data produced in Stortinget. These data
include data on individual MPs, transcripts from debates, voting results,
hearing input, and much more. An exhaustive list of all data sources in the
API is available online.^[See https://martigso.github.io/stortingscrape/functions.html] The data
available in the API can be accessed in XML or JSON
format^[stortingscrape
exclusively works with XML.], both of which
are flexible formats for representing data as nested lists.
As an example, the raw data input for general information about a single
MP^[stortingscrape::get_mp("MAAA")
] looks like this:
cat(
"<person>
  <respons_dato_tid>2021-08-13T14:59:48.2114895+02:00</respons_dato_tid>
  <versjon>1.6</versjon>
  <doedsdato>0001-01-01T00:00:00</doedsdato>
  <etternavn>Aasen</etternavn>
  <foedselsdato>1967-02-21T00:00:00</foedselsdato>
  <fornavn>Marianne</fornavn>
  <id>MAAA</id>
  <kjoenn>kvinne</kjoenn>
</person>
"
)
This is the typical XML structure in the API, although other parts of the data are more complex in that the XML tree can be nested multiple times. This will be discussed further in the next section.
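To illustrate what flattening such a record into a two-dimensional format involves, here is a minimal sketch using the xml2 package. Using xml2 directly is an assumption made for illustration; the internals of stortingscrape may differ. It turns each child element of a shortened copy of the response above into a column of a one-row data frame.

```r
library(xml2)

# Shortened copy of the raw API response shown above
raw <- "<person>
  <versjon>1.6</versjon>
  <etternavn>Aasen</etternavn>
  <fornavn>Marianne</fornavn>
  <id>MAAA</id>
  <kjoenn>kvinne</kjoenn>
</person>"

doc <- read_xml(raw)
fields <- xml_children(doc)  # element nodes directly below <person>

# One column per child element, named after its XML tag
mp <- as.data.frame(
  as.list(setNames(xml_text(fields), xml_name(fields))),
  stringsAsFactors = FALSE
)
mp
```

This only works because the record is flat. Once an element contains further nested elements, a single row is no longer enough, which is the situation discussed next.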
stortingscrape
aims to make Norwegian parliamentary data easily
accessible, while also being flexible enough for tailoring the different
underlying data sources to one's needs. Indeed, contrary to most open source
parliamentary speech data, stortingscrape
aims at giving the user as much
agency as possible in tailoring data for specific needs. In addition to user
agency, the package is built with a core philosophy of simplifying data
structures, making workflows between different parts of the
Storting API seamless, and limiting data duplication between functions.
Because many analysis tools in R require two-dimensional data
formats, the stortingscrape
package prioritizes converting the nested XML
format to data frames when possible. However, some sources of data from the
Storting API are nested in a way that makes retaining all data in a
two-dimensional space either impossible or too verbose. For example, the
get_mp_bio()
function, which extracts a specific MP's biography by id, has
data on MP personalia, parliamentary periods in which the MP held a seat, vocations,
literature authored by the MP, and more. In order to make all these data
workable, the resulting format from the function call is a list of data frames
for each part of the data. The different list elements are, however, easily
combined for different applications of the data.
One of the core thoughts behind the workflow of the package is to make it easy
to combine different parts of the API and to extract the data you actually need.
To facilitate this, most functions within stortingscrape
are built to work
seamlessly with the apply()
family or control flow constructs in
R. Because we do not want to overload the API with rapid successive calls, functions that
are expected to be run repeatedly have a good_manners
argument.
This will make R sleep for the set number of seconds after calling
the API. It is recommended to set this argument to 2 seconds or higher on multiple
calls to the API. Generally, the package is built following the recommendations given
by the httr2
package [@wickham2023]^[Especially, see
https://httr2.r-lib.org/articles/wrapping-apis.html].
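The pattern described above can be sketched without touching the network. In the example below, fetch_one() is a hypothetical stand-in for an API-calling function (it is not part of stortingscrape), and the IDs other than "MAAA" are made up. The stand-in sleeps after each simulated call, mimicking what the good_manners argument is described as doing.

```r
# Hypothetical stand-in for an API-calling function; it mimics the
# good_manners behaviour described above without calling the API
fetch_one <- function(id, good_manners = 0) {
  result <- data.frame(id = id, n_chars = nchar(id), stringsAsFactors = FALSE)
  Sys.sleep(good_manners)  # pause after the (simulated) call
  result
}

ids <- c("MAAA", "ID_B", "ID_C")  # "ID_B" and "ID_C" are made-up IDs

# Extra arguments to lapply() are passed on to every call.
# A short pause keeps the sketch fast; against the real API,
# 2 seconds or more is recommended.
results <- lapply(ids, fetch_one, good_manners = 0.1)

# Stack the per-ID data frames into a single data frame
combined <- do.call(rbind, results)
combined
```

The same lapply() and do.call(rbind, ...) combination is used with get_result_vote() in the workflow section.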
Most of the data from Stortinget's API and frontend web page are interconnected
through ids for the various sources (session id, MP id, case id, question id,
vote id, etc.). The core extraction methods of
stortingscrape
are based around
these. One of the major benefits of this is that whether you want to extract,
for instance, a single question found on the frontend web page, or all
questions for a parliamentary session, the package is flexible enough to suit
both needs (see the workflow section). It will also enable users to
quickly retrieve data found on the frontend
web-page (stortinget.no), as the ids are embedded in the
urls.
Because of the interconnectedness of the API's data, there are some overlapping
sources of data. For instance, the API sources for MP general information
(get_mp()
), biography (get_mp_bio()
), and all MPs for a session
(get_parlperiod_mps()
) all contain the name of the MP, but only
get_mp()
will return MP names in stortingscrape
, because these
data sources are easily merged by the MP's id (see the workflow section).
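As a minimal sketch of such a merge, the toy data frames below stand in for the output of get_parlperiod_mps() and for stacked get_mp() results. All values are made up, and the column names of the second data frame (id, first_name, last_name) are assumptions made for illustration.

```r
# Toy stand-in for get_parlperiod_mps() output (values are made up)
period_mps <- data.frame(
  mp_id = c("MP_A", "MP_B"),
  party_id = c("A", "H"),
  period_id = "2017-2021",
  stringsAsFactors = FALSE
)

# Toy stand-in for stacked get_mp() output; column names are assumptions
mp_names <- data.frame(
  id = c("MP_A", "MP_B"),
  first_name = c("Anna", "Bjorn"),
  last_name = c("Example", "Sample"),
  stringsAsFactors = FALSE
)

# Left join on the MP id, which is named differently in the two sources
mps <- merge(period_mps, mp_names, by.x = "mp_id", by.y = "id", all.x = TRUE)
mps
```

Setting all.x = TRUE keeps every MP from the period data even if no name record was retrieved for them.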
The scope of stortingscrape
is almost the entire API of Stortinget, with
some notable shortcomings. First, there are no functions for dynamically
updated data sources, such as current speaker lists
(https://data.stortinget.no/dokumentasjon-og-hjelp/talerliste/). Second,
as mentioned above, duplicated data is avoided whenever possible. Third, certain
unstandardized image sources -- such as publication attachment figures -- are
not supported in the package. And finally, publications from the
get_publication()
function can be retrieved, but are returned in a
parsed XML data format from the rvest
package because these data are
not standardized across different publications.
There are three overarching sources of data in stortingscrape
: 1)
Parliamentary structure data, 2) MP data, and 3) Parliamentary activity data.
These are, in most cases, linked by various forms of ID tags. For example,
retrieving all MPs for a given session (get_parlperiod_mps()
) will give
access to MP IDs (mp_id
) for that session, which can be used to extract
biographies, pictures, speech activity, and more for those MPs. Next, I will
showcase some examples of what a typical workflow for using stortingscrape
could look like.
In the following section, I will discuss some examples of data extraction with
stortingscrape
. I start by showing basic extraction of voting data based on
vote IDs from the frontend web-page --
stortinget.no. Next, I exemplify the large set of
period and session specific data by retrieving all MPs for a specific
parliamentary period and all interpellations for a specified parliamentary
session. Finally, I show how the different functions of the stortingscrape
package work together -- merging data on cases with their corresponding vote
results. Note that the vignette is built using the examples in the
data folder of the package.^[This is done in order to not call the API each
time the vignette is built.]
data_files <- data(package = "stortingscrape")$results[, "Item"]
data(list = data_files)
The basic extraction of specific data from Stortinget's API revolves around various forms of ID tags. For example, all MPs have a unique ID, all cases have unique IDs, all votes have unique IDs, and so on. For the following example, I will highlight going from a case on economic measures for the Covid pandemic to the party distribution on a specific vote in this case. First, the case was relatively rapidly proposed and treated in the Storting during the early days of June 2021. The case in its entirety can be found on the frontend web page (stortinget.no). You will see the procedure steps from a government proposal, through work in the finance committee, to debate and decision. Let's say a particular proposal under the case caught our eye -- for instance, vote number 61 from the Labor Party asking the government to propose a plan for implementing the International Labor Organization's core conventions into the Human Rights Act (menneskerettighetsloven).
As can be seen from the link to the case itself, we have an ID within the URL:
"85196". This is the case ID. We can use the get_vote()
function from
stortingscrape
to extract all votes on this case:
covid_relief <- get_vote("85196")
We now have a data frame with 71 votes over 22 variables. The data structure for some selected variables looks like this:
head(covid_relief[, c("case_id", "vote_id", "n_for", "n_against", "adopted")])
As we are interested in the result of proposal 217 from the Labor Party, we can extract the ID of this particular vote from our data:
covid_relief$vote_id[which(grepl("217", covid_relief$vote_topic))]
To get the personal MP vote results for this particular vote, we can use the
get_result_vote()
function:^[I have not decided if data values
should be translated or not. In this case, "for" is "for", "mot" is
"against", and "ikke_tilstede" is "absent".]
covid_relief_result <- get_result_vote("17689")
head(covid_relief_result[, c("vote_id", "mp_id", "party_id", "vote")])
From looking only at the first six rows of the data, the readers who know the Norwegian political system will suspect that this vote was an opposition versus government vote, but we can also easily get the distribution of votes by party:
table(covid_relief_result$party_id, covid_relief_result$vote) |> addmargins()
As suspected, the vote was divided between the opposition (A, MDG, R, SP, and SV) and government parties (H, KrF, V, and FrP), and was not adopted by a thin margin of 2 votes. Of course, this is a minimal example, but I will highlight more methods for extracting multiple votes below.
Below, I show two examples of sequentially extracting data of interest.
Most of the mentioned IDs for Stortinget's data are not only extractable from
the frontend web-page, but also from the back-end API. These data can be
retrieved by various forms of parliamentary period or session specific functions
in stortingscrape
. In this section, I will show how to get all MPs for a
specific parliamentary period and all interpellations for a parliamentary
session.
First, however, I note that IDs for periods and sessions are accessed through two core functions in the package:
parl_periods <- get_parlperiods()
parl_sessions <- get_parlsessions()
tail(parl_periods[, c("id", "years")])
tail(parl_sessions[, c("id", "years")])
The parliamentary period IDs are mainly used for MP data; Norwegian MPs are elected for four-year terms, with no constitutional arrangement for snap elections. The MP data also stretch much further back in time than most of the other data in the API:
parl_periods$id[nrow(parl_periods)]
mps4549 <- get_parlperiod_mps("1945-49")
head(mps4549[, c("mp_id", "county_id", "party_id", "period_id")])
From these data, it is a short step to extracting richer data on individual MPs, as will be demonstrated below.
Content data, however, use parliamentary session IDs rather than period IDs.
These functions follow the naming pattern get_session_*()
. For
example, we can access all interpellations from the 2002-2003 session with the
get_session_questions()
function:
interp0203 <- get_session_questions("2002-2003", q_type = "interpellasjoner")
dim(interp0203)
Here, we have 22 interpellations over 26 different variables. Unfortunately, the
API only gives the question and not the answer for the different types of
question requests. Retrieving the answers is a daunting task, because they
are only accessible through the unstandardized get_publication()
function.
Next, I showcase how to go from cases in a session, through extracting a case of interest and vote results, to vote matrices for that case.
First, I extract all cases in the 2019-2020 session:
cases <- get_session_cases("2019-2020")
The cases
object will here contain all cases treated in the 2019-2020
parliamentary session. Do note that cases
is a list of 4 elements
($root
, $topics
, $proposers
, and $spokespersons
). In the following, I use
the case ID in $root
to access vote information for a case -- in this example
the 48th row in the data:^[I will note that it is possible to extract
vote information on all cases by either using the apply()
family or
control flow constructs available in R. However, in this case,
calling the API 616 (nrow(cases[["root"]])
) times requires pausing
between calls (with the good_manners
argument). This will increase
running time substantially.]
# The case titles are, unfortunately, not translated
cases$root$title_short[48]
vote <- get_vote(cases$root$id[48])
vote[, c("case_id", "vote_id", "alternative_vote", "n_for", "n_absent", "n_against")]
The output gives us a data frame of three votes over 22 variables, one of which is
the vote ID for each of the votes. We can use this variable to retrieve rollcall
data, using the get_result_vote()
function:
vote_result <- lapply(vote$vote_id, get_result_vote, good_manners = 5)
names(vote_result) <- vote$vote_id
vote_result <- do.call(rbind, vote_result)
head(vote_result[, 3:ncol(vote_result)])
And make an overall proportion table over party distribution for the three votes:
table(vote_result$vote, vote_result$party_id, dnn = c("Vote result", "Party ID")) |>
  prop.table(margin = 2) |>
  round(digits = 2)
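To make the role of margin = 2 in the proportion table concrete, here is a small self-contained sketch on made-up data. The toy data frame and its values are assumptions for illustration; only the column names vote and party_id follow the data shown above.

```r
# Made-up roll call data: three MPs from each of two parties
toy <- data.frame(
  vote = c("for", "for", "mot", "mot", "mot", "ikke_tilstede"),
  party_id = c("A", "A", "H", "H", "A", "H"),
  stringsAsFactors = FALSE
)

# Rows are vote results, columns are parties
tab <- table(toy$vote, toy$party_id, dnn = c("Vote result", "Party ID"))

# margin = 2 divides each cell by its column total, so every party's
# column gives the share of that party's MPs per vote result
props <- round(prop.table(tab, margin = 2), digits = 2)
props
```

With margin = 1 the proportions would instead be computed within each vote result, showing how each result is split across parties.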
In this vignette, I have presented the philosophy, scope, usage, and workflow of the
stortingscrape
package for R. In sum, stortingscrape
makes retrieving data
from the Norwegian parliament (Stortinget) more accessible through the back-end
API (data.stortinget.no). One core philosophy of
the package is to let the user tailor the data to one's needs, while at the same
time minimizing overlapping data. The scope of the package ranges from
general data on the parliament itself (rules, session info, committees, etc.)
to data on the parties, biographies of the MPs, questions, hearings, debates,
votes, and more.
A list of all functions and their descriptions can be found in the package documentation within R or on GitHub.