GermaParl

Sys.setenv("CORPUS_REGISTRY" = "")

A Corpus of Plenary Protocols

Parliamentary debates convey the arguments, interpretations and conflicts that shape political decision-making. They are recorded and transcribed by parliamentary administrations with diligence, and published as plenary protocols. These documents are available for long periods of time -- for several decades and more -- and they cover the full breadth of topics that is processed by a political system: Plenary protocols are a valuable resource for research on policy and politics, and for citizens. They are a crucial building block of the public digital archive of democracy.

The digital availability of plenary protocols is excellent and limited at the same time. Documents can be downloaded without technical or legal restrictions as txt, html or pdf documents. But this data format does not match the requirements for digital-era data processing. To exploit the analytical potential of the data, original documents need to be converted into a semi-structured data format (typically XML). This is the essence of the preparation of the GermaParl corpus.

The GermaParl corpus as it has been prepared in the PolMine Project is based on an XMLification of txt and pdf documents. The aim is to make a contribution to the broader development of preparing corpora of plenary protocols -- the naming of GermaParl is inspired by the EuroParl and DutchParl corpus [@DBLP:conf/lrec/MarxS10]. GermaParl is intended to serve as an example how to make corpora available in a sustainable way, meeting the technological standards and requirements of the digital era.

GermaParl is made available in two ways:

The corpus includes all plenary protocols that were published by the German Bundestag between 1996 and 2016. Plain text documents issued by the German Bundestag were considered the best raw data format for corpus preparation and were used whenever they are available. For a period between 2008 and 2010, txt files are not available throughout. To fill the gap, pdf documents were processed. The following sections explain corpus preparation and the data made available with this package in some more detail.

Corpus Preparation

The preparation of the TEI version of GermaParl implements the following workflow:

The preprocessing step is not as trivial as it might seem. Older plain text files are offered by the German Bundestag in all kinds of encodings that are have come out of use. The pdf documents have a two-column layout that is difficult to handle. For coping with the two-column layout of pdf documents, the R package trickypdf has been developed.

The essential instrument for the XMLification is a set of regular expressions to extract relevant metadata (such as legislative period, session number or date), to find the beginning and the end of a debate, the call upon the speakers, and the beginning and end of agenda items. The matches are used to generate the structural annotation of parliamentary speech in the XML document.

Due to the remaining haphazard variations that occur in plenary protocols, the quest is not for a universal battery of regular expressions that would always work without manual checks. Even though it introduces an element of manual work, our solution is to work with lists of mismatches (matches of regular expressions to omit), in combination with a brute-force preprocessing (hard-coded substitutions) that correct obvious errors that are already included in the original version of plenary protocol.

The result of the base XMLification still includes considerable noise. Inconsistencies that occur with names are a particularly serious issue. To obtain consolidated metadata, information that has been extracted is checked (including approximate string matching) against an external data source. We opted for lists of members of parliamentarians, cabinet members and further speakers that are available on Wikipedia (see the page for the 17th Bundestag, for instance).

Easy digital access is not the only justification for this choice. Wikipedia pages for the parliamentary sessions are being taken care of by a team of dedicated volunteers. Furthermore, Wikipedia pages meet permanent public scrutiny, ensuring quality checks on the data quality in a manner traditional printed material does not necessarily ensure.

Annotation

Linguistic Annotation

The XML/TEI version of GermaParl is taken through a pipeline of standard Natural Language Processing (NLP) tasks. Starting with version 1.1.0, Stanford CoreNLP is used for tokenization and part-of-speech (POS) annotation. To add lemmas to the corpus, the TreeTagger is used.

Note that the TreeTagger outputs #unknown# if it cannot successfully lemmatize a wordform. See the tables in the annex to learn about the unknown ratio in the corpus.

Moving to Stanford CoreNLP as the base annotation tool will be the basis for adding further annotation layers to the corpus (such as an annotation of sentences, or named entities) in the future.

Structural Annotation (Metadata)

In the XML/TEI data format, all passages of uninterrupted speech are tagged with metadata, or so-called structural attributes (s-attributes). For instance, parliamentary speeches are often interrupted by interjections - the information whether an utterance is an interjection or an actual speech, is maintained in the corpus. The legislative period, session, date, name of a speaker and his/her parliamentary group are included, among others. The structural annotation is the basis for all kinds of diachronic or synchronic comparisons users may want to perform.

The following table provides short explanations of the s-attributes that are present in the GermaParl corpus.

s_attrs <- list(
  c("lp", "legislative period", "13 to 18"),
  c("session", "session/protocol number", "1 to 253"),
  c("src", "source material for data preparation", "txt or pdf"),
  c("url", "URL", "URL of the original document"),
  c("agenda_item", "agenda item", "number of the agenda item"),
  c("agenda_item_type", "type of agenda item", "debate/question_time/government_declaration/..."),
  c("date", "date of the session", "YYYY-MM-TT (e.g. '2013-06-28')"),
  c("year", "year of the session", "1996 to 2016"),
  c("interjection", "whether contribution is interjection", "TRUE/FALSE"),
  c("role", "role of the speaker", "presidency/mp/government/..."),
  c("speaker", "Name", "speaker name"),
  c("parliamentary_group", "Parliamentary group", "partliamentary group the speaker is affiliated with"),
  c("party", "Party", "party of the speaker")
)
tab <- do.call(rbind, s_attrs)
colnames(tab) <- c("s-attribute", "description", "values")
knitr::kable(tab, format = "markdown")

Using the GermaParl corpus

Getting started - installing GermaParl

The GermaParl R Data Package can be installed from CRAN. It is a "pure R" package. There are no differences whether you install on Linux, MacOS or Windows.

install.packages("GermaParl")

You can then load the GermaParl package.

library(GermaParl)

After installing the GermaParl package, the package only includes a small subset of the GermaParl corpus. The subset serves as sample data and for running package tests. To download the full corpus, use a function to download the full corpus from Zenodo, an open science data repository:

germaparl_download_corpus()

To check whether the installation has been successful, run the following commands.

germaparl_is_installed()
germaparl_get_doi()
germaparl_get_version()

Working with GermaParl

The CWB indexed version if GermaParl can be used with the CWB itself, or with any tool that uses the CWB as a backend (such as CQPweb). However, most technical decisions during corpus preparation had in mind to optimize using the GermaParl corpus in combination with the polmineR package. Please consult the documentation of the polmineR package (README, vignette, manual) to learn how to use polmineR for working with GermaParl.

Some caveats

A set of general remarks may help to avoid pitfalls when working with GermaParl:

Perspectives

GermaParl is not the only corpus of plenary protocols. Apart from PolMine, several further projects have worked on preparing respective corpora. There is an ensuing dialogue among these projects, and GermaParl strives to contribute to this broader development, trying to serve as a example how reproducibility and quality control can be maintained during all steps of corpus preparation, and how corpora can be disseminated in a manner that lowers the barriers of entry for new users.

The resource is under active development. Apart from improving data quality and preparing updates, one important concern is to add further annotation layers:

Hopefully, GermaParl will make qualitative and quantitative research with German parliamentary debates more productive. A final note on user feedback: Improving data quality is an important concern of the PolMine Project. This is why the data is versioned. The resource will benefit from its community of users - feedback that helps to improve data quality is more than welcome!

License

While the raw data, the plenary protocols published by the German Bundestag, are in the public domain, the GermaParl corpus comes with a CC BY-SA license. That means:

BY - Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.

SA - ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.

See the CC Attribution-ShareAlike 4.0 Unported License for further explanations.

Quotation

The 'GermaParl' R package and the 'GermaParl' corpus are two different pieces of research data, with different version numbers, document object identifiers (DOIs) and recommendations for quotation. If you use GermaParl for your research, maximum transparency on the tools you used is attained, when both the package and the corpus is quoted in your publication. To ensure the reproducibility of your research, it is more important to refer to and specify the corpus (including version, DOI) you used.

Blaette, Andreas (2020): GermaParl. Download and Augment the Corpus of Plenary Protocols of the German Bundestag. R package version 1.3.0. https://CRAN.R-project.org/package=GermaParl

Blaette, Andreas (2020): GermaParl. Linguistically Annotated and Indexed Corpus of Plenary Protocols of the German Bundestag. CWB corpus version 1.0.6. https://doi.org/10.5281/zenodo.3735141

NOTE: In an R session, you can get this recommendation how to quote GermaParl by calling the usual citation() function:

citation("GermaParl")

Annex

Corpus data (by electoral period)

knitr::kable(GermaParl::germaparl_by_lp, format = "markdown")

Corpus data (by year)

summary_row <- t(data.frame(colSums(GermaParl::germaparl_by_year)))
rownames(summary_row) <- NULL
content_rows <- GermaParl::germaparl_by_year
rownames(content_rows) <- NULL
tab <- rbind(content_rows, summary_row)
tab[["year"]] <- as.character(tab[["year"]])
tab[nrow(tab), "year"] <- "TOTAL"
tab[nrow(tab), "unknown_share"] <- round(
  sum(GermaParl::germaparl_by_year[["unknown_total"]]) / sum(GermaParl::germaparl_by_year[["size"]]),
  digits = 3
)
colnames(tab)[6:7] <- c("unknown (total)", "unknown (share)")
knitr::kable(tab, format = "markdown")

Data sources

Starting from the 17th legislative period, txt versions of plenary protocols were retrieved from the homepage of the German Bundestag. The following table reports the URLs that have been used to download txt versions of older plenary protocols.

urls <- list(
  c("1996", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/1996/index.htm"),
  c("1997", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/1997/index.htm"),
  c("1998", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/1998/index.htm"),
  c("1999", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/1999/index.htm"),
  c("2000", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/2000/index.htm"),
  c("2001", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/2001/index.htm"),
  c("2002", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/2002/index.html"),
  c("2003", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/2003/index.html"),
  c("2004", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/2004/index.html"),
  c("2005", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/2005/index.html"),
  c("2006", "http://webarchiv.bundestag.de/archive/2008/0912/bic/plenarprotokolle/pp/2006/index.html"),
  c("2007", "http://webarchiv.bundestag.de/archive/2008/0912/bic/plenarprotokolle/pp/2007/index.html"),
  c("2008", "http://webarchiv.bundestag.de/archive/2008/0912/bic/plenarprotokolle/pp/2008/index.html")
)
tab <- do.call(rbind, urls)
colnames(tab) <- c("year", "URL")
knitr::kable(tab, format = "markdown")

References



Try the GermaParl package in your browser

Any scripts or data that you put into this service are public.

GermaParl documentation built on Oct. 23, 2020, 8:27 p.m.