Sys.setenv("CORPUS_REGISTRY" = "")
Parliamentary debates convey the arguments, interpretations and conflicts that shape political decision-making. They are recorded and transcribed by parliamentary administrations with diligence, and published as plenary protocols. These documents are available for long periods of time -- for several decades and more -- and they cover the full breadth of topics that is processed by a political system: Plenary protocols are a valuable resource for research on policy and politics, and for citizens. They are a crucial building block of the public digital archive of democracy.
The digital availability of plenary protocols is excellent and limited at the same time. Documents can be downloaded without technical or legal restrictions as txt, html or pdf documents. But this data format does not match the requirements for digital-era data processing. To exploit the analytical potential of the data, original documents need to be converted into a semi-structured data format (typically XML). This is the essence of the preparation of the GermaParl corpus.
The GermaParl corpus as it has been prepared in the PolMine Project is based on an XMLification of txt and pdf documents. The aim is to make a contribution to the broader development of preparing corpora of plenary protocols -- the naming of GermaParl is inspired by the EuroParl and DutchParl corpus [@DBLP:conf/lrec/MarxS10]. GermaParl is intended to serve as an example how to make corpora available in a sustainable way, meeting the technological standards and requirements of the digital era.
GermaParl is made available in two ways:
The base format of the corpus is a XMLification of the raw data (i.e. the original protocols) that is modelled on the standards of the Text Encoding Initiative (TEI). Releases of the TEI format of GermaParl are available via the GermaParlTEI repository at GitHub. It results from an automatized, fully reproducible pipeline, i.e. raw protocols (pdf and txt file format) have been turned into an XML/TEI-based standard.
A linguistically annotated, indexed and consolidated version of the corpus is disseminated with the GermaParl R data package documented with this vignette. XML/TEI files serve the point of departure, but the version emanates from (a) performing some standard Natural Language Processing (NLP) tasks (such as tokenization, lemmatization, part-of-speech annotation); (b) importing the linguistically annotated data into the Corpus Workbench (CWB) and (c) consolidating the data in a set of postprocessing tasks to remove known errors.
The corpus includes all plenary protocols that were published by the German Bundestag between 1996 and 2016. Plain text documents issued by the German Bundestag were considered the best raw data format for corpus preparation and were used whenever they are available. For a period between 2008 and 2010, txt files are not available throughout. To fill the gap, pdf documents were processed. The following sections explain corpus preparation and the data made available with this package in some more detail.
The preparation of the TEI version of GermaParl implements the following workflow:
Preprocessing: Prepare consolidated UTF-8 plain text documents (ensuring uniformity of encodings, conversion of pdf to txt if necessary);
XMLification: Turn the plain text documents into TEI format: Extraction of metadata, annotation of speakers etc.;
Consolidation: Consolidating speaker names and enriching documents.
The preprocessing step is not as trivial as it might seem. Older plain text files are offered by the German Bundestag in all kinds of encodings that are have come out of use. The pdf documents have a two-column layout that is difficult to handle. For coping with the two-column layout of pdf documents, the R package trickypdf has been developed.
The essential instrument for the XMLification is a set of regular expressions to extract relevant metadata (such as legislative period, session number or date), to find the beginning and the end of a debate, the call upon the speakers, and the beginning and end of agenda items. The matches are used to generate the structural annotation of parliamentary speech in the XML document.
Due to the remaining haphazard variations that occur in plenary protocols, the quest is not for a universal battery of regular expressions that would always work without manual checks. Even though it introduces an element of manual work, our solution is to work with lists of mismatches (matches of regular expressions to omit), in combination with a brute-force preprocessing (hard-coded substitutions) that correct obvious errors that are already included in the original version of plenary protocol.
The result of the base XMLification still includes considerable noise. Inconsistencies that occur with names are a particularly serious issue. To obtain consolidated metadata, information that has been extracted is checked (including approximate string matching) against an external data source. We opted for lists of members of parliamentarians, cabinet members and further speakers that are available on Wikipedia (see the page for the 17th Bundestag, for instance).
Easy digital access is not the only justification for this choice. Wikipedia pages for the parliamentary sessions are being taken care of by a team of dedicated volunteers. Furthermore, Wikipedia pages meet permanent public scrutiny, ensuring quality checks on the data quality in a manner traditional printed material does not necessarily ensure.
The XML/TEI version of GermaParl is taken through a pipeline of standard Natural Language Processing (NLP) tasks. Starting with version 1.1.0, Stanford CoreNLP is used for tokenization and part-of-speech (POS) annotation. To add lemmas to the corpus, the TreeTagger is used.
Note that the TreeTagger outputs #unknown# if it cannot successfully lemmatize a wordform. See the tables in the annex to learn about the unknown ratio in the corpus.
Moving to Stanford CoreNLP as the base annotation tool will be the basis for adding further annotation layers to the corpus (such as an annotation of sentences, or named entities) in the future.
In the XML/TEI data format, all passages of uninterrupted speech are tagged with metadata, or so-called structural attributes (s-attributes). For instance, parliamentary speeches are often interrupted by interjections - the information whether an utterance is an interjection or an actual speech, is maintained in the corpus. The legislative period, session, date, name of a speaker and his/her parliamentary group are included, among others. The structural annotation is the basis for all kinds of diachronic or synchronic comparisons users may want to perform.
The following table provides short explanations of the s-attributes that are present in the GermaParl corpus.
s_attrs <- list( c("lp", "legislative period", "13 to 18"), c("session", "session/protocol number", "1 to 253"), c("src", "source material for data preparation", "txt or pdf"), c("url", "URL", "URL of the original document"), c("agenda_item", "agenda item", "number of the agenda item"), c("agenda_item_type", "type of agenda item", "debate/question_time/government_declaration/..."), c("date", "date of the session", "YYYY-MM-TT (e.g. '2013-06-28')"), c("year", "year of the session", "1996 to 2016"), c("interjection", "whether contribution is interjection", "TRUE/FALSE"), c("role", "role of the speaker", "presidency/mp/government/..."), c("speaker", "Name", "speaker name"), c("parliamentary_group", "Parliamentary group", "partliamentary group the speaker is affiliated with"), c("party", "Party", "party of the speaker") ) tab <- do.call(rbind, s_attrs) colnames(tab) <- c("s-attribute", "description", "values") knitr::kable(tab, format = "markdown")
The GermaParl R Data Package can be installed from CRAN. It is a "pure R" package. There are no differences whether you install on Linux, MacOS or Windows.
install.packages("GermaParl")
You can then load the GermaParl package.
library(GermaParl)
After installing the GermaParl package, the package only includes a small subset of the GermaParl corpus. The subset serves as sample data and for running package tests. To download the full corpus, use a function to download the full corpus from Zenodo, an open science data repository:
germaparl_download_corpus()
To check whether the installation has been successful, run the following commands.
germaparl_is_installed() germaparl_get_doi() germaparl_get_version()
The CWB indexed version if GermaParl can be used with the CWB itself, or with any tool that uses the CWB as a backend (such as CQPweb). However, most technical decisions during corpus preparation had in mind to optimize using the GermaParl corpus in combination with the polmineR package. Please consult the documentation of the polmineR package (README, vignette, manual) to learn how to use polmineR for working with GermaParl.
A set of general remarks may help to avoid pitfalls when working with GermaParl:
Plenary protocols meticulously report interjections. To maintain the integrity of the original documents, interjections are annotated in the corpus. By using the s-attribute 'interjection' that assumes the values TRUE
or FALSE
, you can limit your analysis to speech or interjections.
Plenary protocols report membership in a parliamentary group only. Information on party membership is derived from external data sources and written back to the corpus. More specifically, the s-attribute 'parliamentary_group' refers to the parliamentary group, 'party' refers to the party a speaker is a member of. To distinguish between CDU and CSU speakers, using the s-attribute 'party' is necessary.
The GermaParl corpus is a corpus of the debates and speeches that were actually given in the German Bundestag. Speeches that were only included in the printed protocol (i.e. included in the annex to a protocol) are not yet covered by corpus preparation.
GermaParl is not the only corpus of plenary protocols. Apart from PolMine, several further projects have worked on preparing respective corpora. There is an ensuing dialogue among these projects, and GermaParl strives to contribute to this broader development, trying to serve as a example how reproducibility and quality control can be maintained during all steps of corpus preparation, and how corpora can be disseminated in a manner that lowers the barriers of entry for new users.
The resource is under active development. Apart from improving data quality and preparing updates, one important concern is to add further annotation layers:
Moving to Stanford CoreNLP as the core tool for the NLP pipe (a step that has been taken with version 1.1.0) provides the basis for adding sentence annotation, and a basic annotation of named entities in the future.
Classification based on a supervised learning approach will be superior to unsupervised learning. In a CLARIN-funded project, training data has been prepared to obtain the basis for a classification of debates according to a (somewhat expanded) classification scheme of the Comparative Agendas Project (CAP). An annotation of the hand-coded text passages will be included in the corpus.
Hopefully, GermaParl will make qualitative and quantitative research with German parliamentary debates more productive. A final note on user feedback: Improving data quality is an important concern of the PolMine Project. This is why the data is versioned. The resource will benefit from its community of users - feedback that helps to improve data quality is more than welcome!
While the raw data, the plenary protocols published by the German Bundestag, are in the public domain, the GermaParl corpus comes with a CC BY-SA license. That means:
BY - Attribution — You must give appropriate credit, provide a link to the license, and indicate if changes were made. You may do so in any reasonable manner, but not in any way that suggests the licensor endorses you or your use.
SA - ShareAlike — If you remix, transform, or build upon the material, you must distribute your contributions under the same license as the original.
See the CC Attribution-ShareAlike 4.0 Unported License for further explanations.
The 'GermaParl' R package and the 'GermaParl' corpus are two different pieces of research data, with different version numbers, document object identifiers (DOIs) and recommendations for quotation. If you use GermaParl for your research, maximum transparency on the tools you used is attained, when both the package and the corpus is quoted in your publication. To ensure the reproducibility of your research, it is more important to refer to and specify the corpus (including version, DOI) you used.
Blaette, Andreas (2020): GermaParl. Download and Augment the Corpus of Plenary Protocols of the German Bundestag. R package version 1.3.0. https://CRAN.R-project.org/package=GermaParl
Blaette, Andreas (2020): GermaParl. Linguistically Annotated and Indexed Corpus of Plenary Protocols of the German Bundestag. CWB corpus version 1.0.6. https://doi.org/10.5281/zenodo.3735141
NOTE: In an R session, you can get this recommendation how to quote GermaParl by calling the usual citation()
function:
citation("GermaParl")
knitr::kable(GermaParl::germaparl_by_lp, format = "markdown")
summary_row <- t(data.frame(colSums(GermaParl::germaparl_by_year))) rownames(summary_row) <- NULL content_rows <- GermaParl::germaparl_by_year rownames(content_rows) <- NULL tab <- rbind(content_rows, summary_row) tab[["year"]] <- as.character(tab[["year"]]) tab[nrow(tab), "year"] <- "TOTAL" tab[nrow(tab), "unknown_share"] <- round( sum(GermaParl::germaparl_by_year[["unknown_total"]]) / sum(GermaParl::germaparl_by_year[["size"]]), digits = 3 ) colnames(tab)[6:7] <- c("unknown (total)", "unknown (share)") knitr::kable(tab, format = "markdown")
Starting from the 17th legislative period, txt versions of plenary protocols were retrieved from the homepage of the German Bundestag. The following table reports the URLs that have been used to download txt versions of older plenary protocols.
urls <- list( c("1996", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/1996/index.htm"), c("1997", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/1997/index.htm"), c("1998", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/1998/index.htm"), c("1999", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/1999/index.htm"), c("2000", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/2000/index.htm"), c("2001", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/2001/index.htm"), c("2002", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/2002/index.html"), c("2003", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/2003/index.html"), c("2004", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/2004/index.html"), c("2005", "http://webarchiv.bundestag.de/archive/2005/1205/bic/plenarprotokolle/pp/2005/index.html"), c("2006", "http://webarchiv.bundestag.de/archive/2008/0912/bic/plenarprotokolle/pp/2006/index.html"), c("2007", "http://webarchiv.bundestag.de/archive/2008/0912/bic/plenarprotokolle/pp/2007/index.html"), c("2008", "http://webarchiv.bundestag.de/archive/2008/0912/bic/plenarprotokolle/pp/2008/index.html") ) tab <- do.call(rbind, urls) colnames(tab) <- c("year", "URL") knitr::kable(tab, format = "markdown")
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.