knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  out.width = "100%",
  out.height = "100%",
  dpi = 600,
  fig.align='center', 
  message=FALSE, 
  warning=FALSE
)
library(dbtprotokoll)
library(tidytext)
library(stopwords)
library(ggplot2)
library(tidyr)
library(tibble)
library(dplyr)
library(jsonlite)
library(curl)
library(magrittr)
library(corrplot)
library(ggcorrplot)

Hypothesis: The parliamentary groups (Fraktionen) use a different tone and different words in their speeches. This should be observable in our data.

We start by loading a protocol dataset, consisting of previously parsed XML protocols, from disk into our global environment.

load("dataset.RData")

Enriching and tidying the data

We join our paragraph and speaker tibbles by their speaker id before tidying up the data. Tidying this dataset means pivoting the speech content so that every row contains exactly one word. This step blows the row count up from roughly 250,000 to about 14,100,000 rows, which is why we join first, while the tibbles involved are still small.

We start with some simple exploratory steps to get a feeling for our dataset: we count word occurrences and plot the most frequent ones in a simple bar chart.

# note: protocols$speakers contains some ids that are used multiple
# times; these would need cleaning up, since the join below attaches
# every matching speaker record to a paragraph
speech_and_speaker <- merge(x = protocols$paragraphs,
                            y = protocols$speakers,
                            by.x = "speaker_id",
                            by.y = "id",
                            all.x = TRUE)
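
Because some speaker ids are reused, the join above can duplicate paragraph rows. A quick check of how many ids are affected (a minimal sketch, nothing below depends on it):

protocols$speakers %>%
  count(id, sort = TRUE) %>%
  filter(n > 1)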

tidy_speeches <- speech_and_speaker %>% 
  drop_na(fraktion) %>%
  unnest_tokens(word, content)
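
As a rough sanity check on the blow-up described above (the exact numbers depend on the parsed protocols):

nrow(speech_and_speaker)  # on the order of 250,000 paragraph rows
nrow(tidy_speeches)       # on the order of 14,100,000 word rows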



raw_word_occurences <- tidy_speeches %>%
  count(word, sort = TRUE)

raw_word_occurences %>%
  top_n(40) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Let's remove common stop words, as provided by the stopwords package, from the speeches.

stop_german <- data.frame(word = stopwords::stopwords("german", 
                                                      source = "stopwords-iso"), 
                          stringsAsFactors = FALSE)

tidy_speeches_no_stop <- tidy_speeches %>% anti_join(stop_german, by = c("word"))

word_occurences <- tidy_speeches_no_stop %>%
  count(word, sort = TRUE)

word_occurences %>%
  top_n(40) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

There is still a lot of noise in this graph: words that are used in greetings or are simply part of parliamentary procedure. Let's remove the worst offenders from the dataset.

parliamentary_stop_words <- tibble(word = c("herr", "herren", "damen", "frau",
                                            "fraktion", "antrag", "kollegen",
                                            "kolleginnen", "liebe", "kollege",
                                            "kollegin", "frage", "präsident",
                                            "geehrte", "geehrten", "herzlich",
                                            "herzliches", "herzlichen", "dr"))

tidy_speeches_no_parliamentary_stop <- tidy_speeches_no_stop %>% 
                                       anti_join(parliamentary_stop_words, 
                                                 by = "word")

word_occurences <- tidy_speeches_no_parliamentary_stop %>%
  count(word, sort = TRUE)

word_occurences %>%
  top_n(40) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()

Much better. A computational linguist might be able to drop only common phrases from the dataset instead of dropping words altogether, but for now we are limited to this blunter approach.
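
One direction such a phrase-aware filter could take is tokenizing into bigrams with tidytext and inspecting the most frequent ones; a minimal sketch (the variable names are illustrative):

# surface frequent two-word phrases, e.g. greetings and procedural formulas
common_phrases <- speech_and_speaker %>%
  drop_na(fraktion) %>%
  unnest_tokens(bigram, content, token = "ngrams", n = 2) %>%
  count(bigram, sort = TRUE)

head(common_phrases, 20)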

Word occurrences per party

Let's compare the different parliamentary groups. Can we identify significant differences in word usage?

word_occurences_per_party <- tidy_speeches_no_parliamentary_stop %>%
  select(fraktion, word) %>%
  group_by(fraktion) %>%
  count(word, sort = TRUE)

top_words_per_party <- word_occurences_per_party %>%
  top_n(5) %>%
  group_split()

top_words_per_party

We can see differences, but this approach is limited by the wildly varying n values, which makes it difficult to draw conclusions. We need a more relative approach.
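
One simple relative view, before the chi-squared analysis below, is each word's share of its group's total word count; a minimal sketch (slice_max needs dplyr >= 1.0):

# within-group proportions; sum(n) is evaluated per group because
# word_occurences_per_party is still grouped by fraktion
word_occurences_per_party %>%
  mutate(share = n / sum(n)) %>%
  slice_max(share, n = 5)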

We start off by reshaping our data into a wider format: one word per row, with the matching occurrence counts in one column per parliamentary group. We also drop the group "fraktionslos", which consists mostly of speeches by external speakers; its very low n values could easily skew the results.

# returns the first non-NA value of a vector; applied per column and
# per word group below, this collapses partial rows into one
coalesce_all_columns <- function(df) {
  return(coalesce(!!! as.list(df)))
}
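
To see what this helper does in the summarise step below, consider a toy example with made-up counts:

# two partial rows for the same word collapse into one complete row
toy <- tibble(word = c("zukunft", "zukunft"),
              spd  = c(10, NA),
              fdp  = c(NA, 7))

toy %>%
  group_by(word) %>%
  summarise_all(coalesce_all_columns)
#> word = "zukunft", spd = 10, fdp = 7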

word_occurences_per_party_wide <- word_occurences_per_party %>%
  pivot_wider(names_from = fraktion, values_from = n) %>%
  select(-fraktionslos) %>%
  ungroup() %>%
  group_by(word) %>% 
  summarise_all(coalesce_all_columns) %>%
  na.omit()  # keep only words that occur in every remaining group

word_occurences_per_party_wide %>% 
  arrange(desc(`cdu/csu`), desc(afd), desc(spd), desc(fdp), desc(dielinke), desc(`bündnis90/diegrünen`))  %>% 
  head(10)  

We use the chi-squared test and its standardized residuals to test for an association between words and parliamentary groups. A positive residual means the observed occurrences are larger than expected under independence. We can therefore sort the words by residual per group to see which words a group uses especially often.

chi_data <- word_occurences_per_party_wide %>% as.data.frame()
rownames(chi_data) <- chi_data$word

chisq <- chisq.test(chi_data[-1])

res <- as_tibble(chisq$stdres, rownames = "word")
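
For reference, chisq$stdres follows the usual formula (observed - expected) / sqrt(V), where V shrinks the expected count by the row and column margins. A minimal sketch reproducing it by hand:

# recompute the standardized residuals from the test object
obs  <- chisq$observed
expd <- chisq$expected
n    <- sum(obs)
v    <- expd * outer(1 - rowSums(obs) / n, 1 - colSums(obs) / n)
all.equal(chisq$stdres, (obs - expd) / sqrt(v))  # should be TRUE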

entries <- 20

top_afd <- res %>%
  arrange(desc(afd)) %>%
  head(entries) %>%
  column_to_rownames(var = "word") %>%
  as.matrix()

# plot a matrix of residuals as a heat map; the widened limits keep
# the diverging fill scale from clipping at the extremes
plot_matrix <- function(x, name){
  return(ggcorrplot(x, title = name) + 
           scale_fill_gradient2(limit = c(min(x)*1.5, max(x)*1.5), 
                                low = "blue", 
                                high = "red", 
                                mid = "white", 
                                midpoint = mean(x)))
}

afd_plot <- plot_matrix(top_afd, "AfD")
afd_plot

This plot shows that the AfD uses words like "Bürger", "Merkel", "Altparteien", "Regierung", "EU", "Migranten", etc. more often than any other parliamentary group. Similar graphs for the other parties are created below.

top_cdu <- res %>%
  arrange(desc(`cdu/csu`)) %>%
  head(entries) %>%
  column_to_rownames(var = "word") %>%
  as.matrix()

cdu_plot <- plot_matrix(top_cdu, "CDU/CSU")

top_spd <- res %>%
  arrange(desc(`spd`)) %>%
  head(entries) %>%
  column_to_rownames(var = "word") %>%
  as.matrix()

spd_plot <- plot_matrix(top_spd, "SPD")


top_fdp <- res %>%
  arrange(desc(`fdp`)) %>%
  head(entries) %>%
  column_to_rownames(var = "word") %>%
  as.matrix()

fdp_plot <- plot_matrix(top_fdp, "FDP")

top_gruene <- res %>%
  arrange(desc(`bündnis90/diegrünen`)) %>%
  head(entries) %>%
  column_to_rownames(var = "word") %>%
  as.matrix()

gruene_plot <- plot_matrix(top_gruene, "Bündnis 90 / Die Grünen")

top_dielinke <- res %>%
  arrange(desc(`dielinke`)) %>%
  head(entries) %>%
  column_to_rownames(var = "word") %>%
  as.matrix()

dielinke_plot <- plot_matrix(top_dielinke, "Die Linke")

cdu_plot 
spd_plot 
fdp_plot 
gruene_plot
dielinke_plot

