knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.width = 8,
  fig.height = 6,
  out.width = "100%",
  out.height = "100%",
  dpi = 600,
  fig.align = "center",
  message = FALSE,
  warning = FALSE
)
library(dbtprotokoll)
library(tidytext)
library(stopwords)
library(ggplot2)
library(tidyr)
library(tibble)
library(dplyr)
library(jsonlite)
library(curl)
library(magrittr)
library(corrplot)
library(ggcorrplot)
Hypothesis: the political fractions use a different tone and different words in their speeches. This should be observable in our data.
We start by loading a protocol dataset, consisting of previously parsed XML protocols, from disk into our global environment.
load("dataset.RData")
We join our paragraphs and speakers tibbles by their speaker id before tidying up the data. Tidying this dataset means pivoting the speech content so that every row contains exactly one word. This step blows the row count up from roughly 250,000 to about 14,100,000 rows, which is why we join first and tokenize afterwards: the tibbles that have to be joined stay small.
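As a toy illustration (not part of the actual pipeline), this is the one-word-per-row shape that unnest_tokens() produces:

# Toy example: unnest_tokens() turns one row of text into one row per word,
# lower-casing the words and carrying the remaining columns along.
tibble(speaker_id = 1, content = "Sehr geehrte Damen und Herren") %>%
  unnest_tokens(word, content)
#> 5 rows: one word per row, speaker_id repeated in every row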
We start with some simple exploratory steps to get a feeling for our dataset: we count word occurrences and plot the most frequent ones in a simple bar chart.
speech <- protocols$paragraphs

# Note: the ids in protocols$speakers still need cleaning up,
# as some ids are used multiple times.
speech_and_speaker <- merge(
  x = protocols$paragraphs,
  y = protocols$speakers,
  by.x = "speaker_id",
  by.y = "id",
  all.x = TRUE
)

# One word per row ("tidy text" format), dropping rows without a fraction.
tidy_speeches <- speech_and_speaker %>%
  drop_na(fraktion) %>%
  unnest_tokens(word, content)

raw_word_occurences <- tidy_speeches %>%
  count(word, sort = TRUE)

raw_word_occurences %>%
  top_n(40) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
Let's remove common stop words, as provided by the stopwords package, from the speeches.
stop_german <- data.frame(
  word = stopwords::stopwords("german", source = "stopwords-iso"),
  stringsAsFactors = FALSE
)

tidy_speeches_no_stop <- tidy_speeches %>%
  anti_join(stop_german, by = "word")

word_occurences <- tidy_speeches_no_stop %>%
  count(word, sort = TRUE)

word_occurences %>%
  top_n(40) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
There is still a lot of noise in this graph: words that are used in greetings or are simply part of parliamentary procedure. Let's remove the worst offenders from the dataset.
parliamentary_stop_words <- tibble(
  value = c(
    "herr", "herren", "damen", "frau", "fraktion", "antrag",
    "kollegen", "kolleginnen", "liebe", "kollege", "kollegin",
    "frage", "präsident", "geehrte", "geehrten", "herzlich",
    "herzliches", "herzlichen", "dr"
  )
)

tidy_speeches_no_parliamentary_stop <- tidy_speeches_no_stop %>%
  anti_join(parliamentary_stop_words, by = c("word" = "value"))

word_occurences <- tidy_speeches_no_parliamentary_stop %>%
  count(word, sort = TRUE)

word_occurences %>%
  top_n(40) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
  geom_col() +
  xlab(NULL) +
  coord_flip()
Much better. A computational linguist might be able to drop only common phrases from the dataset instead of removing these words altogether, but for now we are limited to this blunter approach.
Let's compare the different fractions. Can we identify significant differences in word usage?
word_occurences_per_party <- tidy_speeches_no_parliamentary_stop %>%
  select(fraktion, word) %>%
  group_by(fraktion) %>%
  count(word, sort = TRUE)

top_words_per_party <- word_occurences_per_party %>%
  top_n(5) %>%
  group_split()

top_words_per_party
We can see differences, but this approach is limited by the wildly varying n values across fractions, which makes it hard to draw conclusions. We need a relative measure instead.
We start by reshaping our data into a wider format, with one word per row and the matching occurrence counts in one column per fraction. We also drop the fraction "fraktionslos": its speeches come mostly from external speakers, and its very low n values could easily skew the results.
# Collapse the duplicate rows per word into one row by taking the first
# non-missing value per column.
coalesce_all_columns <- function(df) {
  return(coalesce(!!!as.list(df)))
}

word_occurences_per_party_wide <- word_occurences_per_party %>%
  pivot_wider(names_from = fraktion, values_from = n) %>%
  select(-fraktionslos) %>%
  ungroup() %>%
  group_by(word) %>%
  summarise_all(coalesce_all_columns) %>%
  na.omit()

word_occurences_per_party_wide %>%
  arrange(
    desc(`cdu/csu`), desc(afd), desc(spd),
    desc(fdp), desc(dielinke), desc(`bündnis90/diegrünen`)
  ) %>%
  head(10)
We use the chi-square test and its standardized residuals to test for an association between words and fractions. A positive residual means a word occurs more often in a fraction's speeches than would be expected if words and fractions were independent. Sorting the words by residual per fraction therefore shows which words a fraction uses especially often.
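For reference: roughly speaking, the standardized residual that chisq.test() reports (chisq$stdres below) for word $i$ and fraction $j$ is

$$
r_{ij} = \frac{O_{ij} - E_{ij}}{\sqrt{E_{ij}\,\bigl(1 - \tfrac{n_{i\cdot}}{n}\bigr)\bigl(1 - \tfrac{n_{\cdot j}}{n}\bigr)}}
$$

where $O_{ij}$ is the observed count, $E_{ij} = n_{i\cdot}\, n_{\cdot j} / n$ the expected count under independence, and $n_{i\cdot}$, $n_{\cdot j}$, $n$ are the word total, the fraction total, and the grand total.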
chi_data <- word_occurences_per_party_wide %>%
  as.data.frame()
rownames(chi_data) <- chi_data$word

chisq <- chisq.test(chi_data[-1])
res <- as_tibble(chisq$stdres, rownames = "word")
names <- res$word
entries <- 20

top_afd <- res %>%
  arrange(desc(afd)) %>%
  head(entries) %>%
  column_to_rownames(var = "word") %>%
  as.matrix()

plot_matrix <- function(x, name) {
  return(
    ggcorrplot(x, title = name) +
      scale_fill_gradient2(
        limit = c(min(x) * 1.5, max(x) * 1.5),
        low = "blue", high = "red", mid = "white",
        midpoint = mean(x)
      )
  )
}

afd_plot <- plot_matrix(top_afd, "AfD")
afd_plot
This plot shows that the AfD uses words such as "Bürger", "Merkel", "Altparteien", "Regierung", "EU", "Migranten", etc. more often than every other fraction. Similar plots for the other fractions are created below.
top_cdu <- res %>%
  arrange(desc(`cdu/csu`)) %>%
  head(entries) %>%
  column_to_rownames(var = "word") %>%
  as.matrix()
cdu_plot <- plot_matrix(top_cdu, "CDU")

top_spd <- res %>%
  arrange(desc(`spd`)) %>%
  head(entries) %>%
  column_to_rownames(var = "word") %>%
  as.matrix()
spd_plot <- plot_matrix(top_spd, "SPD")

top_fdp <- res %>%
  arrange(desc(`fdp`)) %>%
  head(entries) %>%
  column_to_rownames(var = "word") %>%
  as.matrix()
fdp_plot <- plot_matrix(top_fdp, "FDP")

top_gruene <- res %>%
  arrange(desc(`bündnis90/diegrünen`)) %>%
  head(entries) %>%
  column_to_rownames(var = "word") %>%
  as.matrix()
gruene_plot <- plot_matrix(top_gruene, "Bündnis 90 / Die Grünen")

top_dielinke <- res %>%
  arrange(desc(`dielinke`)) %>%
  head(entries) %>%
  column_to_rownames(var = "word") %>%
  as.matrix()
dielinke_plot <- plot_matrix(top_dielinke, "Die Linke")

cdu_plot
spd_plot
fdp_plot
gruene_plot
dielinke_plot