#knitr::opts_chunk$set(echo = FALSE)
In the report below, we performed text sentiment analysis on the ECB press conference speeches. Four different sentiment dictionaries were used for comparison, with words identified as positive contributing a positive score and negative words contributing a negative score. The analysis is done on unigrams (single word) basis, with adjustment for negation words through bigrams. The sentiment indicator is then built up from the summation of the scores and its change measured across time.
knitr::opts_chunk$set(echo = FALSE) library(dplyr) library(tidytext) library(lubridate) library(ggplot2) library(tidyverse) library(GGally) source("text_analytics.R") # Stores many of the functions used below
The transcript of ECB press conference speeches were separately scraped from the ECB website, and we start by reading them into dataframes. We filter out text which are questions asked by the press audience
df <- read.csv("data/ecb_speeches.csv", stringsAsFactors = FALSE) %>% mutate(date = ymd(date), type = as.factor(type), speaker = as.factor(speaker)) # Extract press speech and answers df_ecb <- df %>% filter(type %in% c("speech", "answer")) %>% group_by(date) %>% summarise(text = paste0(text, collapse = ".")) str(df_ecb)
The next step is to convert paragraphs of text to individual words, in a process known as tokenising. We start by analysing single words, also referred to as unigrams.
In addition, we remove stop words which are words used very often which convey no meaning such as "is", "the" etc. We added our own custom stop words to remove years.
Performing a simple count analysis on the unigrams, we observe that words like "euro", "growth" are used very often in the speeches.
#Remove years as stop words custom_stop_words = bind_rows(data_frame(word = c(as.character(1980:2030)), lexicon = c("custom")), stop_words) ecb_unigrams <- df_ecb %>% make_ngrams(custom_stop_words = custom_stop_words) ecb_unigrams %>% count_ngrams %>% filter(n > 1000) %>% mutate(word = reorder(word, n)) %>% ggplot(aes(word, n)) + geom_col() + xlab(NULL) + coord_flip()
After obtaining the count of the words, we now need to find their associated sentiments. We use four dictionaries to measure sentiment, however these dictionaries are not optimised for central bank or macroeconomic analysis and is sub optimal.
| Dictionary | Description | |------------|-------------| | AFINN | Scale of -5 (very negative) to +5 (very positive) | | Loughran | For financial applications, various types of sentiments | | NRC | Crowd sourced emotion-based | | Bing | Labels as positive or negative |
From an extract of the top words with associated sentiments, we see that the sentiments associated with each dictionary is quite varied.
# Compare how words categorization differs across sets ecb_unigrams %>% count_ngrams %>% add_sentiments(c("afinn", "loughran", "nrc", "bing")) %>% compare_words_sentiment
We then calculate a sentiment score by adding to the score when a positive word appears and subtracting when a negative word appears. In the case of the AFINN dictionary, we multiple the word count n
by the sentiment score before summing them together. Here we observe that each indicator is fairly diferent, suggesting that the dictionary has significant impact on your final results.
Using zero as a baseline, we observe that most of the dictionaries tend to give a positive overall sentiment to the minutes, but Loughran views the speeches negatively in most cases
ecb_scores <- ecb_unigrams %>% add_sentiments(all = TRUE) %>% group_by(date) %>% calc_sentiment_score(wt = "n") g <- ecb_scores %>% ggplot(aes(date, score, colour = method)) g + geom_line() + geom_hline(yintercept = 0) + facet_wrap(~method, ncol = 1, scales = "free_y")
To account for the differences, we extracted the top contributors to scores both negative and positive across the dictionaries, and see that they are indeed picking out quite different words
ecb_unigrams %>% add_sentiments(all = TRUE) %>% group_by(word) %>% calc_sentiment_score(wt = "n") %>% group_by(method) %>% top_n(10, abs(score)) %>% ungroup %>% mutate(word = reorder(word, score)) %>% ggplot(aes(word, score, fill=factor(score > 0))) + geom_col(show.legend = FALSE) + facet_wrap(~method, scales = "free_y") + coord_flip()
It is arguable that a pure word count is an accurate reflection of the sentiments in a set of minutes. As words like growth or inflation may be used very often in any ECB speech. Hence we may wish to adjust for the frequency of these terms appearing in the documents.
We use term frequency - inverse document frequency (tf-idt) as a weight. tf-idf measures how unique a particular word is in the other documents. If it is relatively unique, it is given a higher weight, while if it appears frequently in every single document, it would be given a smaller weight.
Below we see words with high weights due to the uniqueness of their appearance for each of the years
unigrams_wt <- ecb_unigrams %>% group_by(date) %>% count_ngrams %>% bind_tf_idf(word, date, n) %>% ungroup unigrams_wt %>% ungroup %>% filter(year(date) > 2008) %>% group_by(year = as.factor(year(date))) %>% top_n(15, tf_idf) %>% ungroup %>% ggplot(aes(factor(word, levels = rev(unique(word))), tf_idf, fill = year)) + geom_col(show.legend = FALSE) + labs(x = NULL, y = "tf-idf") + facet_wrap(~year, ncol = 3, scales = "free") + coord_flip()
With the weights, we do a multiplication of the weights with the sentiment to give us the weighted score. To compare with the unweighted scores, we standardize both the weighted and unweighted scores to compare them on the same scale. We see that directionally, both looks quite similar.
weighted_scores <- unigrams_wt %>% add_sentiments(all = TRUE) %>% group_by(date, method) %>% calc_sentiment_score(wt = "tf_idf") combined_scores <- bind_rows(ecb_scores %>% mutate(weight = "unwt"), weighted_scores %>% mutate(weight = "wt")) # Centre all scores and plot weight and unweighted on the same chart combined_scores %>% group_by(method, weight) %>% mutate(score = scale(score)) %>% ggplot(aes(date, score, colour = weight)) + geom_line() + facet_wrap(~method, ncol = 1, scales = "free_y") + geom_hline(yintercept = 0, lwd=0.3, lty = 5) + ggtitle("Standardized sentiment scores")
Here are the list of top contributors to weighted score for each of the recent years since 2011
unigrams_wt %>% add_sentiments(all = TRUE) %>% group_by(date, word) %>% calc_sentiment_score(wt = "tf_idf") %>% group_by(year = year(date), word, method) %>% summarise(score = sum(score)) %>% group_by(year, method) %>% filter(year > 2011) %>% top_n(10, abs(score)) %>% ungroup %>% ggplot(aes(factor(word, levels = unique(word[order(score)])), score, fill = factor(sign(score)))) + geom_col(show.legend = FALSE) + labs(x = NULL, y = "tf-idf") + facet_grid(year~method, scales = "free") + coord_flip()
Correlations are quite weak across dictionaries as well as when considering weighted or unweighted
combined_scores %>% mutate(method = paste(method, weight)) %>% select(-negative, -positive, -weight) %>% spread(method, score) %>% select(-date) %>% ggpairs
Unigram analysis may not be sufficient, as there are words which are used with opposite intentions, such as "negative growth" or "no money". Hence we extend the analysis to bigrams, which are essentially words paired together with their adjacent words.
From an algorithm perspective, you may have noticed that to do such bigram analysis, we cannot remove stop words when tokenizing, as words like "not", "no" are considered stop words.
We identify common negation words such as "not", "no" etc.
From the identified word list, we do see that in the unigram analysis, we would have misclassified the sentiment of these words
bigrams_separated <- df_ecb %>% make_ngrams(2, remove_stop_words = FALSE) negation_words <- c("not", "no", "never", "without", "negative", "weak") negated_words <- bigrams_separated %>% filter(word1 %in% negation_words) %>% ungroup() negated_words %>% rename(word = word2) %>% add_sentiments %>% filter(sentiment %in% c("positive", "negative"))
To adjust the negated words, we have to first remove its original impact, then add the new impact, which is equivalent to adding twice its new established impact
################## # adjust weighted scores ################## # Calculate negative scores from negated words negative_scores_by_word <- negated_words %>% rename(n2 = n, word = word2) %>% inner_join(unigrams_wt, by = c("date", "word")) %>% mutate(tf_idf2 = (n2 * 2 * -1)/n * tf_idf) %>% # We subtract twice to adjust for the "wrong" impact on the raw score add_sentiments(all = TRUE) %>% group_by(date, word, word1) %>% calc_sentiment_score(wt = "tf_idf2") # Total negative scores by date negative_scores <- negative_scores_by_word %>% group_by(date, method) %>% summarise(score = sum(score)) # Join negative scores to original weighted scores, and recalculated combined scores bigram_adjusted_scores <- negative_scores %>% # select(-negative, -positive) %>% rename(score2 = score) %>% right_join(weighted_scores, by = c("date", "method")) %>% mutate(score3 = ifelse(is.na(score2), score, score + score2)) # Plot combined weighted scores scores bigram_adjusted_scores %>% select(date, method, score, score2, score3) %>% gather(score_type, scores, -date, -method) %>% ggplot(aes(date, scores, colour = score_type)) + geom_line() + facet_wrap(~method, ncol=1, scales = "free_y") + geom_hline(yintercept = 0, lwd=0.3, lty = 5) # Plot words with negated contributions # negative_scores_by_word %>% # group_by(word, word1) %>% # summarise(score = sum(score)) %>% # top_n(20, desc(abs(score))) %>% # mutate(full_word = paste(word1, word)) %>% # ggplot(aes(score, full_word, fill = score > 0)) %>% # geom_col() %>% # coord_flip() #################### # adjust unweighted scores #################### #negated_words %>% # weighted_scores # weighted_scores %>% inner_join(negative_scores, by = c("date", "method")) # # negated_words %>% # rename(word = word2) %>% # add_sentiments(all = TRUE) %>% # group_by(date, word1, word) %>% # calc_sentiment_score(wt = "") %>% # mutate(contribution = n * score) %>% # group_by(word1) %>% # top_n(10, desc(abs(contribution))) %>% # arrange(contribution) %>% # ungroup %>% # mutate(word2 = reorder(word2, contribution)) %>% # ggplot(aes(word2, n * score, fill = n * score > 0)) + # geom_col(show.legend = FALSE) + # facet_wrap(~word1, ncol = 2, scales = "free_y") + # xlab("Words preceded by \"not\"") + # ylab("Sentiment score * number of occurrences") + # coord_flip() + # scale_color_manual(labels = c("original", "negatives", "adjusted"), values = c("red", "green", "blue"))
bonds_raw <- read.csv("data/asset_data.csv") bonds_data <- bonds_raw %>% mutate_at(.vars = vars(-Date), .funs = funs(ret = lead(.,0)/lag(., 1)-1)) %>% mutate(date = mdy(Date)) %>% select(date, ends_with("ret")) all_data <- bigram_adjusted_scores %>% group_by(method) %>% mutate(adj_score = score3) %>% select(date, method, adj_score) %>% mutate(adj_score = c(NA,diff(adj_score))) %>% # Measure changes in sentiment between speeches spread(method, -date) %>% left_join(bonds_data, by = c("date")) %>% na.omit all_data %>% ungroup %>% select(-date) %>% ggpairs ecb_scores %>% group_by(method) %>% select(date, method, score) %>% mutate(score = score - lag(score, 1)) %>% spread(method, -date) %>% left_join(bonds_data, by = c("date")) %>% na.omit %>% select(-date) %>% ggpairs
Other functionalities for future work include chaining of bigrams, which allow for deeper sentence analysis, but can be complicated to execute.
df_ecb %>% make_ngrams(2) %>% count_ngrams() %>% visualize_bigrams(min.n = 100)
We calculated the correlation of words within years to try and identify whether there is a change in how words are used across the years. But there seems to be no useful results.
Another consideration is that using the count of words within each set of speech is not sufficient to accurately reflect correlation. We might want to consider calculating the correlation of words within paragraphs.
ecb_year_words <- ecb_unigrams %>% mutate(section = year(date)) library(widyr) # Calculating annual correlation word_cors <- bind_rows( lapply(unique(ecb_year_words$section), function(x) { ecb_year_words %>% filter(section == x) %>% group_by(word) %>% filter(n >= 10) %>% pairwise_cor(word, date, n, sort = TRUE, use = "complete.obs") %>% mutate(section = x) })) word_cors %>% filter(item1 %in% c("growth", "inflation")) %>% group_by(item1, section) %>% top_n(6, desc(abs(correlation))) %>% ungroup() %>% arrange(desc(correlation)) %>% mutate(item2 = reorder(item2, correlation)) %>% filter(section > 2010) %>% ggplot(aes(item2, correlation)) + geom_bar(stat = "identity") + facet_wrap(~ section + item1, ncol = 2, scales = "free") + coord_flip()
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.