
Executive summary

In the report below, we perform text sentiment analysis on ECB press conference speeches. Four different sentiment dictionaries are used for comparison, with words identified as positive contributing a positive score and negative words contributing a negative score. The analysis is done on a unigram (single-word) basis, with an adjustment for negation words through bigrams. The sentiment indicator is then built from the sum of the scores, and its change is measured over time.

knitr::opts_chunk$set(echo = FALSE)
library(dplyr)
library(tidytext)
library(lubridate)
library(ggplot2)
library(tidyverse)
library(GGally)
source("text_analytics.R")  # Stores many of the functions used below

Methodology

Read in ECB speeches as scraped

The transcripts of the ECB press conference speeches were scraped separately from the ECB website, and we start by reading them into data frames. We keep the prepared speeches and the answers given by ECB officials, filtering out the questions asked by the press audience.

df <- read.csv("data/ecb_speeches.csv", stringsAsFactors = FALSE) %>%
  mutate(date = ymd(date), type = as.factor(type), speaker = as.factor(speaker))

# Keep the press speeches and answers; questions from the audience are dropped
df_ecb <- df %>%
  filter(type %in% c("speech", "answer")) %>%
  group_by(date) %>%
  summarise(text = paste0(text, collapse = "."))

str(df_ecb)

Preprocess: Tokenising words and removing stop words

The next step is to convert paragraphs of text into individual words, a process known as tokenising. We start by analysing single words, also referred to as unigrams.

In addition, we remove stop words: very common words such as "is" and "the" that convey little meaning on their own. We also add our own custom stop words to remove years.

Performing a simple count analysis on the unigrams, we observe that words like "euro" and "growth" are used very often in the speeches.
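The tokenising is handled by make_ngrams() from text_analytics.R, whose body is not reproduced here. A minimal, hypothetical sketch of what such a unigram helper could look like using tidytext primitives (the function name and arguments below are illustrative assumptions):

library(dplyr)
library(tidytext)

# Hypothetical sketch: unnest_tokens() splits the text column into one word per
# row, and anti_join() then drops the stop words
make_unigrams_sketch <- function(df, stop_word_list = tidytext::stop_words) {
  df %>%
    unnest_tokens(word, text, token = "words") %>%
    anti_join(stop_word_list, by = "word")
}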

# Remove years as stop words
custom_stop_words <- bind_rows(tibble(word = as.character(1980:2030),
                                      lexicon = "custom"),
                               stop_words)

ecb_unigrams <- df_ecb %>%
  make_ngrams(custom_stop_words = custom_stop_words) 

ecb_unigrams %>%
  count_ngrams %>%
  filter(n > 1000) %>%
  mutate(word = reorder(word, n)) %>%
  ggplot(aes(word, n)) +
    geom_col() +
    xlab(NULL) +
    coord_flip()

Single word sentiment analysis

List of top words and their associated sentiments

After obtaining the counts of the words, we now need to find their associated sentiments. We use four dictionaries to measure sentiment; however, these dictionaries are not optimised for central bank or macroeconomic text, so the results may be suboptimal.

| Dictionary | Description |
|------------|-------------|
| AFINN | Scale of -5 (very negative) to +5 (very positive) |
| Loughran | For financial applications, various types of sentiment |
| NRC | Crowd-sourced, emotion-based categories |
| Bing | Labels words as positive or negative |

An extract of the top words with their associated sentiments shows that the sentiment assigned to a given word varies considerably across dictionaries.
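The dictionaries themselves can be inspected directly through tidytext (in recent versions of tidytext, the AFINN, NRC and Loughran lexicons are downloaded via the textdata package on first use):

library(tidytext)

get_sentiments("bing")      # word + positive/negative label
get_sentiments("afinn")     # word + integer value from -5 to +5
get_sentiments("loughran")  # word + categories such as negative, positive, uncertainty, litigious
get_sentiments("nrc")       # word + emotion categories such as joy, trust, fear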

# Compare how word categorization differs across dictionaries
ecb_unigrams %>%
  count_ngrams %>%
  add_sentiments(c("afinn", "loughran", "nrc", "bing")) %>%
  compare_words_sentiment

Calculating a sentiment score

We then calculate a sentiment score by adding to the score when a positive word appears and subtracting when a negative word appears. In the case of the AFINN dictionary, we multiply the word count n by the word's sentiment value before summing. Here we observe that the indicators are fairly different across dictionaries, suggesting that the choice of dictionary has a significant impact on the final results.

Using zero as a baseline, we observe that most of the dictionaries tend to assign a positive overall sentiment to the speeches, but Loughran views the speeches negatively in most cases.
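The scoring itself is done by add_sentiments() and calc_sentiment_score() in text_analytics.R, which are not reproduced here. A rough, hypothetical sketch of the underlying logic, assuming a tibble of word counts with columns date, word and n:

library(dplyr)
library(tidytext)

# Sketch only: +1/-1 per occurrence for a binary lexicon such as Bing ...
score_with_bing <- function(unigram_counts) {
  unigram_counts %>%
    inner_join(get_sentiments("bing"), by = "word") %>%
    mutate(value = ifelse(sentiment == "positive", 1, -1)) %>%
    group_by(date) %>%
    summarise(score = sum(n * value), .groups = "drop")
}

# ... and word count multiplied by the dictionary value (-5 to +5) for AFINN
score_with_afinn <- function(unigram_counts) {
  unigram_counts %>%
    inner_join(get_sentiments("afinn"), by = "word") %>%   # AFINN provides a `value` column
    group_by(date) %>%
    summarise(score = sum(n * value), .groups = "drop")
}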

ecb_scores <- ecb_unigrams %>%
  add_sentiments(all = TRUE) %>%
  group_by(date) %>%
  calc_sentiment_score(wt = "n")

g <- ecb_scores %>%  
  ggplot(aes(date, score, colour = method)) 
g + geom_line() + 
  geom_hline(yintercept = 0) +
  facet_wrap(~method, ncol = 1, scales = "free_y")

High scoring words from the dictionary

To account for the differences, we extract the top positive and negative contributors to the scores for each dictionary, and see that they are indeed picking out quite different words.

ecb_unigrams %>%
  add_sentiments(all = TRUE) %>%
  group_by(word) %>%
  calc_sentiment_score(wt = "n") %>%
  group_by(method) %>%
  top_n(10, abs(score)) %>%
  ungroup %>%
  mutate(word = reorder(word, score)) %>% 
  ggplot(aes(word, score, fill=factor(score > 0))) +
    geom_col(show.legend = FALSE) +
    facet_wrap(~method, scales = "free_y") +
    coord_flip()

Calculate ranking for tf-idf

It is arguable whether a pure word count accurately reflects the sentiment of a speech, as words like "growth" or "inflation" may be used very often in any ECB speech. Hence we may wish to adjust for how frequently these terms appear across documents.

We use term frequency-inverse document frequency (tf-idf) as a weight. tf-idf measures how distinctive a particular word is relative to the other documents: a word that is relatively unique to a document is given a higher weight, while a word that appears frequently in every single document is given a smaller weight.

Below we see, for each year, the words given high weights due to the uniqueness of their appearance.
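As a reminder of the definition behind tidytext's bind_tf_idf(), here is a manual calculation (a sketch only, assuming the same hypothetical unigram_counts tibble with columns date, word and n as above):

# tf     = n / total number of words in the document (here, one press conference date)
# idf    = ln(number of documents / number of documents containing the word)
# tf-idf = tf * idf
manual_tf_idf <- unigram_counts %>%
  group_by(date) %>%
  mutate(tf = n / sum(n)) %>%            # term frequency within each speech
  ungroup() %>%
  add_count(word, name = "n_docs") %>%   # number of speeches containing the word
  mutate(idf = log(n_distinct(date) / n_docs),
         tf_idf = tf * idf) %>%
  select(-n_docs)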

unigrams_wt <- ecb_unigrams %>%
  group_by(date) %>%
  count_ngrams %>%
  bind_tf_idf(word, date, n) %>%
  ungroup

unigrams_wt %>% 
  ungroup %>%
  filter(year(date) > 2008) %>%
  group_by(year = as.factor(year(date))) %>% 
  top_n(15, tf_idf) %>% 
  ungroup %>%
  ggplot(aes(factor(word, levels = rev(unique(word))), tf_idf, fill = year)) +
    geom_col(show.legend = FALSE) +
    labs(x = NULL, y = "tf-idf") +
    facet_wrap(~year, ncol = 3, scales = "free") +
    coord_flip()

Calculation of weighted scores

With the weights, we multiply each word's sentiment by its tf-idf weight to give a weighted score. To compare the weighted and unweighted scores on the same scale, we standardise both. Directionally, the two look quite similar.

weighted_scores <- unigrams_wt %>%
  add_sentiments(all = TRUE) %>%
  group_by(date, method) %>%
  calc_sentiment_score(wt = "tf_idf") 

combined_scores <- bind_rows(ecb_scores %>% mutate(weight = "unwt"),
                            weighted_scores %>% mutate(weight = "wt"))

# Centre all scores and plot weighted and unweighted on the same chart
combined_scores %>%
  group_by(method, weight) %>%
  mutate(score = scale(score)) %>%
  ggplot(aes(date, score, colour = weight)) +
  geom_line() +
  facet_wrap(~method, ncol = 1, scales = "free_y") +
  geom_hline(yintercept = 0, lwd=0.3, lty = 5) +
  ggtitle("Standardized sentiment scores")

Top contributors to weighted score

Here is the list of top contributors to the weighted score for each year after 2011.

unigrams_wt %>%
  add_sentiments(all = TRUE) %>%
  group_by(date, word) %>%
  calc_sentiment_score(wt = "tf_idf") %>%
  group_by(year = year(date), word, method) %>%
  summarise(score = sum(score)) %>%
  group_by(year, method) %>%
  filter(year > 2011) %>%
  top_n(10, abs(score)) %>%
  ungroup %>%
  ggplot(aes(factor(word, levels = unique(word[order(score)])), score, fill = factor(sign(score)))) +
    geom_col(show.legend = FALSE) +
    labs(x = NULL, y = "weighted score") +
    facet_grid(year~method, scales = "free") +
    coord_flip()

Diagnostics of using different dictionaries

Comparison of weighted scores across the different dictionaries

Correlations are quite weak across dictionaries, as well as between the weighted and unweighted versions of each score.

combined_scores %>%
  mutate(method = paste(method, weight)) %>%
  select(-negative, -positive, -weight) %>%
  spread(method, score) %>%
  select(-date) %>%
  ggpairs

Adjusting for bigrams

Removing negation words

Unigram analysis may not be sufficient, as some words are used with the opposite intention, as in "negative growth" or "no money". Hence we extend the analysis to bigrams, which are pairs of adjacent words.

From an algorithmic perspective, note that for bigram analysis we cannot remove stop words when tokenising, as words like "not" and "no" are themselves stop words.

We identify common negation words such as "not", "no", etc.

From the identified word list, we see that the unigram analysis would have misclassified the sentiment of the words following these negations.
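make_ngrams(2, remove_stop_words = FALSE) again comes from text_analytics.R. A hypothetical sketch of the bigram step using tidytext and tidyr:

library(dplyr)
library(tidyr)
library(tidytext)

# Sketch only: tokenise into two-word windows, then split each bigram into
# word1/word2 so that negation words can be matched on word1
make_bigrams_sketch <- function(df) {
  df %>%
    unnest_tokens(bigram, text, token = "ngrams", n = 2) %>%
    separate(bigram, into = c("word1", "word2"), sep = " ") %>%
    count(date, word1, word2, sort = TRUE)
}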

bigrams_separated <- df_ecb %>%
  make_ngrams(2, remove_stop_words = FALSE)

negation_words <- c("not", "no", "never", "without", "negative", "weak")

negated_words <- bigrams_separated %>%
  filter(word1 %in% negation_words) %>%
  ungroup()

negated_words %>% 
  rename(word = word2) %>% 
  add_sentiments %>% 
  filter(sentiment %in% c("positive", "negative"))

Adjusting for negated words

To adjust for a negated word, we first remove its original contribution to the score and then add the opposite contribution; together, this is equivalent to adding twice the correctly-signed contribution. For example, a positive word counted as +1 that is in fact negated should count as -1, an adjustment of -2.
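As a toy illustration, with made-up numbers, of the tf_idf2 adjustment applied in the next chunk:

# Suppose a positive word (value +1) appears n = 10 times in a speech with
# tf-idf weight 0.02, and n2 = 3 of those occurrences are preceded by a
# negation word. The negated share of the weight is (n2 / n) * tf_idf, and the
# correction removes that share once and flips it once, i.e. twice with a
# minus sign.
n <- 10; n2 <- 3; tf_idf <- 0.02; value <- 1

unadjusted <- value * tf_idf                      # 0.020, all occurrences counted as positive
adjustment <- value * (n2 * 2 * -1) / n * tf_idf  # -0.012, same form as the tf_idf2 formula below
adjusted   <- unadjusted + adjustment             # 0.008 = (7 positive - 3 negated) / 10 * 0.02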

##################
# adjust weighted scores
##################

# Calculate negative scores from negated words
negative_scores_by_word <- negated_words %>% 
  rename(n2 = n, word = word2) %>%
  inner_join(unigrams_wt, by = c("date", "word")) %>%
  mutate(tf_idf2 = (n2 * 2 * -1)/n * tf_idf) %>%  # We subtract twice to adjust for the "wrong" impact on the raw score
  add_sentiments(all = TRUE) %>%
  group_by(date, word, word1) %>%
  calc_sentiment_score(wt = "tf_idf2") 

# Total negative scores by date
negative_scores <- negative_scores_by_word %>%
  group_by(date, method) %>%
  summarise(score = sum(score))

# Join negated-word scores to the original weighted scores and recalculate the combined score
bigram_adjusted_scores <- negative_scores %>%
  rename(score2 = score) %>%
  right_join(weighted_scores, by = c("date", "method")) %>%
  mutate(score3 = ifelse(is.na(score2), score, score + score2))

# Plot the original, negation-adjustment and combined weighted scores
bigram_adjusted_scores %>% select(date, method, score, score2, score3) %>%
  gather(score_type, scores, -date, -method) %>%
  ggplot(aes(date, scores, colour = score_type)) +
  geom_line() +
  facet_wrap(~method, ncol=1, scales = "free_y") +
  geom_hline(yintercept = 0, lwd=0.3, lty = 5) 

# Plot words with negated contributions
# negative_scores_by_word %>%
#   group_by(word, word1) %>%
#   summarise(score = sum(score)) %>%
#   ungroup() %>%
#   top_n(20, abs(score)) %>%
#   mutate(full_word = reorder(paste(word1, word), score)) %>%
#   ggplot(aes(full_word, score, fill = score > 0)) +
#     geom_col(show.legend = FALSE) +
#     coord_flip()



Comparison with asset returns

To examine whether the sentiment indicators carry information for markets, we read in a set of asset prices, convert them to simple returns, and compare those returns against the change in sentiment scores between speeches.

bonds_raw <- read.csv("data/asset_data.csv")

# Convert price levels into simple period-on-period returns
bonds_data <- bonds_raw %>%
  mutate(across(-Date, ~ . / lag(.) - 1, .names = "{.col}_ret")) %>%
  mutate(date = mdy(Date)) %>%
  select(date, ends_with("ret"))

# Changes in the bigram-adjusted, weighted scores vs asset returns
all_data <- bigram_adjusted_scores %>%
  group_by(method) %>%
  mutate(adj_score = score3) %>%
  select(date, method, adj_score) %>%
  mutate(adj_score = c(NA, diff(adj_score))) %>%    # Measure changes in sentiment between speeches
  spread(method, adj_score) %>%
  left_join(bonds_data, by = "date") %>%
  na.omit

all_data %>%
  ungroup %>%
  select(-date) %>%
  ggpairs

# Repeat using the unweighted, unadjusted scores
ecb_scores %>%
  group_by(method) %>%
  select(date, method, score) %>%
  mutate(score = score - lag(score, 1)) %>%    # Change in sentiment between speeches
  spread(method, score) %>%
  left_join(bonds_data, by = "date") %>%
  na.omit %>%
  select(-date) %>%
  ggpairs

Other text analytics tools

Visualizing network of bigrams

Other functionality for future work includes chaining bigrams together into a network, which allows for deeper sentence-level analysis but can be complicated to execute.
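visualize_bigrams() is defined in text_analytics.R. A hypothetical sketch of such a network plot, following the usual igraph/ggraph recipe and assuming a bigram count table with columns word1, word2 and n:

library(dplyr)
library(igraph)
library(ggraph)

# Sketch only: keep the most frequent word pairs, convert the pairs into a
# graph and lay it out with a force-directed ("fr") layout
plot_bigram_network <- function(bigram_counts, min.n = 100) {
  bigram_counts %>%
    filter(n >= min.n) %>%
    graph_from_data_frame() %>%
    ggraph(layout = "fr") +
      geom_edge_link() +
      geom_node_point(colour = "lightblue", size = 3) +
      geom_node_text(aes(label = name), vjust = 1, hjust = 1)
}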

df_ecb %>%
  make_ngrams(2) %>%
  count_ngrams() %>%
  visualize_bigrams(min.n = 100)

Counting and correlation among sections

We calculate the correlation of words within each year to try to identify whether there is a change in how words are used together across the years, but there appear to be no particularly useful results.

Another consideration is that the count of words within each set of speeches may be too coarse a unit to reflect correlation accurately. We might instead calculate the correlation of words within paragraphs; a sketch of this follows the charts below.

ecb_year_words <- ecb_unigrams %>%
  mutate(section = year(date))

library(widyr)

# Calculating annual correlation
word_cors <- bind_rows(
  lapply(unique(ecb_year_words$section), 
         function(x) {
           ecb_year_words %>%
             filter(section == x) %>%
             group_by(word) %>%
             filter(n >= 10) %>%
             pairwise_cor(word, date, n, sort = TRUE, use = "complete.obs") %>%
             mutate(section = x)
}))

word_cors %>%
  filter(item1 %in% c("growth", "inflation")) %>%
  group_by(item1, section) %>%
  top_n(6, abs(correlation)) %>%
  ungroup() %>%
  arrange(desc(correlation)) %>%
  mutate(item2 = reorder(item2, correlation)) %>%
  filter(section > 2010) %>%
  ggplot(aes(item2, correlation)) +
    geom_bar(stat = "identity") +
    facet_wrap(~ section + item1, ncol = 2, scales = "free") +
    coord_flip()
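
A hypothetical sketch of the paragraph-level variant suggested above, assuming the raw scraped table df keeps one row per paragraph of speech text (if it does not, a paragraph id would need to be constructed at the scraping stage):

library(dplyr)
library(tidytext)
library(widyr)

# Sketch only: give each paragraph an id, tokenise it, then compute the phi
# correlation between words based on their co-occurrence within paragraphs
paragraph_cors <- df %>%
  filter(type %in% c("speech", "answer")) %>%
  mutate(paragraph = row_number()) %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  group_by(word) %>%
  filter(n() >= 20) %>%                  # keep reasonably common words
  pairwise_cor(word, paragraph, sort = TRUE)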

