Learning outcomes
As part of this tutorial, you will learn how to compute lexicon-based sentiment on a news corpus with sentometrics, split documents into bins to study where sentiment sits within an article, and aggregate bin- and sentence-level sentiment up to documents and days.
knitr::opts_chunk$set(echo = TRUE)
library("data.table") # package bringing in the data.table machinery library("quanteda") # package useful for text and corpus manipulation library("sentometrics") # package containing sentiment computation tools library("lexicon") # package with multiple lexicons library(dplyr) # Bring in Data on UVXY vol_news <- do.call(rbind, lapply(c("TROX"), get_all_news, 90)) # Use the description as the text data set news_from_source <- vol_news %>% select(id, publishedDate, description) %>% rename(date = publishedDate, texts = description) %>% distinct() #TODO: At Some point you need to remove stop words, city names etc. news_sento <- sentometrics::sento_corpus(news_from_source) # Create 2 Lexicons 1 Base one on Absolute values # Absolute values help you to determine sentimeent intensity # Description onf jockers rinker lexicon lexicon::hash_sentiment_jockers_rinker lex <- lexicon::hash_sentiment_jockers_rinker sentoLexicon <- sento_lexicons(list(baseLex = lex, absoluteLex = lex[, .(x = x, y = abs(y))])) lapply(sentoLexicon, head) #
A review of sentiment computation with sentometrics
When using the default settings (i.e., only specifying the how argument), the sentiment of each word in a text is determined according to the provided lexicons. These word-level sentiment scores are then aggregated up to the document level, using the method defined by the how argument, to form one sentiment value per document.
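As a minimal sketch of what this means in practice, consider a tiny handmade corpus (the document id, date, and text below are invented purely for illustration): with how = "counts" the matched word scores are simply summed, whereas how = "proportional" divides that sum by the document's total word count.

# Illustrative only: a one-document corpus with a made-up id, date, and text
toyDf <- data.frame(id = "toy1",
                    date = as.Date("2020-01-01"),
                    texts = "the results were good but the outlook remains bad bad bad",
                    stringsAsFactors = FALSE)
toyCorpus <- sento_corpus(toyDf)

# "counts" sums the matched word scores; "proportional" divides that sum by
# the total number of words in the document
compute_sentiment(toyCorpus, sentoLexicon, how = "counts")
compute_sentiment(toyCorpus, sentoLexicon, how = "proportional")

Applied to our actual news corpus and lexicons: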
sentiment_by_word <- compute_sentiment(news_sento, sentoLexicon, how = "proportional")
head(sentiment_by_word)

sentiment_by_sentence <- compute_sentiment(news_sento, sentoLexicon, how = "proportional",
                                           do.sentence = TRUE)
head(sentiment_by_sentence)
The hypothesis is that the beginning of a news article carries the most sentiment. To test this, we split each article into a fixed number of bins and compute sentiment per bin. The first step is to tokenise the documents.
volnews2Toks <- tokens(news_sento, remove_punct = TRUE, remove_numbers = TRUE)
volnews2Toks <- tokens_tolower(volnews2Toks)
head(volnews2Toks)
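The stop-word removal flagged in the earlier TODO could be slotted in at this point. A minimal sketch using quanteda, assuming the built-in English stop-word list is sufficient (city names or other custom terms would have to be appended to the pattern yourself):

# Optional cleaning step (not part of the original pipeline): drop standard
# English stop words from the tokens before binning
volnews2Toks <- tokens_remove(volnews2Toks, pattern = stopwords("en"))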
We now have a list of character vectors, one for each document. The second step is to split each of these vectors into a list of smaller vectors, one vector per bin. The final structure is thus a named list with one element per document, where each element is itself a list of nBins character vectors:
nBins <- 5
volnews2Bins <- vector("list", length(volnews2Toks))  # one element per document
for (i in seq_along(volnews2Toks)) {
  volnews2Bins[[i]] <- lapply(parallel::splitIndices(length(volnews2Toks[[i]]), nBins),
                              function(x) volnews2Toks[[i]][x])
}
names(volnews2Bins) <- names(volnews2Toks)
head(volnews2Bins[[1]], 2)
Looking good! The last step is to get a sentiment value for each bin. Implementing this approach with compute_sentiment() requires us to cheat a little: we trick compute_sentiment() into believing that the bins are actually sentences. This is done by passing the volnews2Bins object we just created to the tokens argument of the function call. The function then treats each character vector in volnews2Bins as a sentence and computes sentiment for it.
sentiment <- compute_sentiment(news_sento, sentoLexicon, how = "proportional",
                               do.sentence = TRUE, tokens = volnews2Bins)
head(sentiment)
Plot what you got
You could also bin by date, for example every 5 days (a sketch of this follows after the plot below).
par(mfrow = c(1, 2))
plot(sentiment[, .(s = mean(`absoluteLex--dummyFeature`)), by = sentence_id],
     type = "l", ylab = "Mean absolute sentiment", xlab = "Bin")
boxplot(sentiment$`absoluteLex--dummyFeature` ~ sentiment$sentence_id,
        ylab = "Absolute sentiment", xlab = "Bin", outline = FALSE, range = 0.5)
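As a sketch of the 5-day idea mentioned above (the window construction and object name are assumptions, not part of the tutorial's pipeline), you could bucket the bin-level sentiment into 5-day windows of the publication date with data.table and average the absolute sentiment per window:

# Illustrative only: mean absolute sentiment per 5-day window of publication dates
sentiment5d <- sentiment[, .(meanAbsSent = mean(`absoluteLex--dummyFeature`)),
                         by = .(window = cut(as.Date(date), breaks = "5 days"))]
head(sentiment5d)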
With our earlier computation of sentiment using do.sentence = TRUE, we computed sentiment for sentences and bins. For our next application, we need to aggregate this sentence- and bin-level sentiment into document-level sentiment. One option is to use aggregate() with one of the methods shown above. Note the use of do.full = FALSE to stop the aggregation at the document level (otherwise, it would aggregate directly up to a time series).
docsSentiment <- aggregate(sentiment, ctr_agg(howDocs = "equal_weight"), do.full = FALSE)
lapply(list(sentiment = sentiment, docsSentiment = docsSentiment), head)
Another option is a custom weighting: give less weight to a bin that already carries a considerable amount of sentiment. First, we define a weight vector that halves the weight of the third bin while still summing to one.
w <- rep(1 / (nBins - 0.5), nBins)
w[3] <- w[3] * 0.5
list(sum = sum(w), w = w)
Second, we create a function to aggregate bins based on our customized weights. This is nothing more than the sum of element-wise vector multiplications.
aggregate_bins <- function(x, w) sum(x * w)
aggregate_bins(c(1, 2, 3), c(0.5, 0.25, 0.25))

docsSentiment <- sentiment[, c(word_count = sum(word_count),
                               lapply(.SD, aggregate_bins, w = w)),
                           by = .(id, date),
                           .SDcols = tail(names(sentiment), -4)]
head(docsSentiment)
sentimentValues <- list()
sentimentValues$default <- compute_sentiment(news_sento, sentoLexicon, how = "proportional")
sentimentValues$uShaped <- compute_sentiment(news_sento, sentoLexicon, how = "UShaped")
sentimentValues$sentences <- compute_sentiment(news_sento, sentoLexicon, how = "proportional",
                                               do.sentence = TRUE)
sentimentValues$bins <- compute_sentiment(news_sento, sentoLexicon, tokens = volnews2Bins,
                                          how = "proportional", do.sentence = TRUE)
lapply(sentimentValues[c(1, 3)], head, n = 3)

sentimentValues$sentences <- aggregate(sentimentValues$sentences,
                                       ctr_agg(howDocs = "equal_weight"), do.full = FALSE)
sentimentValues$sentences <- aggregate(sentimentValues$sentences,
                                       ctr_agg(howTime = c("linear"), by = "day", lag = 3))
sentimentValues$bins <- sentimentValues$bins[, c(word_count = sum(word_count),
                                                 lapply(.SD, aggregate_bins, w = w)),
                                             by = .(id, date),
                                             .SDcols = tail(names(sentiment), -4)]
lapply(sentimentValues[3:4], head, n = 3)

sentimentValues <- lapply(sentimentValues, function(x) x[order(`baseLex--dummyFeature`)])
head(sentimentValues$default)
This was all done with a document-level sentiment approach.
Next, you instead want to aggregate sentiment by day.
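A minimal sketch of that per-day aggregation, assuming an equal weight for every document published on the same day (the ctr_agg(howTime = ..., by = "day") route used above is the more formal alternative):

# Equal-weight average of the document-level scores per publication day
dailySentiment <- docsSentiment[, lapply(.SD, mean),
                                by = .(date = as.Date(date)),
                                .SDcols = c("baseLex--dummyFeature",
                                            "absoluteLex--dummyFeature")]
head(dailySentiment)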
Next up: add straightforward topic modelling functionality to the add_features() function (or as part of the sento_train() function); a keyword-based stand-in using the existing add_features() is sketched below.
Also implement a sento_train() function to, for instance, generate a lexicon from a corpus.
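Topic modelling is not built in yet, but a rough stand-in is already possible by tagging documents with keyword-based features through add_features(). A sketch, where the feature names and keyword lists are invented purely for illustration:

# Hypothetical keyword groups acting as crude "topics"
news_sento_feat <- add_features(news_sento,
                                keywords = list(earnings = c("earnings", "profit", "guidance"),
                                                legal = c("lawsuit", "settlement")),
                                do.binary = TRUE)
head(quanteda::docvars(news_sento_feat))

Sentiment computed on such a corpus then comes out per lexicon-feature combination rather than only for the dummyFeature.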
library(ggplot2)

by_sentence_for_graph <- sentimentValues$sentences$time
head(by_sentence_for_graph)

by_sentence_for_graph %>%
  ggplot(aes(x = date, y = `baseLex--dummyFeature`)) +
  geom_point()