Learning outcomes
As part of this tutorial, you will learn how to compute lexicon-based sentiment on a news corpus with sentometrics, split documents into bins to study where sentiment sits within an article, and aggregate bin- and sentence-level sentiment up to documents and days.
knitr::opts_chunk$set(echo = TRUE)
library("data.table") # package bringing in the data.table machinery library("quanteda") # package useful for text and corpus manipulation library("sentometrics") # package containing sentiment computation tools library("lexicon") # package with multiple lexicons library(dplyr) # Bring in Data on UVXY vol_news <- do.call(rbind, lapply(c("TROX"), get_all_news, 90)) # Use the description as the text data set news_from_source <- vol_news %>% select(id, publishedDate, description) %>% rename(date = publishedDate, texts = description) %>% distinct() #TODO: At Some point you need to remove stop words, city names etc. news_sento <- sentometrics::sento_corpus(news_from_source) # Create 2 Lexicons 1 Base one on Absolute values # Absolute values help you to determine sentimeent intensity # Description onf jockers rinker lexicon lexicon::hash_sentiment_jockers_rinker lex <- lexicon::hash_sentiment_jockers_rinker sentoLexicon <- sento_lexicons(list(baseLex = lex, absoluteLex = lex[, .(x = x, y = abs(y))])) lapply(sentoLexicon, head) #
A review of sentiment computation with sentometrics
When using the default settings (i.e., only specifying the how argument), the sentiment of each word in a text is determined according to the provided lexicons. These word-level sentiment scores are then aggregated up to the document level, using the method defined by the how argument, to form one sentiment value per document.
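As a minimal sketch of what this means in practice, consider a tiny handmade corpus (the document id, date, and text below are invented purely for illustration): with how = "counts" the matched word scores are simply summed, whereas how = "proportional" divides that sum by the document's total word count.

# Illustrative only: a one-document corpus with a made-up id, date, and text
toyDf <- data.frame(id = "toy1",
                    date = as.Date("2020-01-01"),
                    texts = "the results were good but the outlook remains bad bad bad",
                    stringsAsFactors = FALSE)
toyCorpus <- sento_corpus(toyDf)

# "counts" sums the matched word scores; "proportional" divides that sum by
# the total number of words in the document
compute_sentiment(toyCorpus, sentoLexicon, how = "counts")
compute_sentiment(toyCorpus, sentoLexicon, how = "proportional")

Applied to our actual news corpus and lexicons: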
sentiment_by_word <- compute_sentiment(news_sento, sentoLexicon, how = "proportional")
head(sentiment_by_word)

sentiment_by_sentence <- compute_sentiment(news_sento, sentoLexicon, how = "proportional",
                                           do.sentence = TRUE)
head(sentiment_by_sentence)
The hypothesis is that the beginning of a news article carries the most sentiment. To test this, we split each article into a fixed number of bins and compute sentiment per bin. The first step is to tokenise the documents.
volnews2Toks <- tokens(news_sento, remove_punct = TRUE, remove_numbers = TRUE)
volnews2Toks <- tokens_tolower(volnews2Toks)
head(volnews2Toks)
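The stop-word removal flagged in the earlier TODO could be slotted in at this point. A minimal sketch using quanteda, assuming the built-in English stop-word list is sufficient (city names or other custom terms would have to be appended to the pattern yourself):

# Optional cleaning step (not part of the original pipeline): drop standard
# English stop words from the tokens before binning
volnews2Toks <- tokens_remove(volnews2Toks, pattern = stopwords("en"))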
We now have a list of character vectors, one for each document. The second step is to split each of these vectors into a list of smaller vectors, one vector per bin. The final structure is thus a named list with one element per document, where each element is itself a list of nBins character vectors:
nBins <- 5
volnews2Bins <- vector("list", length(volnews2Toks))  # one element per document
for (i in seq_along(volnews2Toks)) {
  volnews2Bins[[i]] <- lapply(parallel::splitIndices(length(volnews2Toks[[i]]), nBins),
                              function(x) volnews2Toks[[i]][x])
}
names(volnews2Bins) <- names(volnews2Toks)
head(volnews2Bins[[1]], 2)
Looking good! The last step is to get a sentiment value for each bin. Implementing this approach with compute_sentiment() requires us to cheat a little: we trick compute_sentiment() into believing that the bins are actually sentences. This is done by passing the volnews2Bins object we just created to the tokens argument of the function call. The function then treats each character vector in volnews2Bins as a sentence and computes sentiment for it.
sentiment <- compute_sentiment(news_sento, sentoLexicon, how = "proportional",
                               do.sentence = TRUE, tokens = volnews2Bins)
head(sentiment)
Plot what you got
You could also bin by date, for example every 5 days (a sketch of this follows after the plot below).
par(mfrow = c(1, 2))
plot(sentiment[, .(s = mean(`absoluteLex--dummyFeature`)), by = sentence_id],
     type = "l", ylab = "Mean absolute sentiment", xlab = "Bin")
boxplot(sentiment$`absoluteLex--dummyFeature` ~ sentiment$sentence_id,
        ylab = "Absolute sentiment", xlab = "Bin", outline = FALSE, range = 0.5)
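As a sketch of the 5-day idea mentioned above (the window construction and object name are assumptions, not part of the tutorial's pipeline), you could bucket the bin-level sentiment into 5-day windows of the publication date with data.table and average the absolute sentiment per window:

# Illustrative only: mean absolute sentiment per 5-day window of publication dates
sentiment5d <- sentiment[, .(meanAbsSent = mean(`absoluteLex--dummyFeature`)),
                         by = .(window = cut(as.Date(date), breaks = "5 days"))]
head(sentiment5d)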
With our earlier computation of sentiment using do.sentence = TRUE, we computed sentiment for sentences and bins. For our next application, we need to aggregate this sentence- and bin-level sentiment into document-level sentiment. One option is to use aggregate() with one of the methods shown above. Note the use of do.full = FALSE to stop the aggregation at the document level (otherwise, it would aggregate directly up to a time series).
docsSentiment <- aggregate(sentiment, ctr_agg(howDocs = "equal_weight"), do.full = FALSE)
lapply(list(sentiment = sentiment, docsSentiment = docsSentiment), head)
Another option is a custom weighting: give less weight to a bin that already carries a considerable amount of sentiment. First, we define a weight vector that halves the weight of the third bin while still summing to one.
w <- rep(1 / (nBins - 0.5), nBins)
w[3] <- w[3] * 0.5
list(sum = sum(w), w = w)
Second, we create a function to aggregate bins based on our customized weights. This is nothing more than the sum of element-wise vector multiplications.
aggregate_bins <- function(x, w) sum(x * w)
aggregate_bins(c(1, 2, 3), c(0.5, 0.25, 0.25))

docsSentiment <- sentiment[, c(word_count = sum(word_count),
                               lapply(.SD, aggregate_bins, w = w)),
                           by = .(id, date),
                           .SDcols = tail(names(sentiment), -4)]
head(docsSentiment)
sentimentValues <- list()
sentimentValues$default <- compute_sentiment(news_sento, sentoLexicon, how = "proportional")
sentimentValues$uShaped <- compute_sentiment(news_sento, sentoLexicon, how = "UShaped")
sentimentValues$sentences <- compute_sentiment(news_sento, sentoLexicon, how = "proportional",
                                               do.sentence = TRUE)
sentimentValues$bins <- compute_sentiment(news_sento, sentoLexicon, tokens = volnews2Bins,
                                          how = "proportional", do.sentence = TRUE)
lapply(sentimentValues[c(1, 3)], head, n = 3)

sentimentValues$sentences <- aggregate(sentimentValues$sentences,
                                       ctr_agg(howDocs = "equal_weight"), do.full = FALSE)
sentimentValues$sentences <- aggregate(sentimentValues$sentences,
                                       ctr_agg(howTime = c("linear"), by = "day", lag = 3))
sentimentValues$bins <- sentimentValues$bins[, c(word_count = sum(word_count),
                                                 lapply(.SD, aggregate_bins, w = w)),
                                             by = .(id, date),
                                             .SDcols = tail(names(sentiment), -4)]
lapply(sentimentValues[3:4], head, n = 3)

sentimentValues <- lapply(sentimentValues, function(x) x[order(`baseLex--dummyFeature`)])
head(sentimentValues$default)
This was all done with a document-level sentiment approach.
Next, you instead want to aggregate sentiment by day.
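A minimal sketch of that per-day aggregation, assuming an equal weight for every document published on the same day (the ctr_agg(howTime = ..., by = "day") route used above is the more formal alternative):

# Equal-weight average of the document-level scores per publication day
dailySentiment <- docsSentiment[, lapply(.SD, mean),
                                by = .(date = as.Date(date)),
                                .SDcols = c("baseLex--dummyFeature",
                                            "absoluteLex--dummyFeature")]
head(dailySentiment)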
Next up: add straightforward topic modelling functionality to the add_features() function (or as part of the sento_train() function); a keyword-based stand-in using the existing add_features() is sketched below.
Also implement a sento_train() function to, for instance, generate a lexicon from a corpus.
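Topic modelling is not built in yet, but a rough stand-in is already possible by tagging documents with keyword-based features through add_features(). A sketch, where the feature names and keyword lists are invented purely for illustration:

# Hypothetical keyword groups acting as crude "topics"
news_sento_feat <- add_features(news_sento,
                                keywords = list(earnings = c("earnings", "profit", "guidance"),
                                                legal = c("lawsuit", "settlement")),
                                do.binary = TRUE)
head(quanteda::docvars(news_sento_feat))

Sentiment computed on such a corpus then comes out per lexicon-feature combination rather than only for the dummyFeature.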
library(ggplot2)

by_sentence_for_graph <- sentimentValues$sentences$time
head(by_sentence_for_graph)

by_sentence_for_graph %>%
  ggplot(aes(x = date, y = `baseLex--dummyFeature`)) +
  geom_point()