knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.width = 7, fig.height = 4, fig.align = "center")
This tutorial provides a guide on how to perform textual sentiment computation with the sentometrics package.
Preparation
library("sentometrics") library("quanteda") library("tm") library("stringi") library("data.table") library("lexicon") data("usnews") data("list_lexicons") data("list_valence_shifters")
A simple calculation of sentiment. Since the two lexicons used are so-called binary, every final score, with the "counts" option, is the difference between the number of positive lexicon words (those with a score of 1) and the number of negative lexicon words (those with a score of -1) detected in the text.
s <- compute_sentiment(
  usnews[["texts"]],
  sento_lexicons(list_lexicons[c("GI_en", "LM_en")]),
  how = "counts"
)
s
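As a minimal cross-check of this definition, one can count the positive and negative lexicon matches by hand for a single text. This is a sketch: the boundary-based tokenization below is an assumption and may deviate slightly from the tokenizer used internally by compute_sentiment().

# manual counts-based score for the first text (tokenization is an assumption)
lexGI <- sento_lexicons(list_lexicons["GI_en"])[["GI_en"]]
tksFirst <- stri_split_boundaries(
  stri_trans_tolower(usnews[["texts"]][1]),
  type = "word", skip_word_none = TRUE
)[[1]]
sum(tksFirst %in% lexGI[y == 1, x]) - sum(tksFirst %in% lexGI[y == -1, x])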
Sentiment computation with a sento_corpus object

The same simple calculation as above, but using a sento_corpus object and the metadata features in the corpus. A "date" variable is always part of any sento_corpus object, and is also considered a docvar.
corpus <- sento_corpus(usnews)
corpus
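Since a sento_corpus builds on a quanteda corpus, a quick way to inspect the document variables, including the mandatory "date" column and the numeric features from usnews, is quanteda's docvars() (a sketch):

# peek at the document variables travelling along with the corpus
head(docvars(corpus))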
lexicons <- sento_lexicons(list_lexicons[c("GI_en", "LM_en", "HENRY_en")])
s <- compute_sentiment(corpus, lexicons, how = "counts")
head(s)
Sentiment computation with a tm SimpleCorpus object

Another simple sentiment calculation, this time using a tm package corpus object. Super flexible! The output is deliberately slightly different, as the scores are divided by the total number of words.
corpus <- SimpleCorpus(VectorSource(usnews[["texts"]]))
corpus
lexicons <- sento_lexicons(list_lexicons[c("GI_en", "LM_en", "HENRY_en")])
s <- compute_sentiment(corpus, lexicons, how = "proportional")
s
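As a sanity check of that normalization, and assuming the output keeps its word_count column, the "proportional" scores should equal the "counts" scores divided by the per-document word counts (a sketch):

# "proportional" = "counts" / word_count, checked for the GI_en lexicon
sCnt <- compute_sentiment(corpus, lexicons, how = "counts")
all.equal(s$GI_en, sCnt$GI_en / sCnt$word_count)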
This example showcases some more flexibility. You can tokenize your corpus outside the sentiment computation function call, to control exactly which words the lexicons will match against.
corpus <- sento_corpus(usnews)
tks <- as.list(tokens(corpus, what = "fastestword"))
lexicons <- sento_lexicons(list_lexicons[c("GI_en", "LM_en", "HENRY_en")])
compute_sentiment(as.character(corpus), lexicons, how = "counts", tokens = tks)
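To illustrate that control, one could for instance lowercase the tokens before the call so they line up better with the lowercase lexicon entries. This is a sketch, not part of the original example; any other cleaning step would slot in the same way.

# lowercase the externally produced tokens before matching (illustrative step)
tksLower <- lapply(tks, stri_trans_tolower)
compute_sentiment(as.character(corpus), lexicons, how = "counts", tokens = tksLower)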
To provide your own tokenized input at the sentence level, beware that you need to provide a list of lists, and set do.sentence = TRUE. See one of the next examples for more info about the sentence-level sentiment calculation.
sentences <- tokens(corpus, what = "sentence")
tks2 <- lapply(sentences, function(s) as.list(tokens(s, what = "word")))
compute_sentiment(as.character(corpus), lexicons[2:3], how = "counts",
                  tokens = tks2, do.sentence = TRUE)
We offer three main approaches to the lexicon-based sentiment calculation: account only for unigrams (simple), consider valence shifting from a bigram perspective (valence), or consider valence shifting in a cluster of words around a detected polarized word (cluster). Read the vignette for more details! Here we demonstrate how to plot the different approaches for comparison.
txts <- usnews[1:200, "texts"]
vals <- list_valence_shifters[["en"]]

lexValence <- sento_lexicons(list(nrc = hash_sentiment_nrc), vals[, c("x", "y")])
lexCluster <- sento_lexicons(list(nrc = hash_sentiment_nrc), vals[, c("x", "t")])

s1 <- compute_sentiment(txts, head(lexValence, -1))$nrc
s2 <- compute_sentiment(txts, lexValence)$nrc
s3 <- compute_sentiment(txts, lexCluster)$nrc

s <- cbind(simple = s1, valence = s2, cluster = s3)

matplot(s, type = "l", lty = 1, ylab = "Sentiment", xlab = "Text")
legend("topright", col = 1:3, legend = colnames(s), lty = 1, cex = 0.7, bty = "n")
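Beyond the plot, a quick numerical comparison of the three approaches can help; the pairwise correlations of the score columns are one simple sketch:

# pairwise correlations between the simple, valence and cluster scores
round(cor(s), 3)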
A textual sentiment computation at the sentence level, starting from a document-level corpus, and normalized by dividing by the number of detected polarized words. Subsequently, the resulting sentence-level scores are aggregated into document-level scores.
corpus <- sento_corpus(usnews[, 1:3])
s <- compute_sentiment(
  corpus,
  sento_lexicons(list_lexicons["LM_en"]),
  how = "proportionalPol",
  do.sentence = TRUE
)
s
sDocs <- aggregate(s, ctr_agg(howDocs = "proportional"), do.full = FALSE)
sDocs
From these sentiment scores, we extract the 4 documents with the most positive sentiment.
peakDocsPos <- peakdocs(sDocs, n = 4, type = "pos")
peakDocsPos
corpusPeaks <- corpus_subset(corpus, docnames(corpus) %in% peakDocsPos)
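To eyeball what these peak documents actually say, one can peek at the first characters of their texts; a minimal sketch:

# first 100 characters of each peak document
substr(as.character(corpusPeaks), 1, 100)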
Sentiment computation compared with the quanteda package

The term frequency-inverse document frequency statistic is widely used to quantify term importance in a corpus. Its use extends to sentiment calculation simply by adding the polarity of the words to the equation. This example shows that the tf-idf sentiment output from sentometrics is the same as the output obtained using the text mining package quanteda.
# ensure same tokenization for full comparability
txts <- usnews$texts[1:100]
toks <- stri_split_boundaries(stri_trans_tolower(txts),
                              type = "word", skip_word_none = TRUE)

# pick a lexicon
lexIn <- list_lexicons$GI_en

# quanteda tf-idf sentiment calculation
dfmQ <- dfm(as.tokens(toks)) %>%
  dfm_tfidf(k = 1)

posWords <- lexIn[y == 1, x]
negWords <- lexIn[y == -1, x]

posScores <- rowSums(dfm_select(dfmQ, posWords))
negScores <- rowSums(dfm_select(dfmQ, negWords))
q <- unname(posScores - negScores)

# sentometrics tf-idf sentiment calculation
lex <- sento_lexicons(list(L = lexIn))
s <- compute_sentiment(txts, lex, how = "TFIDF", tokens = toks)[["L"]]
Are they equal?
all.equal(q, s)
Multi-language textual sentiment analysis requires only a few modifications to the corpus and lexicons setup. One first needs a non-numeric "language" feature integrated into the sento_corpus object. This feature regulates which of the input lexicons, supplied per language, are applied to which texts, based on their associated language tag.
corpusdt <- data.table(
  id = as.character(1:3),
  date = Sys.Date(),
  texts = c("Dit is goed. Bien, good.",
            "Çeci est bien. Goed, good.",
            "This is good. Goed, bien."),
  language = c("nl", "fr", "en"),
  ftr = 0.5 # a numeric feature
)
corpus <- sento_corpus(corpusdt)
corpus
lexicons <- list(
  nl = sento_lexicons(list("lexNL" = data.frame("goed", 1))),
  fr = sento_lexicons(list("lexFR" = data.frame("bien", 1))),
  en = sento_lexicons(list("lexEN" = data.frame("good", 1)))
)
s <- compute_sentiment(corpus, lexicons, "counts")
s