knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.width = 7, fig.height = 4, fig.align = "center")
This tutorial provides a guide on how to perform textual sentiment computation with the sentometrics package.
Preparation
library("sentometrics") library("quanteda") library("tm") library("stringi") library("data.table") library("lexicon") data("usnews") data("list_lexicons") data("list_valence_shifters")
A simple calculation of sentiment. Since the two lexicons used are so-called binary, every final score, with the "counts" option, is the difference between the number of positive lexicon words (those with a score of 1) and the number of negative lexicon words (those with a score of -1) detected in the text.
s <- compute_sentiment(
  usnews[["texts"]],
  sento_lexicons(list_lexicons[c("GI_en", "LM_en")]),
  how = "counts"
)
s
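As a minimal cross-check of this definition, one can count the positive and negative lexicon matches by hand for a single text. This is a sketch: the boundary-based tokenization below is an assumption and may deviate slightly from the tokenizer used internally by compute_sentiment().

# manual counts-based score for the first text (tokenization is an assumption)
lexGI <- sento_lexicons(list_lexicons["GI_en"])[["GI_en"]]
tksFirst <- stri_split_boundaries(
  stri_trans_tolower(usnews[["texts"]][1]),
  type = "word", skip_word_none = TRUE
)[[1]]
sum(tksFirst %in% lexGI[y == 1, x]) - sum(tksFirst %in% lexGI[y == -1, x])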
Sentiment computation with a sento_corpus object

The same simple calculation as above, but using a sento_corpus object and the metadata features in the corpus. A "date" variable is always part of any sento_corpus object, and is also considered a docvar.
corpus <- sento_corpus(usnews)
corpus
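Since a sento_corpus builds on a quanteda corpus, a quick way to inspect the document variables, including the mandatory "date" column and the numeric features from usnews, is quanteda's docvars() (a sketch):

# peek at the document variables travelling along with the corpus
head(docvars(corpus))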
lexicons <- sento_lexicons(list_lexicons[c("GI_en", "LM_en", "HENRY_en")])
s <- compute_sentiment(corpus, lexicons, how = "counts")
head(s)
Sentiment computation with a tm SimpleCorpus object

Another simple sentiment calculation, this time using a tm package corpus object. Super flexible! The output is deliberately slightly different, as the scores are divided by the total number of words.
corpus <- SimpleCorpus(VectorSource(usnews[["texts"]]))
corpus
lexicons <- sento_lexicons(list_lexicons[c("GI_en", "LM_en", "HENRY_en")])
s <- compute_sentiment(corpus, lexicons, how = "proportional")
s
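As a sanity check of that normalization, and assuming the output keeps its word_count column, the "proportional" scores should equal the "counts" scores divided by the per-document word counts (a sketch):

# "proportional" = "counts" / word_count, checked for the GI_en lexicon
sCnt <- compute_sentiment(corpus, lexicons, how = "counts")
all.equal(s$GI_en, sCnt$GI_en / sCnt$word_count)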
This example showcases some more flexibility. You can tokenize your corpus outside the sentiment computation function call, to control exactly which words the lexicons will match against.
corpus <- sento_corpus(usnews)
tks <- as.list(tokens(corpus, what = "fastestword"))
lexicons <- sento_lexicons(list_lexicons[c("GI_en", "LM_en", "HENRY_en")])
compute_sentiment(as.character(corpus), lexicons, how = "counts", tokens = tks)
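To illustrate that control, one could for instance lowercase the tokens before the call so they line up better with the lowercase lexicon entries. This is a sketch, not part of the original example; any other cleaning step would slot in the same way.

# lowercase the externally produced tokens before matching (illustrative step)
tksLower <- lapply(tks, stri_trans_tolower)
compute_sentiment(as.character(corpus), lexicons, how = "counts", tokens = tksLower)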
To provide your own tokenized input at the sentence level, beware that you need to provide a list of lists, and set do.sentence = TRUE. See one of the next examples for more info about the sentence-level sentiment calculation.
sentences <- tokens(corpus, what = "sentence")
tks2 <- lapply(sentences, function(s) as.list(tokens(s, what = "word")))
compute_sentiment(as.character(corpus), lexicons[2:3], how = "counts",
                  tokens = tks2, do.sentence = TRUE)
We offer three main approaches to the lexicon-based sentiment calculation: account only for unigrams (simple), consider valence shifting from a bigram perspective (valence), or consider valence shifting in a cluster of words around a detected polarized word (cluster). Read the vignette for more details! Here we demonstrate how to plot the different approaches for comparison.
txts <- usnews[1:200, "texts"]
vals <- list_valence_shifters[["en"]]

lexValence <- sento_lexicons(list(nrc = hash_sentiment_nrc), vals[, c("x", "y")])
lexCluster <- sento_lexicons(list(nrc = hash_sentiment_nrc), vals[, c("x", "t")])

s1 <- compute_sentiment(txts, head(lexValence, -1))$nrc
s2 <- compute_sentiment(txts, lexValence)$nrc
s3 <- compute_sentiment(txts, lexCluster)$nrc

s <- cbind(simple = s1, valence = s2, cluster = s3)

matplot(s, type = "l", lty = 1, ylab = "Sentiment", xlab = "Text")
legend("topright", col = 1:3, legend = colnames(s), lty = 1, cex = 0.7, bty = "n")
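Beyond the plot, a quick numerical comparison of the three approaches can help; the pairwise correlations of the score columns are one simple sketch:

# pairwise correlations between the simple, valence and cluster scores
round(cor(s), 3)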
A textual sentiment computation at the sentence level, starting from a document-level corpus, and normalized by dividing by the number of detected polarized words. Subsequently, the resulting sentence-level scores are aggregated into document-level scores.
corpus <- sento_corpus(usnews[, 1:3])
s <- compute_sentiment(
  corpus,
  sento_lexicons(list_lexicons["LM_en"]),
  how = "proportionalPol",
  do.sentence = TRUE
)
s
sDocs <- aggregate(s, ctr_agg(howDocs = "proportional"), do.full = FALSE)
sDocs
From these sentiment scores, we extract the 4 documents with the most positive sentiment.
peakDocsPos <- peakdocs(sDocs, n = 4, type = "pos")
peakDocsPos
corpusPeaks <- corpus_subset(corpus, docnames(corpus) %in% peakDocsPos)
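To eyeball what these peak documents actually say, one can peek at the first characters of their texts; a minimal sketch:

# first 100 characters of each peak document
substr(as.character(corpusPeaks), 1, 100)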
Sentiment computation compared with the quanteda package

The term frequency-inverse document frequency statistic is widely used to quantify term importance in a corpus. Its use extends to sentiment calculation simply by adding the polarity of the words to the equation. This example shows that the tf-idf sentiment output from sentometrics is the same as the output obtained using the text mining package quanteda.
# ensure same tokenization for full comparability
txts <- usnews$texts[1:100]
toks <- stri_split_boundaries(stri_trans_tolower(txts),
                              type = "word", skip_word_none = TRUE)

# pick a lexicon
lexIn <- list_lexicons$GI_en

# quanteda tf-idf sentiment calculation
dfmQ <- dfm(as.tokens(toks)) %>%
  dfm_tfidf(k = 1)

posWords <- lexIn[y == 1, x]
negWords <- lexIn[y == -1, x]

posScores <- rowSums(dfm_select(dfmQ, posWords))
negScores <- rowSums(dfm_select(dfmQ, negWords))
q <- unname(posScores - negScores)

# sentometrics tf-idf sentiment calculation
lex <- sento_lexicons(list(L = lexIn))
s <- compute_sentiment(txts, lex, how = "TFIDF", tokens = toks)[["L"]]
Are they equal?
all.equal(q, s)
Multi-language textual sentiment analysis requires only a few modifications to the corpus and lexicons setup. One first needs a non-numeric "language" feature integrated into the sento_corpus object. This feature regulates which of the input lexicons, supplied per language, are applied to which texts, based on their associated language tag.
corpusdt <- data.table(
  id = as.character(1:3),
  date = Sys.Date(),
  texts = c("Dit is goed. Bien, good.",
            "Çeci est bien. Goed, good.",
            "This is good. Goed, bien."),
  language = c("nl", "fr", "en"),
  ftr = 0.5 # a numeric feature
)
corpus <- sento_corpus(corpusdt)
corpus
lexicons <- list(
  nl = sento_lexicons(list("lexNL" = data.frame("goed", 1))),
  fr = sento_lexicons(list("lexFR" = data.frame("bien", 1))),
  en = sento_lexicons(list("lexEN" = data.frame("good", 1)))
)
s <- compute_sentiment(corpus, lexicons, "counts")
s