View source: R/sentiment_engines.R

Description
Given a corpus of texts, computes sentiment per document or sentence using the valence shifting augmented bag-of-words approach, based on the lexicons provided and a choice of aggregation across words.
Usage

compute_sentiment(
  x,
  lexicons,
  how = "proportional",
  tokens = NULL,
  do.sentence = FALSE,
  nCore = 1
)
Arguments

x: either a sento_corpus object created with the sento_corpus function, a quanteda corpus object, a tm SimpleCorpus object, a tm VCorpus object, or a character vector. Only a sento_corpus object carries a date dimension.

lexicons: a sento_lexicons object created with the sento_lexicons function.

how: a single character vector defining how word-level sentiment is aggregated within documents (or sentences), for instance "proportional" (the default), "counts", "proportionalPol", "proportionalSquareRoot" or "TFIDF" as used in the Examples; see the sketch after this list for retrieving the available options.

tokens: a list of already tokenized documents (or, when do.sentence = TRUE, a list of lists of tokenized sentences), to supply your own tokenization of the texts in x; the tokens should be unigrams. If tokens = NULL (default), the corpus is tokenized internally (see Details).

do.sentence: a logical to indicate whether sentiment should be computed sentence-by-sentence rather than per document. By default do.sentence = FALSE.

nCore: a positive numeric to parallelize the sentiment computation across texts; nCore = 1 (default) implies no parallelization.
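The options accepted by the how argument can be listed from within the package; a minimal sketch, assuming sentometrics is installed and attached:

library("sentometrics")
get_hows()$words  # within-document aggregation options usable for the 'how' argument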
Details

For a separate calculation of positive (resp. negative) sentiment, provide distinct positive (resp. negative) lexicons (see the do.split option in the sento_lexicons function). All NAs are converted to 0, under the assumption that this is equivalent to no sentiment. By default tokens = NULL, meaning the corpus is internally tokenized as unigrams, with punctuation and numbers (but not stopwords) removed. All tokens are converted to lowercase, in line with what the sento_lexicons function does for the lexicons and valence shifters. Word counts are based on that same tokenization.
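As a small sketch of the do.split option mentioned above (assuming the usnews texts and list_lexicons data are loaded as in the Examples below), splitting a lexicon produces separate positive and negative sub-lexicons, so positive and negative sentiment end up in distinct score columns:

# split the LM_en lexicon into a positive and a negative part
lSplit <- sento_lexicons(list_lexicons["LM_en"], do.split = TRUE)
sentSplit <- compute_sentiment(usnews[["texts"]][1:50], lSplit, how = "counts")
colnames(sentSplit)  # one score column per positive and per negative sub-lexicon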
Value

If x is a sento_corpus object: a sentiment object, i.e., a data.table with an "id", a "date" and a "word_count" column, and all lexicon-feature sentiment scores columns. The tokenized sentences are not provided but can be obtained as stringi::stri_split_boundaries(texts, type = "sentence"). A sentiment object can be aggregated (into time series) with the aggregate.sentiment function.

If x is a quanteda corpus object: a sentiment scores data.table with an "id" and a "word_count" column, and all lexicon-feature sentiment scores columns.

If x is a tm SimpleCorpus object, a tm VCorpus object, or a character vector: a sentiment scores data.table with an auto-created "id" column, a "word_count" column, and all lexicon sentiment scores columns.

When do.sentence = TRUE, an additional "sentence_id" column alongside the "id" column is added.
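Below is a brief, hedged sketch of inspecting and aggregating the returned object (reusing corpusSample and l1 from the Examples; the ctr_agg control shown here is an assumed, typical aggregation setup rather than part of this function):

sent <- compute_sentiment(corpusSample, l1, how = "proportional")
head(sent)  # "id", "date", "word_count" and the lexicon-feature score columns

# recover the sentences from the original texts if needed
sentences <- stringi::stri_split_boundaries(usnews[["texts"]][1:5], type = "sentence")

# aggregate the document-level sentiment into monthly time series
ctr <- ctr_agg(howTime = "linear", by = "month", lag = 3)
measures <- aggregate(sent, ctr)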
Calculation

If the lexicons argument has no "valence" element, the sentiment computed corresponds to simple unigram matching with the lexicons [unigrams approach]. If valence shifters are included in lexicons with a corresponding "y" column, the polarity of a word detected from a lexicon gets multiplied with the associated value of a valence shifter if it appears right before the detected word (examples: not good or can't defend) [bigrams approach]. If the valence table contains a "t" column, valence shifters are searched for in a cluster centered around a detected polarity word [clusters approach]. The latter approach is a simplified version of the one utilized by the sentimentr package. A cluster amounts to four words before and two words after a polarity word. A cluster never overlaps with a preceding one. Roughly speaking, the polarity of a cluster is calculated as n(1 + 0.80d)S + ∑s. The polarity score of the detected word is S, s represents the polarities of eventual other sentiment words, and d is the difference between the number of amplifiers (t = 2) and the number of deamplifiers (t = 3). If there is an odd number of negators (t = 1), n = -1 and amplifiers are counted as deamplifiers, else n = 1.
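As a purely illustrative plug-in of the formula above (not code from the package), take the cluster "not very good", assuming "good" has a lexicon score of 1, "not" is a negator (t = 1) and "very" an amplifier (t = 2):

# illustration only: n(1 + 0.80d)S + sum(s) for the cluster "not very good"
S <- 1       # lexicon polarity of "good"
s <- 0       # no other sentiment words in the cluster
n <- -1      # one (odd) negator: "not"
d <- 0 - 1   # with n = -1, the amplifier "very" counts as a deamplifier
n * (1 + 0.80 * d) * S + s  # -0.2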
The sentence-level sentiment calculation approaches each sentence as if it were a document. Depending on the input, either the unigrams, bigrams or clusters approach is used. The clusters approach is enhanced to follow more closely the default sentimentr settings: a cluster of five words before and two words after a polarized word, limited to the words after the previous comma and before the next comma. Adversative conjunctions (t = 4) are accounted for here. The cluster is reweighted based on the value 1 + 0.25adv, where adv is the difference between the number of adversative conjunctions found before and after the polarized word.
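A similar back-of-the-envelope illustration of the sentence-level reweighting (again not package code): with one adversative conjunction such as "but" before the polarized word and none after, adv = 1, so the cluster polarity is scaled as follows:

adv <- 1 - 0      # adversative conjunctions before minus those after the polarized word
1 + 0.25 * adv    # the cluster is reweighted by a factor of 1.25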
Author(s)

Samuel Borms, Jeroen Van Pelt, Andres Algaba
Examples

data("usnews", package = "sentometrics")
txt <- system.file("texts", "txt", package = "tm")
reuters <- system.file("texts", "crude", package = "tm")
data("list_lexicons", package = "sentometrics")
data("list_valence_shifters", package = "sentometrics")

l1 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")])
l2 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                     list_valence_shifters[["en"]])
l3 <- sento_lexicons(list_lexicons[c("LM_en", "HENRY_en")],
                     list_valence_shifters[["en"]][, c("x", "t")])

# from a sento_corpus object - unigrams approach
corpus <- sento_corpus(corpusdf = usnews)
corpusSample <- quanteda::corpus_sample(corpus, size = 200)
sent1 <- compute_sentiment(corpusSample, l1, how = "proportionalPol")

# from a character vector - bigrams approach
sent2 <- compute_sentiment(usnews[["texts"]][1:200], l2, how = "counts")

# from a corpus object - clusters approach
corpusQ <- quanteda::corpus(usnews, text_field = "texts")
corpusQSample <- quanteda::corpus_sample(corpusQ, size = 200)
sent3 <- compute_sentiment(corpusQSample, l3, how = "counts")

# from an already tokenized corpus - using the 'tokens' argument
toks <- as.list(quanteda::tokens(corpusQSample, what = "fastestword"))
sent4 <- compute_sentiment(corpusQSample, l1[1], how = "counts", tokens = toks)

# from a SimpleCorpus object - unigrams approach
scorp <- tm::SimpleCorpus(tm::DirSource(txt))
sent5 <- compute_sentiment(scorp, l1, how = "proportional")

# from a VCorpus object - unigrams approach
## in contrast to what as.sento_corpus(vcorp) would do, the
## sentiment calculator handles multiple character vectors within
## a single corpus element as separate documents
vcorp <- tm::VCorpus(tm::DirSource(reuters))
sent6 <- compute_sentiment(vcorp, l1)

# from a sento_corpus object - unigrams approach with tf-idf weighting
sent7 <- compute_sentiment(corpusSample, l1, how = "TFIDF")

# sentence-by-sentence computation
sent8 <- compute_sentiment(corpusSample, l1, how = "proportionalSquareRoot",
                           do.sentence = TRUE)

# from a (fake) multilingual corpus
usnews[["language"]] <- "en" # add language column
usnews$language[1:100] <- "fr"
lEn <- sento_lexicons(list("FEEL_en" = list_lexicons$FEEL_en_tr,
                           "HENRY" = list_lexicons$HENRY_en),
                      list_valence_shifters$en)
lFr <- sento_lexicons(list("FEEL_fr" = list_lexicons$FEEL_fr),
                      list_valence_shifters$fr)
lexicons <- list(en = lEn, fr = lFr)
corpusLang <- sento_corpus(corpusdf = usnews[1:250, ])
sent9 <- compute_sentiment(corpusLang, lexicons, how = "proportional")