knitr::opts_chunk$set(collapse = TRUE, 
                      comment = "##")

1. Introduction

Built on the quanteda text analysis package, quanteda.dictionaries provides dictionaries for text analysis and associated utilities. In this vignette, we show how to replicate the main features of the LIWC software using packages from the Quanteda Initiative. If you prefer a complete, stand-alone user interface, you should purchase and use the LIWC standalone software. The package also contains dictionaries to homogenize (or homogenise!) the spellings of US and British English.

2. Using dictionaries with quanteda

library("quanteda")
library("quanteda.dictionaries")

2.1 Accessing texts

To access texts for scoring sentiment, we use the movie reviews corpus from Pang, Lee, and Vaithyanathan (2002), found in the quanteda.textmodels package; this requires that you have installed that package. (If you want to use your own texts, we recommend the readtext package for reading them in; see the sketch below.)

data(data_corpus_moviereviews, package = "quanteda.textmodels")
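If you wanted to score your own texts instead, a minimal sketch using readtext might look like this (the file path is hypothetical):

# read all .txt files from a (hypothetical) folder and build a corpus
library("readtext")
my_texts <- readtext("path/to/your/texts/*.txt")
corp_own <- corpus(my_texts)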

2.2 Analyse sentiment

Next, we analyse the sentiment of the texts. If you have purchased the LIWC dictionary, you can load it as a quanteda-formatted dictionary in the following way.

liwc2007dict <- dictionary(file = "LIWC2007.cat", format = "wordstat")
tail(liwc2007dict, 1)
## Dictionary object with 1 primary key entry and 2 nested levels.
## - [SPOKEN CATEGORIES]:
##   - [ASSENT]:
##     - absolutely, agree, ah, alright*, aok, aw, awesome, cool, duh, ha, hah, haha*, heh*, hm*, huh, lol, mm*, oh, ok, okay, okey*, rofl, uhhu*, uhuh, yah, yay, yea, yeah, yep*, yes, yup
##   - [NON-FLUENCIES]:
##     - er, hm*, sigh, uh, um, umm*, well, zz*
##   - [FILLERS]:
##     - blah, idon'tknow, idontknow, imean, ohwell, oranything*, orsomething*, orwhatever*, rr*, yakn*, ykn*, youknow*

While you can use the LIWC dictionary if you have purchased it, in this example we use the Lexicoder Sentiment Dictionary 2015 from Young and Soroka (2012). The liwcalike() function from quanteda.dictionaries produces output similar to that of the LIWC stand-alone software. We apply it to a collection of 2,000 movie reviews classified as "positive" or "negative", a corpus which comes with quanteda.textmodels.

output_lsd <- liwcalike(data_corpus_moviereviews, data_dictionary_LSD2015)
head(output_lsd)

Next, we can use the negative and positive columns to estimate the net sentiment of each text by subtracting the negative score from the positive score.

output_lsd$net_positive <- output_lsd$positive - output_lsd$negative
output_lsd$sentiment <- docvars(data_corpus_moviereviews, "sentiment")

library("ggplot2")
# set ggplot2 theme
theme_set(theme_minimal())
ggplot(output_lsd, aes(x = sentiment, y = net_positive)) +
    geom_boxplot() +
    labs(x = "Classified sentiment",
         y = "Net positive sentiment",
         title = "Lexicoder Sentiment Dictionary 2015")

This is only meant as an example, since the Lexicoder 2015 dictionary was developed for classifying political language, not for the purpose of more general sentiment analysis. To access more nuanced sentiment dictionaries, see the quanteda.sentiment package, which also includes functions for computing polarity- and valence-based net sentiment scores.
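For instance, a polarity-based score using the same LSD2015 keys could be computed as follows (a minimal sketch, assuming you have installed quanteda.sentiment):

library("quanteda.sentiment")
# polarity score (logit-scaled by default) from the positive and negative
# keys; the copy of data_dictionary_LSD2015 shipped with quanteda.sentiment
# has its polarity pre-set
sent_pol <- textstat_polarity(data_corpus_moviereviews,
                              quanteda.sentiment::data_dictionary_LSD2015)
head(sent_pol)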

2.3 User-created dictionaries

The LIWC software allows users to build custom dictionaries for specific research questions. With quanteda's dictionary() function, we can do the same.

dict <- dictionary(list(positive = c("great", "fantastic", "wonderful"),
                        negative = c("bad", "horrible", "terrible")))

output_custom_dict <- liwcalike(data_corpus_moviereviews, dict)

head(output_custom_dict)

2.4 Apply dictionary to segmented text

The LIWC software provides easy text segmentation through its GUI. With functions from the quanteda package, you can segment texts yourself using corpus_reshape() or corpus_segment() (a corpus_segment() sketch follows at the end of this section). In the following example, we divide the US presidential inaugural speeches into paragraphs and apply a sentiment dictionary.

ndoc(data_corpus_inaugural)

The initial inaugural corpus consists of 58 documents (one document per speech).

inaug_corpus_paragraphs <- corpus_reshape(data_corpus_inaugural, to = "paragraphs")
ndoc(inaug_corpus_paragraphs)

When we divide the corpus into paragraphs, the number of documents increases to 1513. Next, we can apply the liwcalike() function to the reshaped corpus using the LSD2015 dictionary.

output_paragraphs <- liwcalike(inaug_corpus_paragraphs, data_dictionary_LSD2015)
head(output_paragraphs)
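If your texts contain explicit delimiters rather than paragraph breaks, corpus_segment() splits on a user-supplied pattern instead; here is a minimal sketch with a made-up tagged text:

# segment a made-up tagged text on its "##" markers
corp_tagged <- corpus("##INTRO This is the introduction. ##DOC1 This is the first document.")
corp_segmented <- corpus_segment(corp_tagged, pattern = "##*")
as.character(corp_segmented)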

2.5 Export output to a spreadsheet

The LIWC software allows you to export the output from the dictionary analysis to a spreadsheet. We can do the same with write.csv(), or use packages such as haven or rio to save the data.frame in other file formats.

# save as csv file
write.csv(output_custom_dict, file = "output_dictionary.csv",
          fileEncoding = "utf-8")

# save as Excel file (xlsx)
rio::export(output_custom_dict, file = "output_dictionary.xlsx")

3. Homogeni[zs]e British and US English

quanteda.dictionaries contains an English UK-US spelling conversion dictionary, which makes it possible to homogeni[zs]e English spellings by converting the spelling variants of one variety to the other. The dictionary contains 1,800 roots and derivatives, based on a word list available online.

txt <- c(uk = "endeavour to prioritise honour over esthetics",
         us = "endeavor to prioritize honor over aesthetics")
toks <- quanteda::tokens(txt)

If we want to change words from British English to US English, we can use tokens_lookup() from the quanteda package in combination with data_dictionary_uk2us. Setting exclusive = FALSE keeps tokens that do not match the dictionary, and capkeys = FALSE prevents the replacement keys from being uppercased.

quanteda::tokens_lookup(toks, data_dictionary_uk2us, 
                        exclusive = FALSE, capkeys = FALSE)

To make all spellings British, we use data_dictionary_us2uk instead.

quanteda::tokens_lookup(toks, data_dictionary_us2uk,
                        exclusive = FALSE, capkeys = FALSE)

We can see the difference that this makes when converting the texts to a document-feature matrix:

# original dfm
quanteda::dfm(toks)

# homogeni[zs]ed dfm
quanteda::dfm(quanteda::tokens_lookup(toks, data_dictionary_uk2us,
                                      exclusive = FALSE, capkeys = FALSE))

References

Mohammad, S., & Turney, P. (2013). Crowdsourcing a Word-Emotion Association Lexicon. Computational Intelligence, 29(3), 436-465.

Pang, B., Lee, L., & Vaithyanathan, S. (2002). Thumbs up? Sentiment classification using machine learning techniques. In Proceedings of the ACL-02 Conference on Empirical Methods in Natural Language Processing (EMNLP '02), Vol. 10, 79-86. Association for Computational Linguistics. https://doi.org/10.3115/1118693.1118704

Pennebaker, J.W., Chung, C.K., Ireland, M., Gonzales, A., & Booth, R.J. (2007). The Development and Psychometric Properties of LIWC2007 [Software manual]. Austin, TX. https://www.liwc.app/

Stone, P.J., Dunphy, D.C., & Smith, M.S. (1966). The General Inquirer: A Computer Approach to Content Analysis. Cambridge, MA: MIT Press.

Young, L., & Soroka, S. (2012). Affective News: The Automated Coding of Sentiment in Political Texts. Political Communication, 29(2), 205-231. https://doi.org/10.1080/10584609.2012.671234


