knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.width = 7, fig.height = 4, fig.align = "center")
This tutorial provides a guide on how to flexibly aggregate a qualitative corpus into many time series indices.
Preparation
library("sentometrics")
library("ggplot2")
library("data.table")

data("usnews")
data("list_lexicons")
data("list_valence_shifters")
Aggregating document-level sentiment scores into time series only requires specifying a few parameters for the weighting and the time frequency.
corpus <- sento_corpus(usnews)
s <- compute_sentiment(
  corpus,
  sento_lexicons(list_lexicons[c("GI_en", "LM_en")]),
  how = "counts"
)
ctr <- ctr_agg(howDocs = "proportional", howTime = "equal_weight", by = "month", lag = 6)
measures <- aggregate(s, ctr)
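To make the two weighting choices concrete, here is a minimal base R sketch on toy data. It is not the package's internal code. It assumes that "proportional" weights each document by its share of words within a period, and that "equal_weight" is a simple moving average over `lag` periods. The data frame `docs` and all its values are made up for illustration.

```r
# Toy document-level data; all values are made up for illustration.
docs <- data.frame(
  month = c("2020-01", "2020-01", "2020-02", "2020-02", "2020-03"),
  words = c(100, 300, 200, 200, 50),
  score = c(0.2, -0.1, 0.4, 0.0, -0.3)
)

# Step 1 (assumed meaning of "proportional"): within each month,
# weight every document by its share of the month's total word count.
monthly <- sapply(split(docs, docs$month), function(d) {
  sum(d$score * d$words / sum(d$words))
})

# Step 2 (assumed meaning of "equal_weight"): smooth across time with
# equal weights over the current and the previous `lag - 1` periods.
lag <- 2
smoothed <- stats::filter(monthly, rep(1 / lag, lag), sides = 1)

monthly
as.numeric(smoothed) # the first value is NA: not enough history yet
```

The first `lag - 1` smoothed values are missing because the window is not yet full, which is why a longer lag shortens the usable series.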
The resulting measures can be plotted all at once, or grouped by one of the three time series dimensions (the corpus features, the lexicons and the time weighting schemes).
plot(measures)
plot(measures, "features")
plot(measures, "lexicons")
plot(measures, "time")
The aggregation into sentiment time series does not have to be done in two steps. The one-step approach below is recommended because it is simpler.
corpus <- sento_corpus(usnews)
lexicons <- sento_lexicons(list_lexicons[c("GI_en", "LM_en", "HENRY_en")])
ctr <- ctr_agg(howWithin = "counts", howDocs = "proportional",
               howTime = "linear", by = "week", lag = 14)
measures <- sento_measures(corpus, lexicons, ctr)
measures
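As a rough illustration of what a linear time weighting over 14 weeks could look like, the sketch below builds linearly increasing weights that sum to one and applies them to a toy series. The helper `linear_weights()` is hypothetical and not part of sentometrics, and the assumption that the largest weight falls on the most recent period should be checked against the package documentation.

```r
# Hypothetical helper: linearly increasing weights over `lag` periods,
# normalized to sum to one.
# Assumption: the last weight applies to the most recent period.
linear_weights <- function(lag) {
  w <- seq_len(lag)
  w / sum(w)
}

w <- linear_weights(14)
round(w, 3)

# Apply the weights to a toy weekly series (made-up values): the latest
# smoothed value is a weighted average of the last 14 observations.
x <- sin(seq_len(30) / 3)
latest <- sum(tail(x, 14) * w)
latest
```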
Similar to extracting the peak documents in a sentiment table, we extract here the peak dates in the sentiment time series matrix. Detected below are the 5 dates where average sentiment across all sentiment measures is lowest.
peakDatesNeg <- peakdates(measures, n = 5, type = "neg", do.average = TRUE)
dtPeaks <- as.data.table(subset(measures, date %in% peakDatesNeg))
data.table(dtPeaks[, "date"], avg = rowMeans(dtPeaks[, -1]))
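The idea behind picking the lowest-average dates can be sketched in base R on a toy matrix. This is only an illustration of the logic (average across measures, then rank), not the implementation of `peakdates()`. The objects `dates` and `m` are made up.

```r
# Toy matrix of sentiment measures (made-up values), one row per date.
dates <- as.Date("2020-01-01") + 7 * 0:5
m <- cbind(a = c(0.1, -0.4, 0.3, -0.2, 0.0, -0.5),
           b = c(0.2, -0.2, 0.1, -0.6, 0.1,  0.1))

# Average across all measures for each date.
avg <- rowMeans(m)

# Keep the n dates with the lowest average sentiment.
n <- 2
peaks <- dates[order(avg)[seq_len(n)]]
peaks
```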
Sentiment measures can be further aggregated across any of the three dimensions. The computed time series are averages across the relevant measures.
corpus <- sento_corpus(usnews)
lexicons <- sento_lexicons(list_lexicons[c("GI_en", "LM_en", "HENRY_en")])
ctr <- ctr_agg(howWithin = "counts", howDocs = "proportional",
               howTime = c("linear", "equal_weight"), by = "week", lag = 14)
measures <- sento_measures(corpus, lexicons, ctr)
measuresAgg <- aggregate(measures,
                         features = list("journal" = c("wsj", "wapo")),
                         time = list("linequal" = c("linear", "equal_weight")))
get_dimensions(measuresAgg) # inspect the components of the three dimensions
With the subset() function, a sento_measures object can be subsetted based on a condition, and time series with certain dimension components can be selected or deleted.
m1 <- subset(measuresAgg, 1:100) # first 100 weeks
m2 <- subset(measuresAgg, date %in% tail(get_dates(measuresAgg), 100)) # last 100 weeks
m3 <- subset(measuresAgg, select = "journal")
m4 <- subset(measuresAgg, delete = list(c("economy", "GI_en"), c("noneconomy", "HENRY_en")))
At its simplest, all the computed sentiment measures can be condensed into a few global sentiment time series. The computation can also be weighted.
corpus <- sento_corpus(usnews)
lexicons <- sento_lexicons(list_lexicons[c("GI_en", "LM_en", "HENRY_en")])
ctr <- ctr_agg(howWithin = "counts", howDocs = "proportional",
               howTime = c("linear", "equal_weight"), by = "week", lag = 14)
measures <- sento_measures(corpus, lexicons, ctr)
measuresGlobal <- aggregate(measures, do.global = TRUE)
measuresGlobal
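To show what a weighted condensation means in principle, the following base R sketch combines two toy series into one index using hypothetical weights. It is a simplified illustration along a single dimension (the lexicons), not the computation the package performs. The data frame `measures_toy` and the weight vector `w` are made up.

```r
# Toy measures (made-up values), one column per lexicon.
measures_toy <- data.frame(
  date  = as.Date("2020-01-01") + 7 * 0:2,
  GI_en = c(0.10, 0.20, -0.10),
  LM_en = c(0.30, -0.20, 0.10)
)

# Hypothetical weights per lexicon, chosen to sum to one.
w <- c(GI_en = 0.25, LM_en = 0.75)

# Weighted average across the lexicon columns for each date.
global <- as.numeric(as.matrix(measures_toy[, names(w)]) %*% w)
data.frame(date = measures_toy$date, global = global)
```

Setting all weights equal reduces this to a plain average across the columns.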
The output in this case is not a sentometrics sento_measures object, but a data.table object. The code below plots it using the ggplot2 package.
ggplot(melt(measuresGlobal, id.vars = "date")) +
  aes(x = date, y = value, color = variable) +
  geom_line() +
  scale_x_date(name = "Date", date_labels = "%m-%Y") +
  scale_y_continuous(name = "Sentiment") +
  theme_bw() +
  sentometrics:::plot_theme(legendPos = "top") # a trick to make the plot look nicer