knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.width = 7, fig.height = 4, fig.align = "center")

This tutorial shows how to create, enrich, transform, and analyze a sento_corpus object. A sento_corpus object is special because it always has a date column and numeric metadata features.

Preparation  

library("sentometrics")
library("quanteda")

data("usnews")
data("list_lexicons")
data("list_valence_shifters")
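
To make the two requirements from the introduction concrete, here is a minimal sketch that builds a sento_corpus from a toy data frame. The data frame `toy` and its feature columns `finance` and `macro` are hypothetical and only serve as illustration; the column names `id`, `date`, and `texts` are assumed to be the ones sento_corpus() expects, with dates in "yyyy-mm-dd" format.

```r
library("sentometrics")

# Hypothetical toy data: three short documents with two numeric features.
toy <- data.frame(
  id = c("d1", "d2", "d3"),
  date = c("2021-01-04", "2021-01-05", "2021-02-01"),
  texts = c("Markets rallied on strong earnings.",
            "Policy uncertainty weighed on stocks.",
            "The economy grew faster than expected."),
  finance = c(1, 1, 0),  # numeric metadata feature
  macro = c(0, 0.5, 1),  # numeric metadata feature
  stringsAsFactors = FALSE
)

toyCorpus <- sento_corpus(toy)

# The date and the features are stored as document-level variables.
head(quanteda::docvars(toyCorpus))
```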

Summarize a corpus through some statistics and plots

The corpus_summarize() function gives a quick overview of your corpus in terms of the number of documents, the number of tokens, and its metadata features. The summary can be computed at a daily, weekly, monthly, or yearly frequency, and for all corpus features or only a selection of them.

corpus <- sento_corpus(usnews)

summ <- corpus_summarize(corpus, by = "month", features = c("wsj", "wapo"))
stats <- summ[["stats"]]
plots <- summ[["plots"]]

The summary consists of a statistics component...

stats

... and a component with pregenerated graphs of the statistics.

plots$doc_plot # monthly evolution of the number of documents
plots$feature_plot # monthly evolution of the presence of the two journal features
plots$token_plot # monthly evolution of the token statistics
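
The same summary can be computed at another frequency and across all features. The sketch below assumes that `by = "year"` is the value for the yearly frequency and that leaving out `features` selects every feature in the corpus; the name `summYear` is only illustrative.

```r
library("sentometrics")

data("usnews")
corpus <- sento_corpus(usnews)

# Yearly summary over all corpus features (no feature selection).
summYear <- corpus_summarize(corpus, by = "year")

# Same two components as before: statistics and pregenerated plots.
names(summYear)
head(summYear[["stats"]])
```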

Apply quanteda corpus functions to a sento_corpus object

The many corpus manipulation functions of the quanteda package also work on a sento_corpus object, because the sento_corpus object is built on quanteda's corpus object.

corpus <- sento_corpus(usnews)

res <- corpus_reshape(corpus, to = "sentences")
sam <- corpus_sample(corpus, 100)
seg <- corpus_segment(corpus, pattern = "stock", use_docvars = TRUE)
sub <- corpus_subset(corpus, wsj == 1)
tri <- corpus_trim(corpus, "documents", min_ntoken = 300)
trs <- corpus_trim(corpus, "sentences", min_ntoken = 40)
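
A quick way to see the relationship with quanteda is to inspect the class of the object and the result of one manipulation. This is a small sketch that only uses functions already shown above, plus ndoc() and docvars() from quanteda.

```r
library("sentometrics")
library("quanteda")

data("usnews")
corpus <- sento_corpus(usnews)

# A sento_corpus object also inherits from quanteda's corpus class.
inherits(corpus, "corpus")

# Subsetting keeps only the documents flagged with the wsj feature.
sub <- corpus_subset(corpus, wsj == 1)
ndoc(sub) <= ndoc(corpus)
all(docvars(sub, "wsj") == 1)
```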

Enrich a sento_corpus object with features

The add_features() function adds features to your corpus. Features can be supplied directly, or generated by matching keywords or regex patterns against the texts.

corpus <- sento_corpus(usnews[, 1:3])

kw <- list(
  E = c("economy", "economic"),
  P = c("polic.|Polic.|politi.|Politi."), # a regex pattern
  U = c("uncertainty", "uncertain")
)

corpus <- add_features(corpus, keywords = kw, do.binary = TRUE, do.regex = c(FALSE, TRUE, FALSE))
docvars(corpus, "dummyFeature") <- NULL # drop the placeholder feature of the feature-less corpus

head(docvars(corpus), 20)
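
Features can also be supplied directly instead of being generated from keywords. The sketch below assumes the `featuresdf` argument of add_features() accepts a data frame with one numeric column per new feature; the feature name `random` and its values are made up for illustration. The values are drawn from the unit interval because, to my understanding, values outside it would be normalized internally.

```r
library("sentometrics")
library("quanteda")

data("usnews")
corpus <- sento_corpus(usnews[, 1:3])

# Hypothetical feature: one random value in [0, 1] per document.
set.seed(505)
extra <- data.frame(random = runif(ndoc(corpus)))

corpus <- add_features(corpus, featuresdf = extra)

head(docvars(corpus))
```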


SentometricsResearch/sentometrics documentation built on Aug. 20, 2021, 5:31 p.m.