knitr::opts_chunk$set(warning = FALSE, message = FALSE, fig.width = 7, fig.height = 4, fig.align = "center")
This tutorial shows how to create, enrich, transform, and analyze a sento_corpus object. A sento_corpus object is special because it always has a date column and numeric metadata features.
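To make that structure concrete, here is a minimal sketch that builds a small corpus by hand. The documents and the feature names econ and sport are made up for illustration; the id, date, and texts columns are the ones sento_corpus() looks for, and dates are assumed to be given as yyyy-mm-dd.

```r
library("sentometrics")

# Hypothetical toy data: three documents with two made-up numeric features.
toy <- data.frame(
  id    = c("d1", "d2", "d3"),
  date  = c("2020-01-01", "2020-01-02", "2020-01-03"),
  texts = c("Markets rallied on good news.",
            "The home team lost again.",
            "Growth slowed while inflation rose."),
  econ  = c(1, 0, 1),   # 1 if the document is about the economy
  sport = c(0, 1, 0),   # 1 if the document is about sports
  stringsAsFactors = FALSE
)

toy_corpus <- sento_corpus(corpusdf = toy)

# The date and the numeric features are stored as document-level metadata.
quanteda::docvars(toy_corpus)
```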
Preparation
library("sentometrics")
library("quanteda")

data("usnews")
data("list_lexicons")
data("list_valence_shifters")
The corpus_summarize() function gives a quick overview of your corpus in terms of the number of documents, the number of tokens, and its metadata features. The summary can be computed at a daily, weekly, monthly, or yearly frequency, for all corpus features or only a selection of them.
corpus <- sento_corpus(usnews)
summ <- corpus_summarize(corpus, by = "month", features = c("wsj", "wapo"))
stats <- summ[["stats"]]
plots <- summ[["plots"]]
The summary consists of a statistics component...
stats
... and a component with pregenerated graphs of the statistics.
plots$doc_plot # monthly evolution of the number of documents
plots$feature_plot # monthly evolution of the presence of the two journal features
plots$token_plot # monthly evolution of the token statistics
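The same summary can also be requested at another frequency and over all features. The sketch below assumes the usnews data shipped with the package and leaves the features argument at its default, so no selection is made.

```r
library("sentometrics")
data("usnews")

corpus <- sento_corpus(usnews)

# Yearly summary over all corpus features (no features selection).
summ_year <- corpus_summarize(corpus, by = "year")

names(summ_year)            # components of the summary
head(summ_year[["stats"]])  # first rows of the yearly statistics
```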
quanteda corpus functions on a sento_corpus object
It is also possible to apply the many corpus manipulation functions of the quanteda package to a sento_corpus object. In fact, the sento_corpus object is built on quanteda's corpus object.
corpus <- sento_corpus(usnews)
res <- corpus_reshape(corpus, to = "sentences")
sam <- corpus_sample(corpus, 100)
seg <- corpus_segment(corpus, pattern = "stock", use_docvars = TRUE)
sub <- corpus_subset(corpus, wsj == 1)
tri <- corpus_trim(corpus, "documents", min_ntoken = 300)
trs <- corpus_trim(corpus, "sentences", min_ntoken = 40)
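Because a sento_corpus object relies on a particular structure, it can be worth inspecting the result after applying a quanteda function. A minimal sketch, using only the subsetting call from above plus quanteda's ndoc() and docvars():

```r
library("sentometrics")
library("quanteda")
data("usnews")

corpus <- sento_corpus(usnews)

# Keep only the documents flagged with the wsj feature.
sub <- corpus_subset(corpus, wsj == 1)

ndoc(corpus)        # number of documents before subsetting
ndoc(sub)           # number of documents after subsetting
class(sub)          # inspect which classes the result carries
head(docvars(sub))  # date and features of the remaining documents
```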
sento_corpus object with features
Using the add_features() function, additional features can be added to your corpus, or generated through keywords or regex pattern matching.
corpus <- sento_corpus(usnews[, 1:3])
kw <- list(
  E = c("economy", "economic"),
  P = c("polic.|Polic.|politi.|Politi."), # a regex pattern
  U = c("uncertainty", "uncertain")
)
corpus <- add_features(corpus, keywords = kw, do.binary = TRUE, do.regex = c(FALSE, TRUE, FALSE))
docvars(corpus, "dummyFeature") <- NULL
head(docvars(corpus), 20)
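The example above generates features from keywords. Features that are already available as numbers can instead be passed through the featuresdf argument of add_features(). In the sketch below, the feature named random is purely hypothetical and filled with random values between 0 and 1, one per document.

```r
library("sentometrics")
library("quanteda")
data("usnews")

corpus <- sento_corpus(usnews[, 1:3])

# Hypothetical feature: one random value in [0, 1] per document.
set.seed(1)
feat <- data.frame(random = runif(ndoc(corpus)))

corpus <- add_features(corpus, featuresdf = feat)
head(docvars(corpus))
```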