add_features: Add feature columns to a (sento_)corpus object
In sentometrics: An Integrated Framework for Textual Sentiment Time Series Aggregation and Prediction

add_features

R Documentation

Add feature columns to a (sento_)corpus object

Description

Adds new feature columns, either user-supplied or based on keyword(s)/regex pattern search, to a provided sento_corpus or a quanteda corpus object.

Usage

add_features(
  corpus,
  featuresdf = NULL,
  keywords = NULL,
  do.binary = TRUE,
  do.regex = FALSE
)

Arguments

`corpus`	a `sento_corpus` object created with `sento_corpus`, or a quanteda `corpus` object.
`featuresdf`	a named `data.frame` of type `numeric` where each column is a new feature to be added to the input `corpus` object. If the number of rows in `featuresdf` is not equal to the number of documents in `corpus`, recycling will occur. The numeric values should be between 0 and 1 (inclusive).
`keywords`	a named `list`. For every element, a new feature column is added. With `do.binary = TRUE`, its value is 1 for the texts in which at least one of the keywords appears, and 0 otherwise. With `do.binary = FALSE`, its value is the normalized number of times the keyword(s) occur in the text. If no texts match a keyword, no column is added. The `list` names are used as the names of the new features. For more complex searching, a single regular expression can be used instead of keywords to define a new feature (see the Details section).
`do.binary`	a `logical`. If `do.binary = TRUE`, keyword features are 0/1 indicators. If `do.binary = FALSE`, the number of occurrences is normalized between 0 and 1 (see argument `keywords`).
`do.regex`	a `logical` vector equal in length to the number of elements in the `keywords` argument `list`, or a single value if it applies to all elements. It should be set to `TRUE` at those positions where a single regular expression is used to identify the particular feature.

Details

If a provided feature name is already part of the corpus, it will be replaced. The featuresdf and keywords arguments can be provided at the same time, or only one of them, leaving the other at NULL.

We use the stringi package for searching the keywords. The do.regex argument points to the corresponding elements in keywords. For elements associated with FALSE, we transform the keywords into a simple regular expression, involving "\b" for exact word boundary matching and (if multiple keywords) | as OR operator. The elements associated with TRUE do not undergo this transformation, and are evaluated as given, if the corresponding keywords vector consists of only one expression. For a large corpus and/or complex regex patterns, this function may take a while to run.

Scaling between 0 and 1 is performed via min-max normalization, per column.
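The keyword transformation and the scaling described above can be illustrated with a small sketch. It is not the package's implementation: it uses base R instead of stringi, the helper names `keywords_to_regex` and `min_max` are hypothetical, and the handling of a constant column is an assumption made only for this example.

```r
# Illustrative sketch only: base R stand-ins for the stringi-based search
# described in Details. Helper names are hypothetical, not package functions.

# Wrap each keyword in word boundaries and join multiple keywords with "|",
# as described for elements where do.regex = FALSE.
keywords_to_regex <- function(kws) {
  paste0("\\b", kws, "\\b", collapse = "|")
}

# Min-max normalization of one feature column to the [0, 1] range.
min_max <- function(x) {
  rng <- range(x)
  # Assumption of this sketch: a constant column is mapped to all zeros.
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

texts <- c(
  "Obama met the US president.",
  "Obama spoke.",
  "The war ended.",
  "Obamacare was debated."  # no match: "Obama" is not a whole word here
)

pattern <- keywords_to_regex(c("Obama", "US president"))

# Count the non-overlapping matches of the pattern in each text.
counts <- lengths(regmatches(texts, gregexpr(pattern, texts, perl = TRUE)))

binary <- as.numeric(counts > 0)  # analogue of do.binary = TRUE
scaled <- min_max(counts)         # analogue of do.binary = FALSE

print(pattern)  # "\\bObama\\b|\\bUS president\\b"
print(counts)   # 2 1 0 0
print(scaled)   # 1.0 0.5 0.0 0.0
```

The generated pattern has the same form as the `pres2` expression in the Examples section, which is why the keyword-based `pres` feature and the regex-based `pres2` feature are expected to agree there.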

Value

An updated corpus object.

Author(s)

Samuel Borms

Examples

library("sentometrics")

set.seed(505)

# construct a corpus and add (a) feature(s) to it
corpus <- quanteda::corpus_sample(
  sento_corpus(corpusdf = sentometrics::usnews), 500
)
corpus1 <- add_features(corpus,
                        featuresdf = data.frame(random = runif(quanteda::ndoc(corpus))))
corpus2 <- add_features(corpus,
                        keywords = list(pres = "president", war = "war"),
                        do.binary = FALSE)
corpus3 <- add_features(corpus,
                        keywords = list(pres = c("Obama", "US president")))
corpus4 <- add_features(corpus,
                        featuresdf = data.frame(all = 1),
                        keywords = list(pres1 = "Obama|US [pP]resident",
                                        pres2 = "\\bObama\\b|\\bUS president\\b",
                                        war = "war"),
                        do.regex = c(TRUE, TRUE, FALSE))

sum(quanteda::docvars(corpus3, "pres")) ==
  sum(quanteda::docvars(corpus4, "pres2")) # TRUE

# adding a complementary feature
nonpres <- data.frame(nonpres = as.numeric(!quanteda::docvars(corpus3, "pres")))
corpus3 <- add_features(corpus3, featuresdf = nonpres)

sentometrics documentation built on April 3, 2025, 6:15 p.m.
