| add_features | R Documentation |
Adds new feature columns, either user-supplied or based on keyword(s)/regex pattern search, to
a provided sento_corpus or a quanteda corpus object.
add_features(
corpus,
featuresdf = NULL,
keywords = NULL,
do.binary = TRUE,
do.regex = FALSE
)
| corpus | a |
| featuresdf | a named |
| keywords | a named |
| do.binary | a |
| do.regex | a |
If a provided feature name is already part of the corpus, it will be replaced. The featuresdf and
keywords arguments can be provided together, or only one of them can be provided, leaving the other at NULL. The
stringi package is used to search for the keywords. Each element of do.regex applies to the corresponding element
of keywords. Where it is FALSE, the keywords are transformed into a simple regular expression, using "\b" for
exact word boundary matching and (if there are multiple keywords) | as the OR operator. Elements associated with TRUE do
not undergo this transformation and are evaluated as given, provided the corresponding keywords vector consists of only one
expression. For a large corpus and/or complex regex patterns, this function may take a while to run. Scaling between 0
and 1 is performed via min-max normalization, per column.
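The keyword transformation and min-max scaling described above can be illustrated with a minimal sketch. It is not the package's actual implementation: the helper names keywords_to_regex, count_matches and min_max are hypothetical, base R matching is used in place of stringi to keep the sketch dependency-free, and it assumes that per-document match counts are what gets scaled when do.binary = FALSE.

```r
# Hypothetical helper (not part of sentometrics): builds the kind of pattern
# described for do.regex = FALSE, i.e. "\b" around each keyword, joined by "|".
keywords_to_regex <- function(kws) {
  paste0("\\b", kws, "\\b", collapse = "|")
}

# Hypothetical helper: counts pattern matches per text using base R.
# gregexpr() returns -1 when there is no match, so only positive positions count.
count_matches <- function(texts, pattern) {
  vapply(
    gregexpr(pattern, texts, perl = TRUE),
    function(m) sum(m > 0),
    integer(1)
  )
}

# Hypothetical helper: min-max normalization of one column to [0, 1].
min_max <- function(x) {
  rng <- range(x)
  # Assumption: a constant column is mapped to 0 to avoid division by zero.
  if (rng[1] == rng[2]) return(rep(0, length(x)))
  (x - rng[1]) / (rng[2] - rng[1])
}

texts <- c(
  "Obama met the US president elect.",
  "No relevant names here.",
  "Obama, Obama and Obama again."
)

pattern <- keywords_to_regex(c("Obama", "US president"))
counts <- count_matches(texts, pattern)

binary <- as.numeric(counts > 0)  # analogous to do.binary = TRUE
scaled <- min_max(counts)         # analogous to do.binary = FALSE
```

For the keywords c("Obama", "US president"), the generated pattern is the same string as the one supplied by hand for pres2 in the examples, which is why the pres and pres2 features there are expected to agree.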
An updated corpus object.
Samuel Borms
set.seed(505)
# construct a corpus and add (a) feature(s) to it
corpus <- quanteda::corpus_sample(
sento_corpus(corpusdf = sentometrics::usnews), 500
)
corpus1 <- add_features(corpus,
featuresdf = data.frame(random = runif(quanteda::ndoc(corpus))))
corpus2 <- add_features(corpus,
keywords = list(pres = "president", war = "war"),
do.binary = FALSE)
corpus3 <- add_features(corpus,
keywords = list(pres = c("Obama", "US president")))
corpus4 <- add_features(corpus,
featuresdf = data.frame(all = 1),
keywords = list(pres1 = "Obama|US [pP]resident",
pres2 = "\\bObama\\b|\\bUS president\\b",
war = "war"),
do.regex = c(TRUE, TRUE, FALSE))
sum(quanteda::docvars(corpus3, "pres")) ==
sum(quanteda::docvars(corpus4, "pres2")) # TRUE
# adding a complementary feature
nonpres <- data.frame(nonpres = as.numeric(!quanteda::docvars(corpus3, "pres")))
corpus3 <- add_features(corpus3, featuresdf = nonpres)