txt_recode_ngram: Recode words with compound multi-word expressions


txt_recode_ngram    R Documentation

Recode words with compound multi-word expressions

Description

Replace tokens in a character vector with compound multi-word expressions, so that c("New", "York") becomes c("New York", NA).

Usage

txt_recode_ngram(x, compound, ngram, sep = " ")

Arguments

x

a character vector of words in which you want to replace tokens with compound multi-word expressions. This is generally the token column of as.data.frame(udpipe_annotate(txt)).

compound

a character vector of compound multi-word expressions, indicating terms which can be considered as one word. For example c('New York', 'Brussels Hoofdstedelijk Gewest').

ngram

an integer vector of the same length as compound indicating how many terms there are in each compound multi-word expression given by compound, where compound[i] contains ngram[i] words. So if compound is c('New York', 'Brussels Hoofdstedelijk Gewest'), ngram would be c(2, 3).

sep

separator used when the compounds were constructed by combining the words together into a compound multi-word expression. Defaults to a space: ' '.
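
When only the compounds are at hand, ngram can be derived from them by counting the parts separated by sep. The following is a minimal base R sketch, assuming that the separator does not occur inside the individual words; the variable names are illustrative only.

```r
compound <- c("New York", "Brussels Hoofdstedelijk Gewest")
sep <- " "

## Split each compound on the separator and count the resulting parts
ngram <- lengths(strsplit(compound, split = sep, fixed = TRUE))
ngram
```

The resulting vector has the same length as compound and can be passed as the ngram argument together with the same sep.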

Value

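the same character vector x, where elements of x are replaced by the matching compound multi-word expression. The first token of a match is replaced by the compound and the remaining tokens of that match are set to NA. It gives preference to replacing with compounds with a higher ngram if these occur. See the examples.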
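
To make the documented behaviour concrete, the following base R sketch mimics it on a small token vector. The helper recode_sketch is hypothetical and is not the implementation used by udpipe; it only assumes what is described above: longer compounds are tried first, the first token of a match receives the compound, and the remaining tokens of the match become NA.

```r
## Hypothetical helper for illustration only, not part of udpipe
recode_sketch <- function(x, compound, ngram, sep = " ") {
  stopifnot(length(compound) == length(ngram))
  ## Handle the longest expressions first so that they take preference
  for (n in sort(unique(ngram), decreasing = TRUE)) {
    i <- 1
    while (i + n - 1 <= length(x)) {
      window <- x[i:(i + n - 1)]
      candidate <- paste(window, collapse = sep)
      ## Skip windows which overlap an earlier replacement (they contain NA)
      if (!anyNA(window) && candidate %in% compound[ngram == n]) {
        x[i] <- candidate
        if (n > 1) x[(i + 1):(i + n - 1)] <- NA_character_
        i <- i + n
      } else {
        i <- i + 1
      }
    }
  }
  x
}

x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".")
y <- recode_sketch(x, compound = c("New-York", "New-York-City"),
                   ngram = c(2, 3), sep = "-")
y
## Drop the NA placeholders to obtain the recoded token sequence
tokens <- y[!is.na(y)]
tokens
```

Because the result keeps the length of x, it can be stored next to the original tokens in a data.frame; removing the NA values afterwards gives one element per term.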

See Also

txt_nextgram

Examples

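## Recode a single two-word compound; the token following the match becomes NA
x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".")
y <- txt_recode_ngram(x, compound = "New York", ngram = 2, sep = " ")
data.frame(x, y)

## Compounds built with a custom separator; the compound with the higher ngram takes preference
keyw <- data.frame(keyword = c("New-York", "New-York-City"), ngram = c(2, 3))
y <- txt_recode_ngram(x, compound = keyw$keyword, ngram = keyw$ngram, sep = "-")
data.frame(x, y)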

## Example replacing adjectives followed by a noun with the full compound word
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")
keyw <- keywords_phrases(x$xpos, term = x$token, pattern = "JJNN", 
                         is_regex = TRUE, detailed = FALSE)
head(keyw)
x$term <- txt_recode_ngram(x$token, compound = keyw$keyword, ngram = keyw$ngram)
head(x[, c("token", "term", "xpos")], 12)

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.