txt_recode_ngram: Recode words with compound multi-word expressions
In udpipe: Tokenization, Parts of Speech Tagging, Lemmatization and Dependency Parsing with the 'UDPipe' 'NLP' Toolkit

txt_recode_ngram

R Documentation

Recode words with compound multi-word expressions

Description

Replace tokens in a character vector with compound multi-word expressions, so that c("New", "York") becomes c("New York", NA).

Usage

txt_recode_ngram(x, compound, ngram, sep = " ")

Arguments

`x`	a character vector of words in which you want to replace tokens with compound multi-word expressions. This is generally a character vector as returned by the token column of `as.data.frame(udpipe_annotate(txt))`.
`compound`	a character vector of compound multi-word expressions, indicating terms which can be considered as one word. For example `c('New York', 'Brussels Hoofdstedelijk Gewest')`.
`ngram`	an integer vector of the same length as `compound` indicating how many terms there are in each compound multi-word expression given by `compound`, where `compound[i]` contains `ngram[i]` words. So if `compound` is `c('New York', 'Brussels Hoofdstedelijk Gewest')`, the ngram would be `c(2, 3)`.
`sep`	separator used when the compounds were constructed by combining the words together into a compound multi-word expression. Defaults to a space: ' '.
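
If only the compounds are at hand, the matching `ngram` vector can be derived by splitting each compound on the separator. The following is a minimal base R sketch, not part of udpipe; it assumes that `sep` never occurs inside an individual word.

```r
# Compounds built by joining words with a single space
compound <- c("New York", "Brussels Hoofdstedelijk Gewest")
sep <- " "

# Count the words in each compound by splitting on the literal separator
ngram <- lengths(strsplit(compound, split = sep, fixed = TRUE))
ngram
# 2 3
```

The resulting vector has the same length as `compound`, as the `ngram` argument requires.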

Value

The same character vector x, in which matched elements are replaced by the compound multi-word expression. The first token of a matched compound holds the full compound and the remaining tokens become NA. Preference is given to compounds with higher ngrams if these occur. See the examples.
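
The preference for longer compounds can be illustrated with a small base R sketch. The helper `recode_ngram_sketch` below is hypothetical and is not the package's implementation; it only mimics the documented behaviour by trying the largest ngrams first and skipping positions that a longer compound has already claimed.

```r
# Hypothetical helper, NOT part of udpipe: longest-match-first recoding
recode_ngram_sketch <- function(x, compound, ngram, sep = " ") {
  stopifnot(length(compound) == length(ngram))
  out <- x
  consumed <- rep(FALSE, length(x))
  # Handle the largest ngrams first so longer compounds take precedence
  for (n in sort(unique(ngram), decreasing = TRUE)) {
    if (n < 2 || n > length(x)) next
    targets <- compound[ngram == n]
    for (i in seq_len(length(x) - n + 1)) {
      idx <- i:(i + n - 1)
      # Skip windows overlapping a compound that was already placed
      if (any(consumed[idx])) next
      candidate <- paste(x[idx], collapse = sep)
      if (candidate %in% targets) {
        out[i] <- candidate            # first token carries the compound
        out[idx[-1]] <- NA_character_  # remaining tokens become NA
        consumed[idx] <- TRUE
      }
    }
  }
  out
}

x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".")
res <- recode_ngram_sketch(x,
                           compound = c("New-York", "New-York-City"),
                           ngram = c(2, 3), sep = "-")
res
# "I" "went" "to" "New-York-City" NA NA "on" "holiday" "."
```

Because the three-word compound is placed before the two-word one is considered, "New-York" never overwrites part of "New-York-City".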

Examples

x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".")
y <- txt_recode_ngram(x, compound = "New York", ngram = 2, sep = " ")
data.frame(x, y)

keyw <- data.frame(keyword = c("New-York", "New-York-City"), ngram = c(2, 3))
y <- txt_recode_ngram(x, compound = keyw$keyword, ngram = keyw$ngram, sep = "-")
data.frame(x, y)

## Example replacing adjectives followed by a noun with the full compound word
data(brussels_reviews_anno)
x <- subset(brussels_reviews_anno, language == "nl")
keyw <- keywords_phrases(x$xpos, term = x$token, pattern = "JJNN",
                         is_regex = TRUE, detailed = FALSE)
head(keyw)
x$term <- txt_recode_ngram(x$token, compound = keyw$keyword, ngram = keyw$ngram)
head(x[, c("token", "term", "xpos")], 12)
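
Since the trailing tokens of a matched compound are set to NA, a follow-up step will often drop those positions before counting or modelling terms. The sketch below assumes the udpipe package is installed and reuses the small sentence from the first example; the variable name `terms` is only illustrative.

```r
library(udpipe)

x <- c("I", "went", "to", "New", "York", "City", "on", "holiday", ".")
y <- txt_recode_ngram(x, compound = "New York", ngram = 2L, sep = " ")

# Keep only the positions that still hold a term; the NA placeholders
# left behind by merged tokens are dropped
terms <- y[!is.na(y)]
terms
```

In a data frame such as the annotated reviews above, the same idea would be a row filter on the recoded column, for example `x[!is.na(x$term), ]`.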

udpipe documentation built on Jan. 6, 2023, 5:06 p.m.
