View source: R/txt_recode_fast.R
| txt_recode_ngram_fast | R Documentation |
Efficiently combines consecutive tokens into multiword expressions using C++. This function scans text sequentially to identify and merge n-gram patterns.
txt_recode_ngram_fast(x, compound, ngram, sep = " ")
| x | Character vector of tokens (e.g., word forms or lemmas) |
| compound | Character vector of multiword expressions to match |
| ngram | Integer vector giving the number of tokens in each element of compound |
| sep | String separator to use when joining tokens (default: " ") |
When a multiword match is found:
- The first position gets the combined multiword expression
- Subsequent positions that were merged are set to NA
The function checks n-grams from longest to shortest to prioritize longer matches.
Performance: roughly 80-150x faster than a pure R implementation on typical text data.
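The matching rules described above can be illustrated with a small pure R sketch. This is only one possible reading of the documented behaviour, not the package's C++ code: the name recode_ngram_sketch is hypothetical, and scanning position by position while trying the longest n-gram first at each position is an assumption.

```r
# Hypothetical pure R reference for the documented behaviour.
# Not the package's C++ implementation; for illustration only.
recode_ngram_sketch <- function(x, compound, ngram, sep = " ") {
  stopifnot(length(compound) == length(ngram))
  out <- x
  n <- length(x)
  # Candidate n-gram sizes, longest first, so longer matches take priority.
  sizes <- sort(unique(ngram[ngram > 1]), decreasing = TRUE)
  i <- 1
  while (i <= n) {
    step <- 1
    for (k in sizes) {
      if (i + k - 1 > n) next          # window would run past the end
      window <- x[i:(i + k - 1)]
      if (anyNA(window)) next          # never merge across missing tokens
      candidate <- paste(window, collapse = sep)
      if (candidate %in% compound[ngram == k]) {
        out[i] <- candidate            # first position gets the compound
        out[(i + 1):(i + k - 1)] <- NA # merged positions become NA
        step <- k                      # continue after the merged tokens
        break
      }
    }
    i <- i + step
  }
  out
}

# Overlapping compounds: the 3-gram is preferred over the 2-gram.
recode_ngram_sketch(
  c("new", "york", "city", "is", "big"),
  compound = c("new york", "new york city"),
  ngram = c(2, 3)
)
# Returns: c("new york city", NA, NA, "is", "big")
```

Keeping the merged positions as NA, rather than dropping them, means the output lines up element by element with the input vector.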
Character vector of the same length as x, in which each matched n-gram is combined at its first position and the remaining merged positions are set to NA.
tokens <- c("machine", "learning", "is", "cool", "machine", "learning")
compounds <- c("machine learning")
ngrams <- c(2)
txt_recode_ngram_fast(tokens, compounds, ngrams, " ")
# Returns: c("machine learning", NA, "is", "cool", "machine learning", NA)
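If a compact token sequence is wanted instead of a position-aligned one, the NA placeholders can be dropped afterwards with base R. The sketch below copies the result vector from the example above rather than calling the function, and the variable names are illustrative only.

```r
# Result vector copied from the example above (not recomputed here).
recoded <- c("machine learning", NA, "is", "cool", "machine learning", NA)

# Drop the NA placeholders left at merged positions.
compact <- recoded[!is.na(recoded)]
compact
# Returns: c("machine learning", "is", "cool", "machine learning")
```

When the tokens live in a data frame alongside other annotation columns, keeping the NA values and filtering rows later may be the safer choice, since the vector then stays the same length as the original columns.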