txt_recode_ngram_fast: Fast n-gram recoding for multiword detection

View source: R/txt_recode_fast.R

txt_recode_ngram_fast    R Documentation

Fast n-gram recoding for multiword detection

Description

Efficiently combines consecutive tokens into multiword expressions using C++. This function scans text sequentially to identify and merge n-gram patterns.

Usage

txt_recode_ngram_fast(x, compound, ngram, sep = " ")

Arguments

x

Character vector of tokens (e.g., word forms or lemmas)

compound

Character vector of multiword expressions to match

ngram

Integer vector giving the number of tokens in each element of compound

sep

String separator to use when joining tokens (default: " ")

Details

When a multiword match is found:

  • The first position gets the combined multiword expression

  • Subsequent positions that were merged are set to NA

The function checks n-grams from longest to shortest to prioritize longer matches.

Performance: roughly 80-150x faster than a pure R implementation on typical text data.
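
The matching rules above can be illustrated with a small pure R sketch. This is not the package's C++ code. The helper name recode_ngram_sketch is hypothetical, and the sketch assumes that compounds are stored joined by the same sep, that the scan runs left to right without overlapping matches, and that windows containing NA never match.

```r
# Hypothetical pure R reference for the documented behaviour.
# Assumptions (not confirmed by the package source): compounds are joined
# with `sep`, matches do not overlap, and NA tokens never match.
recode_ngram_sketch <- function(x, compound, ngram, sep = " ") {
  stopifnot(length(compound) == length(ngram))
  out <- x
  n <- length(x)
  # Try the longest n-gram sizes first so longer matches win.
  sizes <- sort(unique(ngram), decreasing = TRUE)
  i <- 1L
  while (i <= n) {
    matched <- FALSE
    for (k in sizes) {
      if (i + k - 1L > n) next          # window would run past the end
      window <- x[i:(i + k - 1L)]
      if (anyNA(window)) next
      candidate <- paste(window, collapse = sep)
      if (candidate %in% compound[ngram == k]) {
        out[i] <- candidate             # first position gets the compound
        if (k > 1L) out[(i + 1L):(i + k - 1L)] <- NA_character_
        i <- i + k                      # skip the merged positions
        matched <- TRUE
        break
      }
    }
    if (!matched) i <- i + 1L
  }
  out
}

tokens <- c("new", "york", "city", "is", "in", "new", "york")
result <- recode_ngram_sketch(
  tokens,
  compound = c("new york", "new york city"),
  ngram = c(2L, 3L)
)
print(result)
```

In this example the first three tokens match the 3-gram "new york city" before the shorter "new york" is tried, while the final two tokens can only match the 2-gram.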

Value

Character vector where matched n-grams are combined and subsequent tokens (that were merged) are set to NA
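
Because the result has the same length as x, it can be stored alongside the original tokens. If only the surviving tokens are needed, the NA placeholders can be dropped afterwards. A minimal base R sketch, using a literal vector of the documented shape in place of a real call:

```r
# Stand-in for a result returned by txt_recode_ngram_fast()
recoded <- c("machine learning", NA, "is", "cool", "machine learning", NA)

# Keep only positions that were not merged into a preceding compound
kept <- recoded[!is.na(recoded)]
print(kept)
```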

Examples

tokens <- c("machine", "learning", "is", "cool", "machine", "learning")
compounds <- c("machine learning")
ngrams <- c(2L)  # "machine learning" spans two tokens
txt_recode_ngram_fast(tokens, compounds, ngrams, sep = " ")
# Returns: c("machine learning", NA, "is", "cool", "machine learning", NA)


tall documentation built on Dec. 12, 2025, 5:07 p.m.