txt_recode_ngram_fast: Fast n-gram recoding for multiword detection
In tall: Text Analysis for All

Description

Combines consecutive tokens into multiword expressions using a C++ implementation. The function scans the token vector sequentially and merges each n-gram that matches one of the supplied compounds.

Usage

txt_recode_ngram_fast(x, compound, ngram, sep = " ")

Arguments

`x`	Character vector of tokens (e.g., lemmas or word forms)
`compound`	Character vector of multiword expressions to match
`ngram`	Integer vector giving the number of tokens in each compound, in the same order as `compound`
`sep`	String separator to use when joining tokens (default: " ")
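
The `ngram` argument has to agree with `compound`, so it can help to derive it rather than type it by hand. The following is a minimal sketch that assumes the compounds are written with single spaces between tokens; the variable names are illustrative and not part of the package.

```r
# Assumption: each compound is a space-separated string of tokens.
compounds <- c("machine learning", "new york city")

# Count the tokens in each compound by splitting on a literal space.
ngrams <- lengths(strsplit(compounds, " ", fixed = TRUE))

ngrams
```

The resulting vector has one entry per compound and can then be passed as `ngram` alongside `compound`.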

Details

When a multiword match is found, the first position gets the combined multiword expression, and the subsequent positions that were merged are set to NA.

The function checks n-grams from longest to shortest to prioritize longer matches.
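
The behavior described above can be illustrated with a small pure R sketch. This is not the package's C++ code: `recode_ngram_sketch` is a hypothetical helper, and it assumes that matches do not overlap (scanning resumes after a merged n-gram) and that the window joined with `sep` is compared directly against `compound`.

```r
# Hypothetical pure R reference for illustration only.
recode_ngram_sketch <- function(x, compound, ngram, sep = " ") {
  out <- x
  n <- length(x)
  # Try the longest n-gram sizes first so longer matches win.
  sizes <- sort(unique(ngram), decreasing = TRUE)
  i <- 1L
  while (i <= n) {
    matched <- FALSE
    for (k in sizes) {
      if (i + k - 1L > n) next            # window would run past the end
      window <- x[i:(i + k - 1L)]
      if (anyNA(window)) next             # never merge across missing tokens
      candidate <- paste(window, collapse = sep)
      if (candidate %in% compound[ngram == k]) {
        out[i] <- candidate               # first position holds the compound
        if (k > 1L) out[(i + 1L):(i + k - 1L)] <- NA_character_
        i <- i + k                        # assumption: skip the merged tokens
        matched <- TRUE
        break
      }
    }
    if (!matched) i <- i + 1L
  }
  out
}

tokens <- c("new", "york", "city", "is", "in", "new", "york")
result <- recode_ngram_sketch(
  tokens,
  compound = c("new york", "new york city"),
  ngram = c(2, 3)
)
result
# "new york city" NA NA "is" "in" "new york" NA
```

At position 1 both compounds could match, and the 3-gram is chosen because longer sizes are tried first; at position 6 only the 2-gram fits within the remaining tokens.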

Performance: roughly 80 to 150 times faster than a pure R implementation for typical text data.

Value

Character vector in which each matched n-gram is combined at its first position and the remaining merged positions are set to NA
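
Because merged positions are returned as NA, downstream code usually has to decide whether to keep or drop them. The sketch below works on a literal vector shaped like the documented return value; the variable names are illustrative.

```r
# A vector shaped like the documented return value.
recoded <- c("machine learning", NA, "is", "cool", "machine learning", NA)

# Drop the NA placeholders to obtain the final token sequence.
collapsed <- recoded[!is.na(recoded)]

collapsed
# "machine learning" "is" "cool" "machine learning"
```

Keeping the NA entries instead preserves a one-to-one alignment with the input tokens, which can be convenient when the result is stored as a column next to other token-level annotations.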

Examples

tokens <- c("machine", "learning", "is", "cool", "machine", "learning")
compounds <- c("machine learning")
ngrams <- c(2)
txt_recode_ngram_fast(tokens, compounds, ngrams, " ")
# Returns: c("machine learning", NA, "is", "cool", "machine learning", NA)

tall documentation built on Feb. 12, 2026, 9:08 a.m.