View source: R/strj-tokenize.R
| strj_tokenize | R Documentation |
Tokenizes Japanese character strings using a selectable segmentation engine and returns the result as a list or a data frame.
This function provides a unified interface to multiple Japanese text segmentation backends. External command-based engines were removed in v0.6.0, and all tokenization is performed using in-process implementations.
strj_segment() and strj_tinyseg() are aliases for strj_tokenize() with the "budoux" and "tinyseg" engines, respectively.
strj_tokenize(
text,
format = c("list", "data.frame"),
engine = c("stringi", "budoux", "tinyseg"),
split = FALSE,
...
)
strj_segment(text, format = c("list", "data.frame"), split = FALSE)
strj_tinyseg(text, format = c("list", "data.frame"), split = FALSE)
| text | A character vector of Japanese text to tokenize. |
| format | A string specifying the output format; one of "list" (the default) or "data.frame". |
| engine | A string specifying the tokenization engine; one of "stringi" (the default), "budoux", or "tinyseg". |
| split | A logical value indicating whether to split the input into sentences before tokenization. |
| ... | Additional arguments passed to the underlying engine. |
The following engines are supported:
"stringi": Uses ICU-based boundary analysis via stringi.
"budoux": Uses a rule-based Japanese phrase segmentation algorithm.
"tinyseg": Uses a TinySegmenter-compatible statistical model.
The legacy "mecab" and "sudachipy" engines were removed in v0.6.0.
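As an illustrative sketch of how the engines and the alias functions relate (this assumes the package exporting strj_tokenize() is attached; the input string is hypothetical):

    # Hedged sketch: assumes the package providing strj_tokenize() is loaded.
    txt <- "\u3059\u3082\u3082\u3082\u3082\u3082\u3082\u3082\u306e\u3046\u3061"

    # ICU-based boundary analysis via stringi (the default engine)
    strj_tokenize(txt, engine = "stringi")

    # Per the aliases described above, these two calls are equivalent:
    strj_tokenize(txt, engine = "budoux")
    strj_segment(txt)

    # Likewise for the TinySegmenter-compatible engine:
    strj_tokenize(txt, engine = "tinyseg")
    strj_tinyseg(txt)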
If format = "list", a named list of character vectors, one per input
element.
If format = "data.frame", a data frame containing document identifiers
and tokenized text.
strj_tokenize(
paste0(
"\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
"\u30fc\u30f4\u30a9\u306e\u3059\u304d",
"\u3068\u304a\u3063\u305f\u98a8"
)
)
strj_tokenize(
paste0(
"\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
"\u30fc\u30f4\u30a9\u306e\u3059\u304d",
"\u3068\u304a\u3063\u305f\u98a8"
),
format = "data.frame"
)