View source: R/strj-tokenize.R
| strj_tokenize | R Documentation |
Tokenizes Japanese character strings using a selectable segmentation engine and returns the result as a list or a data frame.
This function provides a unified interface to multiple Japanese text segmentation backends. External command-based engines were removed in v0.6.0, and all tokenization is performed using in-process implementations.
strj_segment() and strj_tinyseg() are aliases for strj_tokenize() with the "budoux" and "tinyseg" engines, respectively.
strj_tokenize(
text,
format = c("list", "data.frame"),
engine = c("stringi", "budoux", "tinyseg"),
split = FALSE,
...
)
strj_segment(text, format = c("list", "data.frame"), split = FALSE)
strj_tinyseg(text, format = c("list", "data.frame"), split = FALSE)
| text | A character vector of Japanese text to tokenize. |
| format | A string specifying the output format; one of "list" (the default) or "data.frame". |
| engine | A string specifying the tokenization engine; one of "stringi" (the default), "budoux", or "tinyseg". |
| split | A logical value indicating whether to split the input into sentences before tokenization. |
| ... | Additional arguments passed to the underlying engine. |
The following engines are supported:
"stringi": Uses ICU-based boundary analysis via stringi.
"budoux": Uses a rule-based Japanese phrase segmentation algorithm.
"tinyseg": Uses a TinySegmenter-compatible statistical model.
The legacy "mecab" and "sudachipy" engines were removed in v0.6.0.
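As an illustrative sketch of how the engines and the alias functions relate (this assumes the package exporting strj_tokenize() is attached; the input string is hypothetical):

    # Hedged sketch: assumes the package providing strj_tokenize() is loaded.
    txt <- "\u3059\u3082\u3082\u3082\u3082\u3082\u3082\u3082\u306e\u3046\u3061"

    # ICU-based boundary analysis via stringi (the default engine)
    strj_tokenize(txt, engine = "stringi")

    # Per the aliases described above, these two calls are equivalent:
    strj_tokenize(txt, engine = "budoux")
    strj_segment(txt)

    # Likewise for the TinySegmenter-compatible engine:
    strj_tokenize(txt, engine = "tinyseg")
    strj_tinyseg(txt)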
If format = "list", a named list of character vectors, one per input
element.
If format = "data.frame", a data frame containing document identifiers
and tokenized text.
strj_tokenize(
paste0(
"\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
"\u30fc\u30f4\u30a9\u306e\u3059\u304d",
"\u3068\u304a\u3063\u305f\u98a8"
)
)
strj_tokenize(
paste0(
"\u3042\u306e\u30a4\u30fc\u30cf\u30c8",
"\u30fc\u30f4\u30a9\u306e\u3059\u304d",
"\u3068\u304a\u3063\u305f\u98a8"
),
format = "data.frame"
)