SudachiR is an R version of Sudachi, a Japanese morphological analyzer.
You can install the released version of {sudachir} from CRAN with:

```r
install.packages("sudachir")
```
Alternatively, you can install the development version from GitHub:

```r
if (!requireNamespace("remotes")) install.packages("remotes")
remotes::install_github("uribo/sudachir")
```
{sudachir} works with sudachipy (>= 0.6.*) via the reticulate package. To get started, you need a Python environment in which sudachipy and its dictionaries are already installed and available.
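If you are unsure whether the Python environment visible to R already meets this requirement, a quick check with reticulate (a sketch, assuming reticulate is installed) looks like:

```r
library(reticulate)

# py_module_available() returns TRUE/FALSE for each Python module
# that {sudachir} relies on in the currently active environment
for (mod in c("sudachipy", "sudachidict_core")) {
  message(mod, ": ", py_module_available(mod))
}
```

If either module reports `FALSE`, the `install_sudachipy()` helper described below can set up a suitable virtual environment for you.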
This package provides a function, `install_sudachipy()`, which helps users prepare a Python virtual environment. The required modules (sudachipy, sudachidict_core, and pandas) can be installed with this function, or installed manually.
```r
library(reticulate)
library(sudachir)

if (!virtualenv_exists("r-sudachipy")) {
  install_sudachipy()
}
use_virtualenv("r-sudachipy", required = TRUE)
```
Use `tokenize_to_df()` for tokenization.
```r
txt <- c(
  "国家公務員は鳴門海峡に行きたい",
  "吾輩は猫である。\n名前はまだない。"
)
tokenize_to_df(data.frame(doc_id = c(1, 2), text = txt))
```
You can control which dictionary features are parsed using the `col_select` argument.
```r
tokenize_to_df(txt, col_select = 1:3) |>
  dplyr::glimpse()

tokenize_to_df(
  txt,
  into = dict_features("en"),
  col_select = c("pos1", "pos2")
) |>
  dplyr::glimpse()
```
The `as_tokens()` function tidies tokens and the first part-of-speech information into a list of named tokens. You can also use the `form()` function as a shorthand for `tokenize_to_df(txt) |> as_tokens()`.
```r
tokenize_to_df(txt) |>
  as_tokens(type = "surface")

form(txt, type = "surface")
form(txt, type = "normalized")
form(txt, type = "dictionary")
form(txt, type = "reading")
```
```r
# Sudachi supports multiple split modes: mode "A" yields the shortest
# units and mode "B" middle-grained units
tokenize_to_df(txt, instance = rebuild_tokenizer("B")) |>
  as_tokens("surface", pos = FALSE)

tokenize_to_df(txt, instance = rebuild_tokenizer("A")) |>
  as_tokens("surface", pos = FALSE)
```
You can adjust dictionary options using the `rebuild_tokenizer()` function.
```r
if (py_module_available("sudachidict_full")) {
  tokenizer_full <- rebuild_tokenizer(mode = "C", dict_type = "full")
  tokenize_to_df(txt, instance = tokenizer_full) |>
    as_tokens("surface", pos = FALSE)
}
```