In uribo/sudachir: R Interface to 'Sudachi'

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

sudachir

SudachiR is an R version of Sudachi, a Japanese morphological analyzer.

Installation

You can install the released version of {sudachir} from CRAN with:

install.packages("sudachir")

and also, the developmment version from GitHub.

if (!requireNamespace("remotes"))
  install.packages("remotes")

remotes::install_github("uribo/sudachir")

Usage

Set up 'r-sudachipy' environment

{sudachir} works with sudachipy (>= 0.6.*) via the reticulate package.

To get started, it requires a Python environment that has sudachipy and its dictionaries already installed and available.

This package provides a function install_sudachipy which helps users prepare a Python virtual environment. The desired modules (sudachipy, sudachidict_core, pandas) can be installed with this function, but can also be installed manually.

library(reticulate)
library(sudachir)

if (!virtualenv_exists("r-sudachipy")) {
  install_sudachipy()
}

use_virtualenv("r-sudachipy", required = TRUE)

Tokenize sentences

Use tokenize_to_df for tokenization.

txt <- c(
  "国家公務員は鳴門海峡に行きたい",
  "吾輩は猫である。\n名前はまだない。"
)
tokenize_to_df(data.frame(doc_id = c(1, 2), text = txt))

You can control which dictionary features are parsed using the col_select argument.

tokenize_to_df(txt, col_select = 1:3) |>
  dplyr::glimpse()

tokenize_to_df(
  txt, 
  into = dict_features("en"),
  col_select = c("pos1", "pos2")
) |>
  dplyr::glimpse()

The as_tokens function can tidy up tokens and the first part-of-speech informations into a list of named tokens. Also, you can use the form function as a shorthand of tokenize_to_df(txt) |> as_tokens().

tokenize_to_df(txt) |> as_tokens(type = "surface")

form(txt, type = "surface")
form(txt, type = "normalized")
form(txt, type = "dictionary")
form(txt, type = "reading")

Change split mode

tokenize_to_df(txt, instance = rebuild_tokenizer("B")) |>
  as_tokens("surface", pos = FALSE)

tokenize_to_df(txt, instance = rebuild_tokenizer("A")) |>
  as_tokens("surface", pos = FALSE)

Change dictionary edition

You can touch dictionary options using the rebuild_tokenizer function.

if (py_module_available("sudachidict_full")) {
  tokenizer_full <- rebuild_tokenizer(mode = "C", dict_type = "full")
  tokenize_to_df(txt, instance = tokenizer_full) |>
    as_tokens("surface", pos = FALSE)
}