kakasi: Interface to kakasi

Description

kakasi is an interface to the external program kakasi (KAnji KAna Simple Inverter). It is especially useful for converting Japanese Kanji characters to Romaji (ASCII) characters.

Usage

kakasi(x, kakasi.option="-Ha -Ka -Ja -Ea -ka",
 ITAIJIDICTPATH = Sys.getenv("ITAIJIDICTPATH", unset = NA),
 KANWADICTPATH = Sys.getenv("KANWADICTPATH", unset = NA))

Arguments

x

A character vector

kakasi.option

A character string specifying the options passed to the kakasi library/program

ITAIJIDICTPATH

A character string specifying the path to itaijidict. The value is passed to the kakasi library as the environment variable ITAIJIDICTPATH.

KANWADICTPATH

A character string specifying the path to kanwadict. The value is passed to the kakasi library as the environment variable KANWADICTPATH.
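
The default value of kakasi.option bundles several conversion flags into one string. As a rough sketch, the string can be split to inspect each flag and recombined into a custom option string. The flag meanings in the comments are a reading of the kakasi manual (source character class followed by target class), not something stated on this help page, so treat them as assumptions.

```r
# Default option string, copied from the Usage section.
default.option <- "-Ha -Ka -Ja -Ea -ka"

# Split the string into its individual flags.
flags <- strsplit(default.option, " ", fixed = TRUE)[[1]]
print(flags)

# Assumed meanings, following the kakasi manual (source -> target):
#   -Ha  Hiragana          -> ASCII
#   -Ka  Katakana          -> ASCII
#   -Ja  Kanji             -> ASCII
#   -Ea  symbols (kigou)   -> ASCII
#   -ka  JIS X 0201 kana   -> ASCII

# Recombine a subset of flags into a custom option string.
custom.option <- paste(flags[c(1, 3)], collapse = " ")
print(custom.option)

# The result could then be passed on (not run here; requires Nippon):
# kakasi(x, kakasi.option = custom.option)
```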

Details

Japanese strings are often made up of a mixture of Chinese characters (Kanji), Kana (Hiragana and Katakana), and Romaji (Latin phonetic transcription). The external program kakasi converts between these four ways of writing Japanese (Kanji, Hiragana, Katakana, and Romaji). kakasi is especially useful for sanitizing a character vector by converting Japanese (non-ASCII) characters to ASCII characters.

kakasi uses two basic dictionaries: itaijidict and kanwadict. These dictionaries are placed in doc/share of the package directory when the Nippon package is installed. Because the kakasi library looks up environment variables to find its dictionaries, ITAIJIDICTPATH and KANWADICTPATH are set internally with Sys.setenv the first time kakasi is called. Subsequent calls continue to use these environment variables, which are never unset until the R session ends. To use an alternative dictionary instead of the bundled one, set the environment variables with Sys.setenv or pass the paths as arguments to kakasi. For a permanent setting of environment variables, see the help for Renviron.
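
The environment-variable lookup described above can be sketched with base R alone. The snippet shows how Sys.getenv(..., unset = NA), the default used in the Usage section, distinguishes an unset variable from a set one, and how Sys.setenv would point to an alternative dictionary. The path alt.kanwadict is hypothetical and does not refer to a real dictionary file; the final kakasi call is left commented out for that reason.

```r
# Remember the current value so it can be restored at the end.
old.value <- Sys.getenv("KANWADICTPATH", unset = NA)

# With the variable unset, the default argument of kakasi() evaluates to NA.
Sys.unsetenv("KANWADICTPATH")
unset.value <- Sys.getenv("KANWADICTPATH", unset = NA)
print(is.na(unset.value))

# Hypothetical path to an alternative dictionary (an assumption for illustration).
alt.kanwadict <- file.path(tempdir(), "kanwadict")

# Option 1: set the environment variable for the rest of the session.
Sys.setenv(KANWADICTPATH = alt.kanwadict)
set.value <- Sys.getenv("KANWADICTPATH", unset = NA)
print(identical(set.value, alt.kanwadict))

# Option 2: pass the path as an argument (not run; requires Nippon
# and a real dictionary file at that location):
# kakasi(x, KANWADICTPATH = alt.kanwadict)

# Restore the original state of the environment variable.
if (is.na(old.value)) {
  Sys.unsetenv("KANWADICTPATH")
} else {
  Sys.setenv(KANWADICTPATH = old.value)
}
```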

Value

A character vector

Warning

Note that non-Japanese, non-ASCII characters are not filtered by kakasi. kakasi warns unless LC_CTYPE is "ja_JP.UTF-8" (Linux or Mac OS X) or "Japanese_Japan.932" (Windows). It is not known whether the function works in other locales.
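
Both points in this warning can be checked before and after a conversion. The following is a minimal sketch using base R only; has_non_ascii is a hypothetical helper written for this illustration and is not part of Nippon. It flags elements that still contain non-ASCII characters, which may be useful for inspecting kakasi output when the input contains non-Japanese, non-ASCII text.

```r
# Locales for which, according to the warning above, kakasi does not warn.
expected.locales <- c("ja_JP.UTF-8", "Japanese_Japan.932")

# Check the current LC_CTYPE setting of the session.
current.locale <- Sys.getlocale("LC_CTYPE")
locale.ok <- current.locale %in% expected.locales
print(locale.ok)

# Hypothetical helper: TRUE for elements containing any code point above 127.
has_non_ascii <- function(x) {
  vapply(
    x,
    function(s) any(utf8ToInt(enc2utf8(s)) > 127L),
    logical(1),
    USE.NAMES = FALSE
  )
}

# "caf\u00e9" contains a non-Japanese, non-ASCII character (e with acute accent).
remaining <- has_non_ascii(c("tokyo", "caf\u00e9"))
print(remaining)
```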

Note

Sys.kakasi was removed in Nippon ver.0.6.

The accuracy of Kanji-to-Kana conversion with kakasi is somewhat lower than with the MeCab program (http://mecab.sourceforge.net/). Although MeCab does not provide Kana-to-Romaji conversion, it could be an option if you need more accurate results. RMeCab is available from http://rmecab.jp/wiki/.

Windows users, please note that R on Windows can handle strings in both "ja_JP.UTF-8" and "Japanese_Japan.932"; however, kakasi works only with "Japanese_Japan.932". If you have UTF-8 encoded data on Windows, convert it to CP932 (the encoding used by "Japanese_Japan.932") as shown in the Examples.
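
For scripts that must run on both Windows and Unix-like systems, the conversion can be made conditional on the platform. This is only a sketch under the assumption that the input is UTF-8 encoded; to_kakasi_encoding is a hypothetical helper name and is not provided by Nippon.

```r
# Hypothetical helper: convert UTF-8 input to CP932 on Windows only,
# and return it unchanged on other platforms.
to_kakasi_encoding <- function(x) {
  if (.Platform$OS.type == "windows") {
    iconv(x, from = "UTF-8", to = "CP932")
  } else {
    x
  }
}

# "Tokyo" written in Kanji, given as Unicode escapes so the script stays ASCII.
x <- "\u6771\u4eac"
y <- to_kakasi_encoding(x)

# The prepared vector could then be passed on (not run; requires Nippon):
# kakasi(y)
```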

Author(s)

Susumu Tanimura aruminat@gmail.com

References

KAKASI - Kanji Kana Simple Inverter http://kakasi.namazu.org/

Examples

## Not run: 
library(Nippon)
data(prefectures)
regions <- unique(prefectures$region)
regions
# Unix-like operating systems
kakasi(regions)
# Windows
regions.cp932 <- iconv(regions, from = "UTF-8", to = "CP932")
kakasi(regions.cp932)

## End(Not run)

Nippon documentation built on May 2, 2019, 1:03 p.m.