posParallel: parallel version of part-of-speech tagger


Description

posParallel returns the part-of-speech (POS) tagged morphemes of each sentence.

Usage

posParallel(sentence, join = TRUE, format = c("list", "data.frame"),
  sys_dic = "", user_dic = "")

Arguments

sentence

A character vector of any length. For analyzing multiple sentences, put them in one character vector.

join

A logical value that decides the output format. The default value is TRUE. If FALSE, the function returns morphemes only, with the tags stored in the attribute. If format = "data.frame", this argument is ignored.

format

The data type of the result. The default value is "list". Set this to "data.frame" to get the result as a data frame.

sys_dic

The location of the system MeCab dictionary. The default value is "".

user_dic

The location of a user-specific MeCab dictionary. The default value is "".

Details

This is a parallelized version of the MeCab part-of-speech tagger. The function takes a character vector of any length and runs the loop inside C++ with Intel TBB for faster processing.

Parallelizing over a character vector is not supported by RcppParallel, so this function makes duplicates of the input and the output. If your data volume is large, use pos instead or divide the vector into several sub-vectors.
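One way to divide a large input into sub-vectors is sketched below. The helper name tag_in_chunks, its chunk_size argument, and the stand-in tagger are assumptions made for illustration and are not part of RcppMeCab; in practice you would pass posParallel as the tagger argument.

```r
# Hypothetical helper (not part of RcppMeCab): run a tagger over fixed-size
# chunks of a character vector and concatenate the per-chunk results.
tag_in_chunks <- function(sentences, tagger, chunk_size = 1000L) {
  stopifnot(is.character(sentences), chunk_size >= 1L)
  # Assign each element to a chunk: 1, 1, ..., 2, 2, ...
  chunk_id <- ceiling(seq_along(sentences) / chunk_size)
  chunks <- split(sentences, chunk_id)
  # Tag one chunk at a time, so any copies made by the tagger
  # cover a single chunk rather than the whole input
  results <- lapply(chunks, tagger)
  # Flatten the list of per-chunk lists into one list
  do.call(c, unname(results))
}

# Stand-in tagger so the sketch runs without MeCab installed; it only mimics
# the shape of the output (a list named by the original strings).
fake_tagger <- function(x) setNames(as.list(toupper(x)), x)

out <- tag_in_chunks(c("a", "b", "c", "d", "e"), fake_tagger, chunk_size = 2L)
length(out)
names(out)
```

A suitable chunk_size depends on available memory and sentence length; the default here is arbitrary.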

You can supply a user dictionary through user_dic. It must be compiled with mecab-dict-index. An explanation of how to compile a user dictionary is available at https://github.com/junhewk/RcppMeCab.

You can also set a system dictionary in sys_dic, which is especially useful if you are using multiple dictionaries (for example, both the IPA and Juman dictionaries for Japanese). Using options(mecabSysDic=), you can set your preferred system dictionary for the R session.
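A minimal sketch of setting the option for the current session follows. The dictionary path is the one used in the examples on this page and is only an assumption about your installation; replace it with your own location.

```r
# Assumed path (taken from the example on this page); adjust to your system.
neologd_dir <- "/usr/local/lib/mecab/dic/mecab-ipadic-neologd/"

# Store the preferred system dictionary for the current R session
options(mecabSysDic = neologd_dir)

# Check what is currently set
getOption("mecabSysDic")
```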

If you want morphemes only, use join = FALSE to move the tag names into the attribute. By default, the function returns a list of character vectors with (morpheme)/(tag) elements.
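If you already have conjoined output and want the morphemes and tags separately, the (morpheme)/(tag) elements can be split in base R. This is a sketch only: the helper split_joined and the placeholder input are hypothetical, and it assumes the tag is everything after the last slash.

```r
# Hypothetical helper (not part of RcppMeCab): split "(morpheme)/(tag)"
# strings at the last "/" into morphemes and tags.
split_joined <- function(x) {
  stopifnot(is.character(x))
  data.frame(
    morpheme = sub("/[^/]*$", "", x),  # drop the final "/tag"
    tag      = sub("^.*/", "", x),     # keep what follows the last "/"
    stringsAsFactors = FALSE
  )
}

# Placeholder strings standing in for one element of the returned list
joined <- c("morphA/TAG1", "morphB/TAG2+TAG3")
parts <- split_joined(joined)
parts$morpheme
parts$tag
```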

Value

A list of character vectors of POS-tagged morphemes is returned in conjoined (morpheme)/(tag) form. The element names of the list are the original phrases.

Examples

## Not run: 
sentence <- c("some UTF-8 text")  # replace with your own UTF-8 texts
posParallel(sentence)
posParallel(sentence, join = FALSE)
posParallel(sentence, format = "data.frame")
posParallel(sentence, user_dic = "~/user_dic.dic")
# System dictionary example: in case of using mecab-ipadic-NEologd
pos(sentence, sys_dic = "/usr/local/lib/mecab/dic/mecab-ipadic-neologd/")

## End(Not run)

RcppMeCab documentation built on May 2, 2019, 5:08 a.m.