This is a package for Chinese text segmentation, keyword extraction, and part-of-speech tagging.
You can use worker() to initialize a worker, and then use segment() to do the segmentation.
    library(jiebaR)

    ## Using default settings to initialize a worker.
    cutter = worker()

    ### Note: Can not display Chinese characters here.
    segment("This is a good day!", cutter)

    ## OR
    cutter["This is a good day!"]
You can also pass a file path as input.
    segment("./temp.dat", cutter) ### Auto encoding detection.
You can initialize multiple engines simultaneously.
    cutter2 = worker(type = "mix", dict = "some_path/jieba.dict.utf8",
                     hmm = "some_path/hmm_model.utf8",
                     user = "some_path/test.dict.utf8",
                     detect = T, symbol = F,
                     lines = 1e+05, output = NULL)
    cutter2 ### Print information of worker
    Worker Type:  Mix Segment

    Detect Encoding :  TRUE
    Default Encoding:  UTF-8
    Keep Symbols    :  FALSE
    Output Path     :
    Write File      :  TRUE
    Max Read Lines  :  1e+05

    Fixed Model Components:

    $dict
    [1] "dict/jieba.dict.utf8"

    $hmm
    [1] "dict/hmm_model.utf8"

    $user
    [1] "dict/test.dict.utf8"

    $detect $encoding $symbol $output $write $lines can be reset.
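The public settings of a worker can be read and modified with $, for example cutter$symbol = T. Private settings are fixed when the engine is initialized; you can read them the same way.

    cutter$encoding   ### Read a setting

    cutter$detect     ### Read the current value
    cutter$detect = F ### Modify a public setting
    cutter$detect     ### Read the new value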
You can use a custom dictionary. jiebaR is able to identify new words, but adding your own new words can ensure higher accuracy. imewlconverter is a good tool for building dictionaries.
    show_dictpath() ### Show path
    ?edit_dict()    ### For more information
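A minimal sketch of supplying a custom dictionary, assuming the user dictionary is a UTF-8 text file with one entry per line (a word, optionally followed by a part-of-speech tag). The file name, the cutter3 variable and the two entries are made up for illustration; user is the same argument passed to worker() earlier. The jiebaR part only runs when the package is installed.

```r
# Hypothetical user dictionary: one word per line, optionally followed by a tag.
# The file location and both entries are illustrative assumptions.
user_dict <- file.path(tempdir(), "my_user.dict.utf8")
entries <- c("jiebaR n", "simhash n")
writeLines(enc2utf8(entries), user_dict, useBytes = TRUE)

# Only run the jiebaR part when the package is available.
if (requireNamespace("jiebaR", quietly = TRUE)) {
  library(jiebaR)
  cutter3 <- worker(user = user_dict)  # load the custom words at initialization
  print(segment("jiebaR computes a simhash", cutter3))
}
```

Because the user dictionary is one of the fixed model components, it has to be chosen when the worker is created rather than changed afterwards.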
Part-of-speech tagging
The tagging worker tags each word in a sentence after segmentation, using labels compatible with ICTCLAS.
    words = "hello world"
    tagger = worker("tag")
    tagger[words]
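Once words are tagged, a common next step is to keep only certain tags. The sketch below assumes the tagging result is a character vector of words whose names() hold the tags; the helper words_with_tag and the hand-built input are hypothetical and are not real jiebaR output.

```r
# Assumption: `tagged` is a character vector of words whose names() are the tags.
words_with_tag <- function(tagged, tags) {
  unname(tagged[names(tagged) %in% tags])
}

# Illustrative input built by hand (tags chosen for the example, not real output).
tagged <- c(n = "day", a = "good", n = "weather", v = "is")

nouns <- words_with_tag(tagged, "n")
print(nouns)
```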
The keyword extraction worker uses the MixSegment model to segment the text and the TF-IDF algorithm to find the keywords.
    keys = worker("keywords", topn = 1)
    keys <= "words of fun"
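To make the TF-IDF step concrete, here is a minimal base R sketch that scores words that have already been segmented. It is not jiebaR's implementation: the idf table, the fallback weight for unknown words and the function name top_keywords are all assumptions made for illustration.

```r
# Score words by term frequency * inverse document frequency, keep the top n.
# `idf` is a hypothetical named numeric vector (word -> IDF weight).
top_keywords <- function(words, idf, topn = 1, default_idf = mean(idf)) {
  tf <- table(words)               # term frequency of each distinct word
  w <- idf[names(tf)]              # look up IDF; NA for unknown words
  w[is.na(w)] <- default_idf       # fall back to a default weight
  scores <- setNames(as.numeric(tf) * as.numeric(w), names(tf))
  head(sort(scores, decreasing = TRUE), topn)
}

words <- c("words", "of", "fun", "fun")
idf <- c(words = 2.0, of = 0.1, fun = 3.0)  # made-up weights
top <- top_keywords(words, idf, topn = 1)
print(top)
```

A frequent word only ranks highly if its IDF weight is also high, which is why a common function word such as "of" scores low even when it occurs often.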
The simhash worker extracts keywords from the input. With distance(), it finds the keywords of two inputs and then computes the Hamming distance between their simhash fingerprints.
    words = "hello world"
    simhasher = worker("simhash", topn = 1)
    simhasher[words]

    distance("hello world", "hello world!", simhasher)
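The idea behind the simhash comparison can be sketched in base R. This is a simplified illustration, not the algorithm jiebaR ships: it uses a toy 16-bit hash (the hypothetical toy_hash helper) and equal keyword weights, where a real implementation would typically use a 64-bit hash and TF-IDF weights.

```r
# Toy 16-bit hash of a string, returned as a vector of 0/1 bits.
# Purely illustrative; not the hash function jiebaR uses.
toy_hash <- function(s, nbits = 16) {
  h <- 0
  for (code in utf8ToInt(s)) h <- (h * 31 + code) %% 2^nbits
  as.integer(intToBits(as.integer(h))[seq_len(nbits)])
}

# Simhash: each keyword votes +weight on bits that are 1 and -weight on bits
# that are 0; the fingerprint keeps a 1 wherever the total vote is positive.
simhash_bits <- function(keywords, weights, nbits = 16) {
  votes <- numeric(nbits)
  for (i in seq_along(keywords)) {
    bits <- toy_hash(keywords[i], nbits)
    votes <- votes + weights[i] * (2 * bits - 1)
  }
  as.integer(votes > 0)
}

# Hamming distance: number of bit positions where two fingerprints differ.
hamming <- function(a, b) sum(a != b)

fp1 <- simhash_bits(c("hello", "world", "again"), c(1, 1, 1))
fp2 <- simhash_bits(c("hello", "world", "again"), c(1, 1, 1))
fp3 <- simhash_bits(c("hello", "world", "there"), c(1, 1, 1))

print(hamming(fp1, fp2))  # identical inputs give distance 0
print(hamming(fp1, fp3))  # inputs sharing keywords tend to differ in few bits
```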