This is a package for Chinese text segmentation, keyword extraction, and part-of-speech tagging.
You can use worker() to initialize a worker, and then use [] or segment() to do the segmentation.
    library(jiebaR)

    ## Initialize a worker using the default settings.
    cutter = worker()

    ### Note: Chinese characters cannot be displayed here.
    segment("This is a good day!", cutter)

    ## OR
    cutter["This is a good day!"]
You can also pass a file path as input.
    segment("./temp.dat", cutter)   ### Auto encoding detection.
You can initialize multiple engines simultaneously.
    cutter2 = worker(type   = "mix",
                     dict   = "some_path/jieba.dict.utf8",
                     hmm    = "some_path/hmm_model.utf8",
                     user   = "some_path/test.dict.utf8",
                     detect = T,
                     symbol = F,
                     lines  = 1e+05,
                     output = NULL)
    cutter2   ### Print information of worker

    Worker Type:  Mix Segment

    Detect Encoding :  TRUE
    Default Encoding:  UTF-8
    Keep Symbols    :  FALSE
    Output Path     :
    Write File      :  TRUE
    Max Read Lines  :  1e+05

    Fixed Model Components:

    $dict
    [1] "dict/jieba.dict.utf8"

    $hmm
    [1] "dict/hmm_model.utf8"

    $user
    [1] "dict/test.dict.utf8"

    $detect $encoding $symbol $output $write $lines can be reset.
The public settings of the model can be read and modified with $, for example cutter$symbol = T. Private settings are fixed when the engine is initialized; you can read them with cutter$PrivateVariable, where PrivateVariable stands for the name of the setting.
    cutter$encoding
    cutter$detect
    cutter$detect = F
    cutter$detect
You can use a custom dictionary. jiebaR is able to identify new words, but adding your own new words can ensure higher accuracy. imewlconverter is a good tool for dictionary construction.
    show_dictpath()   ### Show path
    ?edit_dict()      ### For more information
The part-of-speech tagging functions [.tagger and tagging() tag each word in a sentence after segmentation, using labels compatible with ICTCLAS.
    words = "hello world"
    tagger = worker("tag")
    tagger[words]
The keyword extraction worker uses the MixSegment model to segment the text and the TF-IDF algorithm to find the keywords.
    keys = worker("keywords", topn = 1)
    keys <= "words of fun"
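To make the TF-IDF idea concrete, here is a minimal base R sketch that does not call jiebaR. The segmented words and the IDF weights are made up for illustration, and the scoring (raw term count times IDF) is the textbook form, which may differ in detail from what the package does internally.

```r
# Made-up segmented document and made-up IDF weights (illustrative only).
doc_words <- c("words", "of", "fun", "fun")
idf <- c(words = 2.0, of = 0.1, fun = 3.0)

# Term frequency: raw count of each distinct word.
tf <- table(doc_words)

# TF-IDF score per word: count multiplied by that word's IDF weight.
scores <- as.numeric(tf) * idf[names(tf)]

# Keep the topn highest-scoring words, mirroring the topn argument above.
topn <- 1
top_keywords <- head(sort(scores, decreasing = TRUE), topn)
print(top_keywords)
```

A word that is frequent in the document but rare in general (a high IDF weight) gets the highest score, which is why "fun" ranks first in this toy input.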
The Simhash worker performs keyword extraction on its input. The distance() function finds the keywords of two inputs and then computes the Hamming distance between them.
    words = "hello world"
    simhasher = worker("simhash", topn = 1)
    simhasher[words]

    distance("hello world", "hello world!", simhasher)
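For readers unfamiliar with Hamming distance, the following base R sketch shows the computation on two made-up bit vectors. The helper hamming_distance() and the 8-bit fingerprints are hypothetical and only illustrate the idea; they are not jiebaR's internal implementation, and real simhash fingerprints are longer.

```r
# Hypothetical helper: number of positions at which two equal-length
# bit vectors differ.
hamming_distance <- function(a, b) {
  stopifnot(length(a) == length(b))
  sum(a != b)
}

# Two made-up 8-bit fingerprints (illustrative only).
fp1 <- c(1, 0, 1, 1, 0, 0, 1, 0)
fp2 <- c(1, 1, 1, 0, 0, 0, 1, 1)

d <- hamming_distance(fp1, fp2)
print(d)
```

A smaller distance means the two fingerprints, and so the two inputs, are more alike; identical inputs give a distance of 0.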
See https://jiebaR.qinwf.com/
https://github.com/qinwf/jiebaR
https://github.com/yanyiwu/cppjieba