Quick Start Guide - jiebaR

Chinese Version

This is a package for Chinese text segmentation, keyword extraction and speech tagging.

Example

Text Segmentation

You can use worker() to initialize a worker, and then use [] or segment() to do the segmentation.

library(jiebaR)

##  Using default settings to initialize a worker.
cutter = worker()

###  Note: Can not display Chinese characters here.

segment( "This is a good day!" , cutter )

## OR cutter["This is a good day!"]

You can use file path as input.

segment( "./temp.dat" , cutter ) ### Auto encoding detection.

You can initialize multiple engines simultaneously.

cutter2 = worker(type  = "mix",
                 dict = "some_path/jieba.dict.utf8",
                 hmm   = "some_path/hmm_model.utf8",
                 user  = "some_path/test.dict.utf8",
                 detect=T,      symbol = F,
                 lines = 1e+05, output = NULL
                 )
cutter2   ### Print information of worker
Worker Type:  Mix Segment

Detect Encoding :  TRUE
Default Encoding:  UTF-8
Keep Symbols    :  FALSE
Output Path     :
Write File      :  TRUE
Max Read Lines  :  1e+05

Fixed Model Components:

$dict
[1] "dict/jieba.dict.utf8"

$hmm
[1] "dict/hmm_model.utf8"

$user
[1] "dict/test.dict.utf8"

$detect $encoding $symbol $output $write $lines can be reset.

The public settings of the model can be modified by $ cutter$symbol = T. Private settings are fixed when the engine is initialized, and you can get them by cutter$PrivateVarible.

cutter$encoding
cutter$detect
cutter$detect = F
cutter$detect

You can use custom dictionar. jiebaR is able to identify new words, but adding your own new words can ensure a higher accuracy. imewlconverter is a good tools for dictionary construction.

show_dictpath() ### Show path
?edit_dict()   ### For more information

Speech Tagging

Speech Tagging function [.tagger and tagging tag each word in a sentence after segmentation, using labels compatible with ictclas.

words = "hello world"
tagger = worker("tag")
tagger[words]

Keyword Extraction

Keyword Extraction worker use MixSegment model to cut word and use TF-IDF algorithm to find the keywords.

keys = worker("keywords", topn = 1)
keys <= "words of fun"

Simhash Distance

Simhash worker can do keyword extraction and find the keywords from two inputs, and then computes Hamming distance between them.

 words = "hello world"
 simhasher = worker("simhash",topn=1)
 simhasher[words]
distance("hello world" , "hello world!" , simhasher)

More Docs

See https://jiebaR.qinwf.com/

More Information and Issues

https://github.com/qinwf/jiebaR

https://github.com/yanyiwu/cppjieba



Try the jiebaR package in your browser

Any scripts or data that you put into this service are public.

jiebaR documentation built on Dec. 16, 2019, 1:19 a.m.