worker: Initialize jiebaR worker
In jiebaR: Chinese Text Segmentation

Description

This function initializes jiebaR workers. You can initialize different kinds of workers, including mix, mp, hmm, query, full, tag, simhash, and keywords. See Details for more information.

Usage

worker(type = "mix", dict = DICTPATH, hmm = HMMPATH,
  user = USERPATH, idf = IDFPATH, stop_word = STOPPATH, write = T,
  qmax = 20, topn = 5, encoding = "UTF-8", detect = T,
  symbol = F, lines = 1e+05, output = NULL, bylines = F,
  user_weight = "max")

Arguments

`type`	The type of jiebaR workers including `mix`, `mp`, `hmm`, `full`, `query`, `tag`, `simhash`, and `keywords`.
`dict`	A path to main dictionary, default value is `DICTPATH`, and the value is used for `mix`, `mp`, `query`, `full`, `tag`, `simhash` and `keywords` workers.
`hmm`	A path to Hidden Markov Model, default value is `HMMPATH`, and the value is used for `mix`, `hmm`, `query`, `tag`, `simhash` and `keywords` workers.
`user`	A path to user dictionary, default value is `USERPATH`, and the value is used for `mix`, `full`, `tag` and `mp` workers.
`idf`	A path to inverse document frequency, default value is `IDFPATH`, and the value is used for `simhash` and `keywords` workers.
`stop_word`	A path to stop word dictionary, default value is `STOPPATH`, and the value is used for `simhash`, `keywords`, `tagger` and `segment` workers. Encoding of this file is checked by `file_coding`, and it should be UTF-8 encoding. For `segment` workers, the default `STOPPATH` will not be used, so you should provide another file path.
`write`	Whether to write the output to a file or return the result in an object. This value is only used when the input is a file path. The default value is TRUE. The value is used for segmentation and speech tagging workers.
`qmax`	Max query length of words, and the value is used for `query` workers.
`topn`	The number of keywords, and the value is used for `simhash` and `keywords` workers.
`encoding`	The encoding of the input file. If encoding detection is enabled, the value of `encoding` will be ignored.
`detect`	Whether to detect the encoding of the input file using the `file_coding` function. If encoding detection is enabled, the value of `encoding` will be ignored.
`symbol`	Whether to keep symbols in the sentence.
`lines`	The maximal number of lines to read at one time when input is a file. The value is used for segmentation and speech tagging workers.
`output`	A path to the output file. By default the worker generates a file name from the system time stamp. The value is used for segmentation and speech tagging workers.
`bylines`	Whether to return the result line by line, following the lines of the input file.
`user_weight`	The weight of the user dictionary words: "min", "max" or "median".

Details

The package uses initialized engines for word segmentation, and you can initialize multiple engines simultaneously. You can also reset an engine's public settings using $, for example WorkerName$symbol = T. Some private settings are fixed when an engine is initialized, and you can get them with WorkerName$PrivateVariable.
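A minimal sketch of working with several engines and changing a public setting after initialization. It assumes the jiebaR package is installed and uses only `worker()`, `segment()` and the `symbol` setting described on this page; the variable names are illustrative.

```r
library(jiebaR)

# Two independent engines can coexist in the same session.
mix_engine <- worker()              # default "mix" worker
hmm_engine <- worker(type = "hmm")  # HMM-only worker

# Public settings can be changed after initialization with $.
mix_engine$symbol <- TRUE

# Segment the same input with both engines.
segment("hello, world", mix_engine)
segment("hello, world", hmm_engine)
```

Changing `symbol` on `mix_engine` leaves `hmm_engine` untouched, because each worker keeps its own settings.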

The Maximum probability segmentation model uses a trie to construct a directed acyclic graph of candidate words and applies a dynamic programming algorithm to it. It is the core segmentation algorithm. dict and user should be provided when initializing a jiebaR worker.
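The dynamic programming idea can be illustrated with a toy sketch in base R. This is not jiebaR's implementation: a plain named vector stands in for the trie, the frequencies in `toy_dict` are made up, and the penalty for unknown single characters is an arbitrary assumption.

```r
# Toy dictionary: word -> frequency (hypothetical values, not jiebaR's).
toy_dict <- c(a = 5, b = 5, ab = 20, c = 5, bc = 2, abc = 1)

max_prob_segment <- function(text, dict) {
  chars <- strsplit(text, "")[[1]]
  n <- length(chars)
  if (n == 0) return(character(0))
  log_prob <- log(dict / sum(dict))
  unknown <- log(0.5 / sum(dict))  # assumed penalty for an unknown character
  best <- c(0, rep(-Inf, n))       # best[i + 1]: best score for the first i chars
  back <- integer(n)               # back[i]: start of the last word ending at i
  for (end in seq_len(n)) {
    for (start in seq_len(end)) {
      word <- paste(chars[start:end], collapse = "")
      if (word %in% names(dict)) {
        score <- log_prob[[word]]
      } else if (start == end) {
        score <- unknown
      } else {
        next  # multi-character strings outside the dictionary are not edges
      }
      total <- best[start] + score
      if (total > best[end + 1]) {
        best[end + 1] <- total
        back[end] <- start
      }
    }
  }
  # Follow the back-pointers from the end to recover the words.
  words <- character(0)
  end <- n
  while (end > 0) {
    start <- back[end]
    words <- c(paste(chars[start:end], collapse = ""), words)
    end <- start - 1
  }
  words
}

max_prob_segment("abc", toy_dict)  # returns c("ab", "c")
```

With these frequencies the path "ab" + "c" has the highest total log-probability, so it wins over "a" + "bc" and over the single word "abc".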

The Hidden Markov Model segmenter uses an HMM to determine the status set and observed set of words. The default HMM model is based on the People's Daily language library. hmm should be provided when initializing a jiebaR worker.

The MixSegment model uses both the Maximum probability segmentation model and the Hidden Markov Model to construct the segmentation. dict, hmm and user should be provided when initializing a jiebaR worker.

QuerySegment model uses MixSegment to construct segmentation and then enumerates all the possible long words in the dictionary. dict, hmm and qmax should be provided when initializing jiebaR worker.

The FullSegment model enumerates all the possible words in the dictionary.
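As a rough illustration of full enumeration, the base R sketch below lists every substring of the input that appears in a word list. The function `full_segment` and its toy word list are hypothetical and only mirror the idea; jiebaR's own worker is created with `worker("full")`.

```r
# Enumerate every dictionary word that occurs as a substring of the text.
full_segment <- function(text, dict_words) {
  chars <- strsplit(text, "")[[1]]
  n <- length(chars)
  found <- character(0)
  for (start in seq_len(n)) {
    for (end in start:n) {
      candidate <- paste(chars[start:end], collapse = "")
      if (candidate %in% dict_words) found <- c(found, candidate)
    }
  }
  found
}

full_segment("abc", c("a", "ab", "bc", "abc", "c"))
# returns c("a", "ab", "abc", "bc", "c")
```

Unlike the maximum probability model, overlapping words are all kept, so the result is a list of candidates rather than a single segmentation.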

The Speech Tagging worker uses the MixSegment model to cut words and tags each word after segmentation using labels compatible with ictclas. dict, hmm and user should be provided when initializing a jiebaR worker.

The Keyword Extraction worker uses the MixSegment model to cut words and uses the TF-IDF algorithm to find the keywords. dict, hmm, idf, stop_word and topn should be provided when initializing a jiebaR worker.
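The TF-IDF ranking step can be sketched in base R on an already segmented word vector. The function `tfidf_keywords`, the IDF values and the median fallback for words missing from the IDF table are assumptions made for illustration, not the package's internals.

```r
# Score segmented words by term frequency times inverse document frequency.
tfidf_keywords <- function(words, idf, stop_words = character(0), topn = 5) {
  kept <- words[!(words %in% stop_words)]  # drop stop words first
  if (length(kept) == 0) return(numeric(0))
  tf <- table(kept)
  terms <- names(tf)
  # Words missing from the IDF table fall back to the median IDF (an assumption).
  term_idf <- ifelse(terms %in% names(idf), idf[terms], median(idf))
  scores <- as.numeric(tf) * term_idf
  names(scores) <- terms
  head(sort(scores, decreasing = TRUE), topn)
}

toy_words <- c("the", "cat", "sat", "on", "the", "mat", "cat")
toy_idf <- c(the = 0.1, cat = 2, sat = 1.5, on = 0.2, mat = 2.5)  # made-up values

tfidf_keywords(toy_words, toy_idf, stop_words = c("the", "on"), topn = 2)
# returns a named vector: cat = 4, mat = 2.5
```

Here `topn` plays the same role as the `topn` argument of `worker()`: it caps how many of the highest scoring words are returned.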

The Simhash worker uses the keyword extraction worker to find the keywords and uses the simhash algorithm to compute the simhash. dict, hmm, idf and stop_word should be provided when initializing a jiebaR worker.
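A simplified sketch of the simhash idea in base R: each keyword contributes a bit fingerprint weighted by its score, the weighted votes are combined per bit, and two documents are compared by Hamming distance. The 4-bit fingerprints and weights below are invented for illustration; a real simhash derives much longer fingerprints from a hash function.

```r
# Combine weighted keyword fingerprints into one simhash-style fingerprint.
# fingerprints: 0/1 matrix with one row per keyword; weights: one per row.
simhash_bits <- function(fingerprints, weights) {
  signed <- ifelse(fingerprints == 1, 1, -1) * weights  # +w for 1 bits, -w for 0 bits
  as.integer(colSums(signed) > 0)                       # majority vote per bit
}

# Number of bit positions in which two fingerprints differ.
hamming_distance <- function(a, b) sum(a != b)

toy_fingerprints <- rbind(c(1, 0, 1, 1),
                          c(0, 0, 1, 0),
                          c(1, 1, 0, 0))
toy_weights <- c(3, 1, 2)  # e.g. keyword scores, made up here

doc_hash <- simhash_bits(toy_fingerprints, toy_weights)  # c(1L, 0L, 1L, 0L)
hamming_distance(doc_hash, c(1L, 1L, 1L, 1L))            # 2
```

A small Hamming distance between two fingerprints suggests that the two inputs share most of their heavily weighted keywords.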

Value

This function returns an environment containing the segmentation settings and the worker. Public settings can be modified using $.

Examples

### Note: Can not display Chinese characters here.
## Not run: 
words = "hello world"
engine1 = worker()
segment(words, engine1)

# "./temp.txt" is a file path

segment("./temp.txt", engine1)

engine2 = worker("hmm")
segment("./temp.txt", engine2)

engine2$write = T
segment("./temp.txt", engine2)

engine3 = worker(type = "mix", dict = "dict_path", symbol = T)
segment("./temp.txt", engine3)
 
## End(Not run)

## Not run: 
### Keyword Extraction
engine = worker("keywords", topn = 1)
keywords(words, engine)

### Speech Tagging
tagger = worker("tag")
tagging(words, tagger)

### Simhash
simhasher = worker("simhash", topn = 1)
simhash(words, simhasher)
distance("hello world" , "hello world!" , simhasher)

show_dictpath()

## End(Not run)

jiebaR documentation built on April 4, 2025, 2:41 a.m.