segment: Segment a Chinese sentence into words.


Description

A wrapper around the ansj Chinese word segmentation package. This function supports all segmentation methods of ansj: NlpAnalysis, ToAnalysis, BaseAnalysis, DicAnalysis, IndexAnalysis, FastIndexAnalysis. For more details about these methods, see https://github.com/NLPchina/ansj_seg

Usage

segment(str, method = "nlp", nature = FALSE, stopwords = FALSE,
  naturesInclude = NULL, naturesRemove = NULL, nosymbol = TRUE,
  returnType = "tm")
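
A minimal usage sketch built only from the signature above. It assumes the package is installed under the name ransj (taken from the footer of this page) and that whatever setup ansj needs, such as a Java runtime, is already in place; neither is described here. The sample sentence is arbitrary, and no output is shown because it depends on the ansj dictionaries in use.

```r
# Arbitrary sample sentence used only for illustration.
sentence <- "我爱北京天安门"

# Assumption: the package is installed as "ransj". The calls are skipped
# when it is not available, so the script still runs without it.
if (requireNamespace("ransj", quietly = TRUE)) {
  library(ransj)

  # Words as a character vector, using the "to" method.
  words <- segment(sentence, method = "to", returnType = "vector")
  print(words)

  # Default method ("nlp") with POS tagging switched on.
  tagged <- segment(sentence, nature = TRUE)
  print(tagged)
}
```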

Arguments

str

A character string containing the Chinese sentence to be segmented.

method

A character string selecting one segmentation method from:

  • "nlp": NlpAnalysis of ansj. The most accurate segmentation method in ansj. A user-defined dictionary can be used. Recognition of numbers, person names, organization names and new words is supported.

  • "dic": DicAnalysis of ansj. The user-defined dictionary takes higher priority.

  • "base": BaseAnalysis of ansj. Sentences are segmented into very short words. The user-defined dictionary is not used by this method. Number recognition is included; recognition of person names, organization names and new words is not supported.

  • "to": ToAnalysis of ansj. More accurate than the "base" method. The user-defined dictionary is used by this method. Recognition of numbers and person names is included; recognition of organization names and new words is not supported.

  • "index": IndexAnalysis of ansj. This method is intended for building search indexes.

  • "fastIndex": FastIndexAnalysis of ansj. Faster than the "index" method.

nature

Logical. Whether to tag each word with its part of speech (POS). Default = FALSE.

stopwords

Logical. Whether to remove stopwords from the result. Default = FALSE. Call insertStopwords before setting stopwords = TRUE.

naturesInclude

Character vector. The natures (POS tags) to keep in the result. Default = NULL.

naturesRemove

Character vector. The natures to be removed from the result. Default = NULL. Note that if naturesInclude and naturesRemove are both non-NULL, naturesRemove is ignored.
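
The precedence between naturesInclude and naturesRemove can be pictured with a small base R sketch. This is not the package's implementation: filter_natures is a hypothetical helper, and the words and nature tags in tokens are made-up values rather than ansj output. It only mirrors the rule stated above.

```r
# Hypothetical token table: illustrative words and nature tags,
# not real output from segment().
tokens <- data.frame(
  word   = c("我", "爱", "北京", "。"),
  nature = c("r", "v", "ns", "w"),
  stringsAsFactors = FALSE
)

# Hypothetical helper that mimics the documented precedence rule.
filter_natures <- function(tokens, naturesInclude = NULL, naturesRemove = NULL) {
  if (!is.null(naturesInclude)) {
    # naturesInclude is set, so naturesRemove is ignored entirely.
    return(tokens[tokens$nature %in% naturesInclude, , drop = FALSE])
  }
  if (!is.null(naturesRemove)) {
    # Only naturesRemove is set: drop the listed natures.
    return(tokens[!(tokens$nature %in% naturesRemove), , drop = FALSE])
  }
  # Neither filter is set: return everything.
  tokens
}

# Both arguments given: "v" stays in the result because naturesRemove is ignored.
both <- filter_natures(tokens, naturesInclude = c("v", "ns"), naturesRemove = "v")
print(both$nature)

# Only naturesRemove given: the symbol row tagged "w" is dropped.
removed <- filter_natures(tokens, naturesRemove = "w")
print(removed$nature)
```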

nosymbol

Logical. Whether to keep symbols in the result. Default = TRUE.

returnType

One of c("vector", "tm"). "vector" returns a character vector containing the words. "tm" returns a single character string in which the words are separated by spaces.
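
As a rough illustration of how the two return shapes relate, the base R sketch below builds a "tm"-style string from a "vector"-style result and splits it back. The words are made up for the example and are not ansj output; segment() itself is not called.

```r
# Illustrative words only: the shape returnType = "vector" is described as returning.
words <- c("我", "爱", "北京")

# The shape described for returnType = "tm": one string, words separated by spaces.
tm_string <- paste(words, collapse = " ")

# Splitting on the space recovers the individual words.
back <- strsplit(tm_string, " ", fixed = TRUE)[[1]]

print(length(words))      # three elements
print(length(tm_string))  # a single string
```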

Value

A character string or a character vector, depending on the parameter returnType.


Juntai/ransj documentation built on May 8, 2019, 4:42 p.m.