chinese.misc-package: Miscellaneous Tools for Chinese Text Mining and More


Description

This package aims to make the basic tasks of Chinese text mining more efficient. The manual in Chinese is available at https://github.com/githubwwwjjj/chinese.misc. Compared with other packages and functions, this package emphasizes three points: (1) It saves users' time. (2) It reduces errors: it tolerates and corrects input errors where it can, and gives meaningful error messages where it cannot. (3) Although the functions in this package depend on tm and stringi, several steps and argument values have been specially set to facilitate processing Chinese text. For example, corp_or_dtm creates a corpus or a document term matrix; users only need to supply folder names or file names, and the function automatically detects file encoding, segments terms, modifies texts, and removes stop words. txt2csv and csv2txt convert texts between formats and do some data cleaning. There are also functions for object class assertion and coercion.
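
To give a sense of the manual work that corp_or_dtm is meant to save, the following is a minimal sketch of comparable steps written directly against tm and stringi. It is not the package's implementation: the choice of cleaning steps, their order, and the wordLengths setting are assumptions made for illustration, and Chinese word segmentation and stop word removal are not shown.

```r
library(tm)
library(stringi)

# Illustrative input; English is used in place of Chinese text.
x <- c("Drink a cup of coffee!", "drink some water, 2 cups")

# Guess the encoding of the first text. stri_enc_detect() returns, for each
# input, a data frame of candidate encodings ordered by confidence.
enc_guess <- stri_enc_detect(x[1])[[1]]$Encoding[1]

# Build a corpus from the character vector and clean it step by step.
corp <- VCorpus(VectorSource(x))
corp <- tm_map(corp, content_transformer(tolower))
corp <- tm_map(corp, removePunctuation)
corp <- tm_map(corp, removeNumbers)
corp <- tm_map(corp, stripWhitespace)

# Keep terms of any length (assumption: tm's default drops terms shorter
# than three characters, which is rarely wanted for Chinese words).
dtm <- DocumentTermMatrix(corp, control = list(wordLengths = c(1, Inf)))
```

Each of these calls has to be written and ordered by hand here; the description above indicates that corp_or_dtm wraps this kind of pipeline behind a single call.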

Author(s)

Jiang Wu

Examples

require(tm)
library(chinese.misc)
# Chinese characters are not allowed here, so English text
# is used instead.
# Make a document term matrix in one step; the user needs to
# set only a few arguments.
x <- c(
  "Hello, what do you want to drink?", 
  "drink a bottle of milk", 
  "drink a cup of coffee", 
  "drink some water", 
  "hello, drink a cup of coffee")
dtm <- corp_or_dtm(x, from = "v", type = "dtm")
# Coerce a list containing data frames and other lists
# into a single character vector
df <- data.frame(matrix(c(66, 77, NA, 99), nrow = 2))
l <- list(a = 1:4, b = factor(c(10, 20, NA, 30)), c = c('x', 'y', NA, 'z'), d = df)
l2 <- list(l, l, cha = c('a', 'b', 'c'))
as.character2(l2)

Example output

Loading required package: tm
Loading required package: NLP
CHECKING ARGUMENTS
PROCESSING CHARACTER VECTOR
GENERATING CORPUS
PROCESSING CORPUS
MAKING DTM/TDM
DONE
Warning messages:
1: In Sys.setlocale(category = "LC_COLLATE", s_right_locale) :
  OS reports request to set locale to "zh_CN.UTF-8" cannot be honored
2: In Sys.setlocale(category = "LC_CTYPE", s_right_locale) :
  OS reports request to set locale to "zh_CN.UTF-8" cannot be honored
3: In tm_map.SimpleCorpus(corp, tm::removePunctuation) :
  transformation drops documents
4: In tm_map.SimpleCorpus(corp, tm::removeNumbers) :
  transformation drops documents
5: In tm_map.SimpleCorpus(corp, tm::content_transformer(tolower)) :
  transformation drops documents
6: In tm_map.SimpleCorpus(corp, tm::stripWhitespace) :
  transformation drops documents
 [1] "1"  "2"  "3"  "4"  "10" "20" NA   "30" "x"  "y"  NA   "z"  "66" "77" NA  
[16] "99" "1"  "2"  "3"  "4"  "10" "20" NA   "30" "x"  "y"  NA   "z"  "66" "77"
[31] NA   "99" "a"  "b"  "c" 

chinese.misc documentation built on Sept. 13, 2020, 5:13 p.m.