This vignette shows how to work with Chinese language materials using the
corpus package. It's based on Haiyan Wang's rOpenSci demo and assumes you
have httr, stringi, and wordcloud installed.
We'll start by loading the package and setting a seed to ensure reproducible results:

library("corpus")
set.seed(100)
First, download a stop word list suitable for Chinese, the Baidu stop words:

cstops <- "https://raw.githubusercontent.com/ropensci/textworkshop17/master/demos/chineseDemo/ChineseStopWords.txt"
csw <- paste(readLines(cstops, encoding = "UTF-8"), collapse = "\n") # download
csw <- gsub("\\s", "", csw) # remove whitespace
stop_words <- strsplit(csw, ",")[[1]] # extract the comma-separated words
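The parsing steps above can be tried offline on a small inline sample. This is only a sketch: the sample lines are made up for illustration (the four words are borrowed from the stop word list printed later in this vignette), and the names `sample_lines`, `csw_demo`, and `stop_demo` are hypothetical.

```r
# Hypothetical sample standing in for the downloaded file: comma-separated
# words with stray spaces and a line break (not the real file contents).
sample_lines <- c("\u6309, \u6309\u7167,", "\u4ffa ,\u4ffa\u4eec")

csw_demo <- paste(sample_lines, collapse = "\n")  # join the lines into one string
csw_demo <- gsub("\\s", "", csw_demo)             # remove spaces and newlines
stop_demo <- strsplit(csw_demo, ",")[[1]]         # split on the commas

print(stop_demo)
```

Removing whitespace before splitting means that words separated by ", " or by a line break end up as clean entries with no leading or trailing blanks.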
Next, download some demonstration documents. These are in plain text format, encoded in UTF-8.

gov_reports <- "https://api.github.com/repos/ropensci/textworkshop17/contents/demos/chineseDemo/govReports"
raw <- httr::GET(gov_reports)
paths <- sapply(httr::content(raw), function(x) x$path)
names <- tools::file_path_sans_ext(basename(paths))
urls <- sapply(httr::content(raw), function(x) x$download_url)
text <- sapply(urls, function(url) {
    paste(readLines(url, warn = FALSE, encoding = "UTF-8"), collapse = "\n")
})
names(text) <- names
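To see what the path and name handling does without touching the network, here is a minimal sketch that runs the same steps on a hand-built list. The list mimics the shape the code above expects from httr::content(raw), one entry per file with `path` and `download_url` fields; the file names and URLs are invented for illustration.

```r
# Hypothetical stand-in for httr::content(raw); names and URLs are invented.
listing <- list(
  list(path = "demos/chineseDemo/govReports/report-a.txt",
       download_url = "https://example.org/report-a.txt"),
  list(path = "demos/chineseDemo/govReports/report-b.txt",
       download_url = "https://example.org/report-b.txt")
)

# vapply() is a stricter sapply(): it fails if an entry is not a single string
paths_demo <- vapply(listing, function(x) x$path, character(1))
urls_demo  <- vapply(listing, function(x) x$download_url, character(1))

# Drop the directory part and the file extension to get document names
names_demo <- tools::file_path_sans_ext(basename(paths_demo))
print(names_demo)
```

Using vapply() here is a design choice for the sketch, not something the original code requires; it just surfaces malformed entries early.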
Corpus does not know how to tokenize languages with no spaces between words.
Fortunately, the ICU library (used internally by the stringi package) does,
by using a dictionary of words along with information about their relative
usage rates.

We use stringi's tokenizer, collect a dictionary of the word types,
and then manually insert zero-width spaces between tokens:

toks <- stringi::stri_split_boundaries(text, type = "word")
dict <- unique(c(toks, recursive = TRUE)) # unique words
text2 <- sapply(toks, paste, collapse = "\u200b")
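The following sketch applies the same split-and-join step to a single short sentence. The sample sentence is an assumption made up for illustration (it reuses words from the term tables in this vignette). The exact token boundaries depend on the ICU dictionary shipped with your stringi build, so no particular segmentation is claimed here.

```r
library(stringi)

# Hypothetical sample sentence, written with escapes so it is encoding-safe
sample_text <- "\u6211\u4eec\u7ee7\u7eed\u53d1\u5c55\u7ecf\u6d4e\u3002"

# Split at word boundaries; every character ends up in exactly one piece
toks_demo <- stri_split_boundaries(sample_text, type = "word")[[1]]
print(toks_demo)

# Join the pieces with zero-width spaces (U+200B)
joined <- paste(toks_demo, collapse = "\u200b")

# Removing the separators recovers the original text unchanged
restored <- stri_replace_all_fixed(joined, "\u200b", "")
print(identical(restored, sample_text))
```

Because the zero-width space is invisible when printed, the joined text looks the same as the original, but it now carries explicit token boundaries.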
We then put the tokenized text in a corpus data frame for convenient analysis:
data <- corpus_frame(name = names, text = text2)
We then specify a token filter to determine what is counted by other corpus
functions. Here we set combine = dict so that multi-word tokens get treated
as single entities:

f <- text_filter(drop_punct = TRUE, drop = stop_words, combine = dict)
(text_filter(data) <- f) # set the text column's filter

Text filter with the following options:

    map_case: TRUE
    map_quote: TRUE
    remove_ignorable: TRUE
    combine: chr [1:12033] "\n" "1954" "年" "政府" "工作" "报告" "—" ...
    stemmer: NULL
    stem_dropped: FALSE
    stem_except: NULL
    drop_letter: FALSE
    drop_number: FALSE
    drop_punct: TRUE
    drop_symbol: FALSE
    drop: chr [1:717] "按" "按照" "俺" "俺们" "阿" "别" "别人" "别处" ...
    drop_except: NULL
    connector: _
    sent_crlf: FALSE
    sent_suppress: chr [1:155] "A." "A.D." "a.m." "A.M." "A.S." "AA." ...
We can now compute type, token, and sentence counts:

text_stats(data)

   tokens types sentences
1    8694  2023       453
2   21079  2780       981
3    9079  1342       495
4   13009  2334       704
5    9347  1973       412
6   11640  2263       577
7    3889  1128       164
8    7303  1697       387
9    2020   839       125
10  11935  2744       659
11  10652  2412       505
12   5634  1300       342
13  12464  2588       671
14  11867  2478       585
15   9018  2267       487
16   7187  1976       413
17   5714  1519       292
18  10540  2149       481
19   8694  1895       405
20  11830  2429       653
⋮   (49 rows total)
and examine term frequencies:

(stats <- term_stats(data))

   term count support
1  发展  5627      49
2  经济  5036      49
3  社会  4255      49
4  建设  4248      49
5  人民  2897      49
6  主义  2817      49
7  工作  2642      49
8  企业  2627      49
9  国家  2595      49
10 加强  2438      49
11 生产  2407      49
12 年    2021      49
13 我国  1999      49
14 提高  1947      49
15 中    1860      49
16 增长  1800      49
17 化    1740      49
18 继续  1670      49
19 技术  1586      49
20 工业  1580      49
⋮  (11612 rows total)
These operations all use the text_filter(data) value we set above to determine
the token and sentence boundaries.
We can visualize word frequencies with a wordcloud. You may want to use a font suitable for Chinese ('STSong' is a good choice for Mac users). We switch to this font, create the wordcloud, then switch back.

font_family <- par("family") # the previous font family
par(family = "STSong") # change to a nice Chinese font
with(stats, {
    wordcloud::wordcloud(term, count, min.freq = 500,
                         random.order = FALSE, rot.per = 0.25,
                         colors = RColorBrewer::brewer.pal(8, "Dark2"))
})
par(family = font_family) # switch the font back
Here are the terms that show up in sentences containing a particular term:

sents <- text_split(data) # split text into sentences
subset <- text_subset(sents, '\u6539\u9769') # select those with the term
term_stats(subset) # count the word occurrences

   term count support
1  改革  2931    2457
2  发展   866     652
3  体制   768     649
4  经济  1016     639
5  推进   522     491
6  深化   473     469
7  社会   664     464
8  建设   513     391
9  制度   452     364
10 开放   389     353
11 企业   489     339
12 工作   301     268
13 积极   262     252
14 继续   260     251
15 管理   281     249
16 进行   239     232
17 加快   230     225
18 化     261     224
19 加强   248     224
20 主义   275     221
⋮  (3888 rows total)
The first term is the search query. It appears 2931 times in the corpus, in 2457 different sentences. The second term in the list appears in 652 of the 2457 sentences containing the search term. (I don't speak Chinese, but Google Translate tells me that the search term is "reform", and the second and third items in the list are "development" and "system".)
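To make the difference between the count and support columns concrete, here is a small base R sketch. It is not how corpus computes these values internally; the tokenized sentences are hypothetical English placeholders, and the sort order (support first, then count) is an assumption chosen to resemble the tables above.

```r
# Hypothetical tokenized sentences (placeholders, not taken from the corpus)
sentences <- list(
  c("reform", "system", "reform"),
  c("development", "reform"),
  c("system", "development")
)

# count: total number of occurrences across all sentences
count <- table(unlist(sentences))

# support: number of sentences containing the term at least once
support <- table(unlist(lapply(sentences, unique)))

stats_demo <- data.frame(
  term    = names(count),
  count   = as.integer(count),
  support = as.integer(support[names(count)])
)

# Assumed ordering: by support, then by count, both descending
stats_demo <- stats_demo[order(-stats_demo$support, -stats_demo$count), ]
print(stats_demo)
```

In this toy data "reform" occurs three times but only in two sentences, so its count exceeds its support, just as 2931 exceeds 2457 for the search term above.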
Finally, here's how we might show terms in their local context:

text_locate(data, "\u6027")

   text before                        instance after
1  1    …业方面的重要问题之一是计划   性       不足。我们现在还有许多计划不…
2  1    …技术和提高劳动生产率的积极   性       ,对于发展经济建设很有害,因…
3  1    …分表现了人民群众的政治积极   性       和政治觉悟的提高,充分表现了…
4  1    …器、氢武器和其他大规模毁灭   性       武器的愿望必须满足。这些都是…
5  2    …众在劳动战线上的高度的积极   性       和创造性,依靠全国人民在改革…
6  2    …战线上的高度的积极性和创造   性       ,依靠全国人民在改革土地制度…
7  2    …,已经分得了土地,生产积极   性       很高,何必实行合作化呢?我们…
8  2    …觉悟,充分地发挥群众的积极   性       和创造性,提高劳动生产率。\n…
9  2    …分地发挥群众的积极性和创造   性       ,提高劳动生产率。\n\n 我…
10 2    …必须照顾单干农户的生产积极   性       ,给单干农户以积极的帮助和领…
11 2    …重要办法。国家从缩减非生产   性       建设的支出和行政机关的经费等…
12 2    …,更重要的是发扬地方的积极   性       ,加强地方党政机关对农业的领…
13 2    …策,提高农民群众的生产积极   性       ,保证这个计划的实现。地方的…
14 2    …,刺激和发挥农民的经营积极   性       。\n\n 有些农村的地方国家…
15 2    …设以后,为了加强生产的计划   性       ,对许多重要原料,有的由国家…
16 2    …进一步地提高农民的增产积极   性       ,促进农业生产的发展。这对于…
17 2    …高农民特别是中农的生产积极   性       。\n\n 关于粮食的计划收购…
18 2    …害关系,从而更加积极地创造   性       地参加国家建设。\n\n 人们…
19 2    …,我们必须大大地削减非生产   性       建设的支出。几年来在非生产性…
20 2    …年计划中,工业部门的非生产   性       投资只占全部投资的百分之十四…
⋮  (1341 rows total)
Note: the alignment looks bad here because the Chinese characters have widths between 1 and 2 spaces each. The spacing in the table is set assuming that Chinese characters take exactly 2 spaces each. If you know how to set the font to make the widths agree, please contact me.
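The two-column assumption mentioned in the note can be inspected with stringi's stri_width(), which reports the nominal display width of a string in terminal columns. This is a minimal sketch of width-aware padding with made-up sample strings; it does not change how any particular font renders the characters, so it does not by itself fix the alignment problem.

```r
library(stringi)

# Made-up samples: Latin only, one Chinese character, and a mix
x <- c("ab", "\u6027", "a\u6027b")

print(stri_length(x))  # number of code points: 2 1 3
print(stri_width(x))   # nominal display columns: 2 2 4

# Pad each string with spaces up to a common display width
target <- max(stri_width(x))
padded <- paste0(x, strrep(" ", target - stri_width(x)))
print(stri_width(padded))
```

Whether the padded strings line up on screen still depends on the font drawing wide characters at exactly twice the width of narrow ones.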