create_reference_tcm: Create several reference tcms with different settings for...
In manuelbickel/textility: Utility functions for text mining

View source: R/create_ref_tcm.R

create_reference_tcm

R Documentation

Create several reference tcms with different settings for coherence calculation

Description

Wrapper around text2vec::create_tcm with settings suited to generating the reference tcms required by text2vec::coherence. Note that this documentation still needs improvement; in the meantime, see the example to understand what the function does.

Usage

create_reference_tcm(dtm, tokens_ext, ngram_order = NULL,
  tcm_specs = tcm_specs_standard(), dir_save = getwd())

Arguments

`dtm`	The dtm of the corpus under investigation.
`tokens_ext`	The token lists of the external reference corpus.
`ngram_order`	The maximum ngram order to be considered. Passed as c(1L, ngram_order) to text2vec::create_vocabulary. If not specified, the maximum ngram order found in dtm is used (this may or may not be reasonable).
`tcm_specs`	The specifications used to create the tcms. A data.table with one row per tcm to be created. The default creates 4 tcms. The specifications can be viewed by calling `tcm_specs_standard()`. To use custom specifications, amend or change the data.table returned by that call.
`dir_save`	The directory in which to save the tcms as `.rds` files. The function handles directory names with or without a trailing "\" or "/".
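The fallback for `ngram_order` relies on the ngram order that can be read off the dtm. A minimal sketch of one way to infer it from term names is given below. The helper `infer_ngram_order` is hypothetical and not part of textility; it assumes ngrams are joined with "_" (the default `sep_ngram` in text2vec::create_vocabulary) and that single tokens contain no "_" themselves. How the package performs this step internally is not documented here.

```r
# Hypothetical helper (not part of textility): infer the highest ngram order
# from term names, assuming ngrams are joined with "_" and that single
# tokens do not contain "_" themselves.
infer_ngram_order <- function(terms, sep = "_") {
  max(lengths(strsplit(terms, sep, fixed = TRUE)))
}

# terms as they might appear as column names of a dtm with uni- and bigrams
terms <- c("A", "B", "x", "C", "A_B", "B_x", "x_x", "x_C")
infer_ngram_order(terms)
# [1] 2
```

If tokens themselves contain the separator, this estimate is too high, which is one reason why setting `ngram_order` explicitly can be the safer choice.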

Value

The tcms are stored as separate `.rds` files in `dir_save`.
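Since the files end up in `dir_save` whether or not the directory name carries a trailing separator, the following sketch shows one way such path handling could look in base R. The helper `build_save_path` and the file name are hypothetical illustrations, not the package's actual code.

```r
# Hypothetical helper (not the package's code): build a file path inside
# dir_save regardless of whether the directory name ends in "/" or "\".
build_save_path <- function(dir_save, file_name) {
  dir_clean <- sub("[/\\\\]+$", "", dir_save)  # drop trailing separators
  file.path(dir_clean, file_name)
}

build_save_path("ref_tcms/", "tcm.rds")
build_save_path("ref_tcms", "tcm.rds")
# both calls return "ref_tcms/tcm.rds"
```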

Examples

library(text2vec)
library(textility)

doc = c("A B x x x x x x x C")
doc_ext = c("A x", "A x x x x x x x B")

tokens = word_tokenizer(doc)
it = itoken(tokens)
v = create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 2L))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)

# create a test directory for saving; it is removed at the end of this example
dir_test = paste0(getwd(), "/create_reference_tcm_test")
dir.create(dir_test)

tokens_ext = word_tokenizer(doc_ext)

create_reference_tcm(dtm
                     , tokens_ext
                     , dir_save = dir_test)
list.files(dir_test)
# [1] "tcm__standard_ref_1__ext__ws_5.rds"   "tcm__standard_ref_2__ext__ws_10.rds"
# [3] "tcm__standard_ref_3__ext__ws_110.rds" "tcm__standard_ref_4__int__ws_Inf.rds"

# check two of the created tcms
tcm5 = readRDS(paste0(dir_test, "/tcm__standard_ref_1__ext__ws_5.rds"))

attr(tcm5, "term_coverage_rate_tcm_dtm")
# [1] 0.75
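# compare with a manual calculation:
# share of unique corpus terms that also occur in the external corpus
tokens_doc = unique(unlist(strsplit(doc, " ")))
tokens_doc_ext = unique(unlist(strsplit(doc_ext, " ")))
sum(tokens_doc_ext %in% tokens_doc) / length(tokens_doc)
# [1] 0.75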

tcm5
# 3 x 3 sparse Matrix of class "dgTMatrix"
#   B A x
# B 1 . 5
# A . 2 2
# x . . 8

tcm110 = readRDS(paste0(dir_test, "/tcm__standard_ref_3__ext__ws_110.rds"))
attr(tcm110, "term_coverage_rate_tcm_dtm")
# [1] 0.75
tcm110
# 3 x 3 sparse Matrix of class "dgTMatrix"
#   B A x
# B 1 1 7
# A . 2 2
# x . . 8
# the difference shows the co-occurrences added by the larger window
tcm110 - tcm5
# 3 x 3 sparse Matrix of class "dgCMatrix"
#   B A x
# B 0 1 2
# A . 0 0
# x . . 0
# compare diagonal to input doc statistics
table(unlist(strsplit(doc_ext, " ")))
# A B x
# 2 1 8

# for the logic of counting co-occurrences, see:
# https://github.com/dselivanov/text2vec/issues/253

# delete test directory
unlink(dir_test, recursive = TRUE)
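
The manual comparison against the "term_coverage_rate_tcm_dtm" attribute in the example can be wrapped in a small function. This is only a sketch that mirrors that comparison; `term_coverage_rate` is a hypothetical helper, and how the package computes the attribute internally is not shown in this documentation.

```r
# Hypothetical helper: share of unique corpus terms that also occur
# among the terms of the reference corpus.
term_coverage_rate <- function(terms_corpus, terms_ref) {
  terms_corpus <- unique(terms_corpus)
  mean(terms_corpus %in% terms_ref)
}

doc <- c("A B x x x x x x x C")
doc_ext <- c("A x", "A x x x x x x x B")
term_coverage_rate(unlist(strsplit(doc, " ")), unlist(strsplit(doc_ext, " ")))
# [1] 0.75
```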

manuelbickel/textility documentation built on Nov. 25, 2022, 9:07 p.m.