create_reference_tcm: Create several reference tcms with different settings for...

View source: R/create_ref_tcm.R

create_reference_tcm    R Documentation

Create several reference tcms with different settings for coherence calculation


Wrapper around text2vec::create_tcm with settings suited to generating the reference tcms required by text2vec::coherence. Please note that the documentation requires improvement; in the meantime, the example shows what the function does.


create_reference_tcm(dtm, tokens_ext, ngram_order = NULL,
  tcm_specs = tcm_specs_standard(), dir_save = getwd())



dtm: The dtm of the corpus under investigation.


tokens_ext: The token lists of the external reference corpus.


ngram_order: The upper limit of the ngram order to be considered. Passed as c(1L, ngram_order) to text2vec::create_vocabulary. If not specified, the maximum ngram order found in dtm is used (this may or may not be reasonable).


tcm_specs: The specifications for creating the tcms: a data.table with one row per tcm to be created. The default creates 4 tcms. The specifications can be viewed by calling tcm_specs_standard(). To use custom specifications, amend the data.table returned by that call.


dir_save: The directory in which to save the tcms as .rds files. The function handles directory names with or without a trailing "\" or "/".
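
As a rough illustration of the ngram_order default, the highest ngram order present in a dtm can be estimated from its column names. The sketch below is not the package's implementation: the helper name max_ngram_order is hypothetical, and it assumes that ngrams are joined with "_" (the default sep_ngram of text2vec::create_vocabulary) and that no single token contains "_" itself.

```r
# Hypothetical helper, not part of textility: estimate the highest ngram order
# from term names, assuming the parts of an ngram are joined by "_".
max_ngram_order = function(terms, sep = "_") {
  max(lengths(strsplit(terms, sep, fixed = TRUE)))
}

# Term names as they might appear in colnames(dtm) for a unigram/bigram vocabulary
terms = c("A", "B", "x", "C", "A_B", "B_x", "x_x", "x_C")
max_ngram_order(terms)
# [1] 2
```

If the tokens themselves can contain "_", this estimate overstates the order, which is one reason to pass ngram_order explicitly.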


The tcms are stored as separate .rds files in dir_save.
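
One way to make a save path independent of a trailing "\" or "/" in the directory name is sketched below in base R. The helper build_path is hypothetical and only illustrates the idea; how the function handles this internally is not documented here.

```r
# Hypothetical helper, not the package's code: build a file path that is the
# same whether or not dir ends in "/" or "\".
build_path = function(dir, file) {
  dir = sub("[/\\\\]+$", "", dir)  # drop trailing slashes or backslashes
  file.path(dir, file)             # file.path joins with "/"
}

build_path("out/", "tcm.rds")
# [1] "out/tcm.rds"
build_path("out", "tcm.rds")
# [1] "out/tcm.rds"
```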


library(text2vec)
library(textility)

doc = c("A B x x x x x x x C")
doc_ext = c("A x", "A x x x x x x x B")

tokens = word_tokenizer(doc)
it = itoken(tokens)
v = create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 2L))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)

# specify test dir for saving that is removed at the end of this example
dir_test = paste0(getwd(), "/create_reference_tcm_test")

tokens_ext = word_tokenizer(doc_ext)

create_reference_tcm(dtm
                     , tokens_ext
                     , dir_save = dir_test)
# list the files written to the test directory
list.files(dir_test)
# [1] "tcm__standard_ref_1__ext__ws_5.rds"   "tcm__standard_ref_2__ext__ws_10.rds"
# [3] "tcm__standard_ref_3__ext__ws_110.rds" "tcm__standard_ref_4__int__ws_Inf.rds"

# check two of the created tcms
tcm5 = readRDS(paste0(dir_test, "/tcm__standard_ref_1__ext__ws_5.rds"))

attr(tcm5, "term_coverage_rate_tcm_dtm")
# [1] 0.75
# compare
tokens_doc = unique(unlist(strsplit(doc, " ")))
tokens_doc_ext = unique(unlist(strsplit(doc_ext, " ")))
sum(tokens_doc_ext %in% tokens_doc) / length(tokens_doc)
# [1] 0.75

tcm5
# 3 x 3 sparse Matrix of class "dgTMatrix"
#   B A x
# B 1 . 5
# A . 2 2
# x . . 8

tcm110 = readRDS(paste0(dir_test, "/tcm__standard_ref_3__ext__ws_110.rds"))
attr(tcm110, "term_coverage_rate_tcm_dtm")
# [1] 0.75
tcm110
# 3 x 3 sparse Matrix of class "dgTMatrix"
#   B A x
# B 1 1 7
# A . 2 2
# x . . 8

# difference between the two window sizes
tcm110 - tcm5
# 3 x 3 sparse Matrix of class "dgCMatrix"
#   B A x
# B 0 1 2
# A . 0 0
# x . . 0
# compare diagonal to input doc statistics
table(unlist(strsplit(doc_ext, " ")))
# A B x
# 2 1 8

# for logic of counting co-occurrence check:

# delete test directory
unlink(dir_test, recursive = TRUE)

manuelbickel/textility documentation built on Nov. 25, 2022, 9:07 p.m.