View source: R/create_ref_tcm.R
| create_reference_tcm | R Documentation |
Wrapper around text2vec::create_tcm with settings suitable for generating the reference tcms required by text2vec::coherence. Note that this documentation is still incomplete; in the meantime, the example shows what the function does.
create_reference_tcm(dtm, tokens_ext, ngram_order = NULL, tcm_specs = tcm_specs_standard(), dir_save = getwd())
dtm |
The dtm of the corpus under investigation. |
tokens_ext |
The token lists of the external reference corpus. |
ngram_order |
The maximum ngram order to be considered. Passed as c(1L, ngram_order) to text2vec::create_vocabulary. If not specified, the maximum ngram order found in dtm is used (this may or may not be reasonable). |
tcm_specs |
The specifications used to create the tcms: a data.table with one row per tcm to be created.
The default creates 4 tcms. Specifications can be viewed by calling:
|
dir_save |
The directory in which to save the tcms. |
Tcms are stored as separate files.
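The fallback for ngram_order described above can be pictured with a small base R sketch. It assumes that the dtm column names were built by text2vec with its default ngram separator "_"; the term vector below is made up for illustration, and the actual inference inside create_reference_tcm may differ.

```r
# Hypothetical dtm column names (illustration only);
# text2vec joins the parts of an ngram with "_" by default
terms <- c("A", "B", "x", "C", "A_B", "B_x", "x_x", "x_C")

# Number of "_"-separated parts per term = ngram order of that term
ngram_orders <- lengths(strsplit(terms, "_", fixed = TRUE))

# The largest order found would serve as the fallback for ngram_order
ngram_order_inferred <- max(ngram_orders)
ngram_order_inferred
# [1] 2
```

If terms in the corpus contain literal underscores, a count like this overestimates the order, which is one reason to set ngram_order explicitly.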
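library(text2vec) # provides word_tokenizer, itoken, create_vocabulary, vocab_vectorizer, create_dtm
doc = c("A B x x x x x x x C")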
doc_ext = c("A x", "A x x x x x x x B")
tokens = word_tokenizer(doc)
it = itoken(tokens)
v = create_vocabulary(it, ngram = c(ngram_min = 1L, ngram_max = 2L))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)
# specify test dir for saving that is removed at the end of this example
dir_test = paste0(getwd(), "/create_reference_tcm_test")
dir.create(dir_test)
tokens_ext = word_tokenizer(doc_ext)
create_reference_tcm(dtm
, tokens_ext
, dir_save = dir_test)
list.files(dir_test)
# [1] "tcm__standard_ref_1__ext__ws_5.rds" "tcm__standard_ref_2__ext__ws_10.rds"
# [3] "tcm__standard_ref_3__ext__ws_110.rds" "tcm__standard_ref_4__int__ws_Inf.rds"
# check two of the created tcms
tcm5 = readRDS(paste0(dir_test, "/tcm__standard_ref_1__ext__ws_5.rds"))
attr(tcm5, "term_coverage_rate_tcm_dtm")
# [1] 0.75
# compare with a manual calculation on unigrams:
# share of the terms in doc that also occur in doc_ext
tokens_doc = unique(unlist(strsplit(doc, " ")))
tokens_doc_ext = unique(unlist(strsplit(doc_ext, " ")))
sum(tokens_doc_ext %in% tokens_doc) / length(tokens_doc)
tcm5
# 3 x 3 sparse Matrix of class "dgTMatrix"
# B A x
# B 1 . 5
# A . 2 2
# x . . 8
tcm110 = readRDS(paste0(dir_test, "/tcm__standard_ref_3__ext__ws_110.rds"))
attr(tcm110, "term_coverage_rate_tcm_dtm")
# [1] 0.75
tcm110
# 3 x 3 sparse Matrix of class "dgTMatrix"
# B A x
# B 1 1 7
# A . 2 2
# x . . 8
tcm110 - tcm5
# 3 x 3 sparse Matrix of class "dgCMatrix"
# B A x
# B 0 1 2
# A . 0 0
# x . . 0
# compare diagonal to input doc statistics
table(unlist(strsplit(doc_ext, " ")))
# A B x
# 2 1 8
# for logic of counting co-occurrence check:
# https://github.com/dselivanov/text2vec/issues/253
# delete test directory
unlink(dir_test, recursive = TRUE)
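To make the printed matrices easier to read, here is a base R sketch of one counting scheme that reproduces the values shown for tcm5 and tcm110. It is not taken from text2vec; the function name count_window_cooc is hypothetical, and the scheme (term frequencies on the diagonal, and off the diagonal one count per token position for each distinct other term in its look-ahead window) is only an assumption inferred from the example output. The linked issue remains the authoritative description.

```r
# Hypothetical helper (not part of text2vec): counts, for every token position,
# each distinct other term that appears within the next `window_size` tokens.
count_window_cooc <- function(token_list, window_size) {
  terms <- sort(unique(unlist(token_list)), method = "radix")
  m <- matrix(0, length(terms), length(terms), dimnames = list(terms, terms))
  for (tokens in token_list) {
    n <- length(tokens)
    for (i in seq_len(n)) {
      # diagonal: plain term frequency
      m[tokens[i], tokens[i]] <- m[tokens[i], tokens[i]] + 1
      if (i == n) next
      # distinct terms in the look-ahead window, excluding the term itself
      ahead <- unique(tokens[(i + 1):min(i + window_size, n)])
      for (other in setdiff(ahead, tokens[i])) {
        # store each unordered pair once, in the upper triangle
        pair <- sort(c(tokens[i], other), method = "radix")
        m[pair[1], pair[2]] <- m[pair[1], pair[2]] + 1
      }
    }
  }
  m
}

tokens_ext <- strsplit(c("A x", "A x x x x x x x B"), " ", fixed = TRUE)
m5 <- count_window_cooc(tokens_ext, 5)
m110 <- count_window_cooc(tokens_ext, 110)
m5["B", "x"]   # 5, as in tcm5
m110["B", "x"] # 7, as in tcm110
m110["A", "B"] # 1, A and B are only joined by the wide window
```

The rows and columns here are ordered A, B, x rather than B, A, x as in the printed tcms, so compare entries by name rather than by position.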