View source: R/create_ref_tcm.R
create_reference_tcm | R Documentation |
Wrapper around text2vec::create_tcm with settings suitable for generating the reference tcms required by text2vec::coherence. Note that this documentation still needs improvement; in the meantime, see the example to understand what the function does.
create_reference_tcm(dtm, tokens_ext, ngram_order = NULL, tcm_specs = tcm_specs_standard(), dir_save = getwd())
dtm |
The dtm of the corpus under investigation. |
tokens_ext |
The token lists of the external reference corpus. |
ngram_order |
The maximum ngram order to be considered. Passed as c(1L, ngram_order) to text2vec::create_vocabulary. If not specified, the maximum ngram order found in dtm is used (this may or may not be reasonable). |
tcm_specs |
The specifications for creating the tcms: a data.table with one row per tcm to be created.
The default creates 4 tcms. The specifications can be viewed by calling tcm_specs_standard().
|
dir_save |
The directory in which the tcms are saved. |
Tcms are stored as separate files.
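The fallback described for ngram_order (using the maximum ngram order found in dtm) could be approximated as in the sketch below. It assumes that ngram terms in the dtm column names are joined with "_", which is the default sep_ngram of text2vec::create_vocabulary. The helper name max_ngram_order is hypothetical and not part of the package, and the function's actual implementation may differ.

```r
# Hypothetical helper (not part of the package): infer the maximum ngram
# order from dtm column names, ASSUMING ngrams are joined with "_".
max_ngram_order <- function(terms, sep = "_") {
  # number of separators in a term + 1 = number of tokens in that term
  n_sep <- lengths(regmatches(terms, gregexpr(sep, terms, fixed = TRUE)))
  max(n_sep) + 1L
}

# column names as they might look for a dtm built with unigrams and bigrams
terms <- c("A", "B", "x", "C", "A_B", "B_x", "x_x", "x_C")
max_ngram_order(terms)
# [1] 2
```

Unigrams that themselves contain the separator would be counted as higher-order ngrams by such a heuristic, which is one reason the inferred value may not be reasonable and ngram_order is better set explicitly.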
library(text2vec) # provides word_tokenizer, itoken, create_vocabulary, vocab_vectorizer, create_dtm

doc = c("A B x x x x x x x C")
doc_ext = c("A x", "A x x x x x x x B")
tokens = word_tokenizer(doc)
it = itoken(tokens)
v = create_vocabulary(it, ngram = c(1L, ngram_max = 2L))
vectorizer = vocab_vectorizer(v)
dtm = create_dtm(it, vectorizer)

# specify test dir for saving that is removed at the end of this example
dir_test = paste0(getwd(), "/create_reference_tcm_test")
dir.create(dir_test)
tokens_ext = word_tokenizer(doc_ext)
create_reference_tcm(dtm, tokens_ext, dir_save = dir_test)
list.files(dir_test)
# [1] "tcm__standard_ref_1__ext__ws_5.rds"   "tcm__standard_ref_2__ext__ws_10.rds"
# [3] "tcm__standard_ref_3__ext__ws_110.rds" "tcm__standard_ref_4__int__ws_Inf.rds"

# check two of the created tcms
tcm5 = readRDS(paste0(dir_test, "/tcm__standard_ref_1__ext__ws_5.rds"))
attr(tcm5, "term_coverage_rate_tcm_dtm")
# [1] 0.75

# compare with the share of unique doc tokens that also occur in doc_ext
tokens_doc = unique(unlist(strsplit(doc, " ")))
tokens_doc_ext = unique(unlist(strsplit(doc_ext, " ")))
sum(tokens_doc_ext %in% tokens_doc) / length(tokens_doc)

tcm5
# 3 x 3 sparse Matrix of class "dgTMatrix"
#   B A x
# B 1 . 5
# A . 2 2
# x . . 8

tcm110 = readRDS(paste0(dir_test, "/tcm__standard_ref_3__ext__ws_110.rds"))
attr(tcm110, "term_coverage_rate_tcm_dtm")
# [1] 0.75
tcm110
# 3 x 3 sparse Matrix of class "dgTMatrix"
#   B A x
# B 1 1 7
# A . 2 2
# x . . 8

tcm110-tcm5
# 3 x 3 sparse Matrix of class "dgCMatrix"
#   B A x
# B 0 1 2
# A . 0 0
# x . . 0

# compare diagonal to input doc statistics
table(unlist(strsplit(doc_ext, " ")))
# A B x
# 2 1 8

# for logic of counting co-occurrence check:
# https://github.com/dselivanov/text2vec/issues/253

# delete test directory
unlink(dir_test, recursive = TRUE)
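The file names listed in the example appear to encode the settings of each tcm. The following base R sketch splits them into their parts to get an overview of the saved files. It assumes that the parts are separated by "__", that "ext" and "int" mark tcms built from the external and internal corpus respectively, and that "ws" stands for the window size; these readings are inferred from the example output and are not confirmed by the documentation. The object names (files, parts, info) are illustrative.

```r
# File names as returned by list.files(dir_test) in the example
files <- c("tcm__standard_ref_1__ext__ws_5.rds",
           "tcm__standard_ref_2__ext__ws_10.rds",
           "tcm__standard_ref_3__ext__ws_110.rds",
           "tcm__standard_ref_4__int__ws_Inf.rds")

# Drop the extension and split on the ASSUMED "__" separator
parts <- strsplit(sub("\\.rds$", "", files), "__", fixed = TRUE)

# Collect the parts in a data.frame; the column meanings are assumptions
info <- data.frame(
  file        = files,
  id          = vapply(parts, `[`, character(1), 2),
  corpus      = vapply(parts, `[`, character(1), 3),
  window_size = as.numeric(sub("^ws_", "", vapply(parts, `[`, character(1), 4))),
  stringsAsFactors = FALSE
)

info[, c("id", "corpus", "window_size")]
```

With the real files on disk, files could instead be taken from list.files(dir_save), and each row's file could then be loaded with readRDS as in the example.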