View source: R/warp_lda_vary_n_parallel.R
warp_lda_vary_n_parallel — R Documentation
Fits LDA models with text2vec in parallel via the foreach package
for a set of candidate numbers of topics.
The function is not yet failsafe: it performs straightforward parallel fitting without safety nets, e.g., existing files in the model directory may be overwritten.
warp_lda_vary_n_parallel(
  dtm,
  n_topics,
  n_cores,
  model_dir,
  doc_topic_prior = sapply(as.character(n_topics), function(x) NA_real_, USE.NAMES = TRUE),
  topic_word_prior = sapply(as.character(n_topics), function(x) NA_real_, USE.NAMES = TRUE),
  convtol = 0.001,
  n_iter = 2000,
  seed = 42
)
dtm
  The document term matrix to be used in LDA.

n_topics
  A vector with the candidate numbers of topics; one model is fitted per entry.

n_cores
  Number of cores to be used. If the computer should remain usable during model fitting, leave one processor spare, e.g., via detectCores() - 1.

model_dir
  The directory to save the fitted models. The directory needs to end with "/", e.g., "~/mydir/" (currently no check is implemented).

doc_topic_prior
  To be passed as a named vector (named by as.character(n_topics)). The prior parameters passed to text2vec's LDA$new().

topic_word_prior
  To be passed as a named vector (named by as.character(n_topics)). The prior parameter passed to text2vec's LDA$new().

convtol
  The convergence tolerance parameter passed to fit_transform() of the text2vec LDA model.

n_iter
  The number of iterations parameter passed to fit_transform() of the text2vec LDA model.

seed
  The seed parameter to ensure reproducibility. By default, seed = 42.
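For illustration, the named prior vectors expected by doc_topic_prior and topic_word_prior can be built with base R, keyed by as.character(n_topics) as in the function's defaults. This is a sketch; the concrete prior values 0.1 and 0.01 are arbitrary placeholders, not recommendations:

    # Build named prior vectors, one entry per candidate number of topics.
    # The names must be the topic numbers as character strings.
    n_topics <- c(3, 5, 7)
    doc_topic_prior  <- sapply(as.character(n_topics), function(x) 0.1,  USE.NAMES = TRUE)
    topic_word_prior <- sapply(as.character(n_topics), function(x) 0.01, USE.NAMES = TRUE)
    names(doc_topic_prior)
    # [1] "3" "5" "7"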
For each entry in n_topics, a model is fitted and put into a list together with the resulting doc_topic_distr, as list(model = ..., doc_topic_distr = ...). Each list is saved via saveRDS in a file in model_dir. The filenames include information on n and the elapsed time for fitting the individual model; they appear, e.g., as "n5_Warp_LDA_model_0h_1min.rds". Especially the initial part "nX_Warp_LDA_model" may be used for programmatically accessing model files.
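As a sketch of such programmatic access, the saved models can be collected by matching the filename prefix; here dummy RDS files in a temporary directory stand in for real fitted models, and the helper names used are illustrative only:

    # Create dummy model files mimicking the "nX_Warp_LDA_model" naming scheme
    model_dir <- file.path(tempdir(), "modeldir_demo")
    dir.create(model_dir, showWarnings = FALSE)
    for (n in c(3, 5, 7)) {
      saveRDS(list(model = NULL, doc_topic_distr = NULL),
              file.path(model_dir, sprintf("n%d_Warp_LDA_model_0h_0min.rds", n)))
    }

    # Collect all model files by prefix and recover n from each filename
    files <- list.files(model_dir, pattern = "^n[0-9]+_Warp_LDA_model", full.names = TRUE)
    n_from_file <- sub("^n([0-9]+)_.*$", "\\1", basename(files))
    models <- setNames(lapply(files, readRDS), n_from_file)
    names(models)
    # [1] "3" "5" "7"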
# data part of the example is copied from text2vec::LatentDirichletAllocation
library(text2vec)
data("movie_review")
N = 500
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it = itoken(tokens, ids = movie_review$id[1:N])
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.2)
dtm = create_dtm(it, vocab_vectorizer(v))

# fit one model without parallel processing for comparison
# Note that seed and other parameters of the model have to be set
# for comparing to the corresponding model from parallel fitting
set.seed(42)
lda_model = LDA$new(n_topics = 5)
doc_topic_distr = lda_model$fit_transform(dtm, n_iter = 20)

modeldir = paste0(getwd(), "/modeldir/")
if (dir.exists(modeldir)) {
  stop("Standard directory used for this example already exists. Please change.")
} else {
  dir.create(modeldir)
}

library(doParallel)
# you might want to check the work load of your processors with your favorite monitor...
warp_lda_vary_n_parallel(
  dtm = dtm
  , n_topics = c(3, 5, 7)
  , n_iter = 20
  , n_cores = detectCores() - 1
  , model_dir = modeldir
  , seed = 42)

list.files(modeldir)
# [1] "n3_Warp_LDA_model_0h_0min.rds"
# [2] "n5_Warp_LDA_model_0h_0min.rds"
# [3] "n7_Warp_LDA_model_0h_0min.rds"

# we compare the model with 5 topics
lda_model_from_parallel = readRDS(list.files(modeldir, full.names = TRUE)[2])
names(lda_model_from_parallel)
# [1] "model" "doc_topic_distr"
all.equal(doc_topic_distr, lda_model_from_parallel$doc_topic_distr)
# [1] TRUE

# delete directory
unlink(paste0(getwd(), "/modeldir"), recursive = TRUE)