warp_lda_vary_n_parallel: Fit Warp LDA models for varying n in parallel
In manuelbickel/textility: Utility functions for text mining

View source: R/warp_lda_vary_n_parallel.R

warp_lda_vary_n_parallel

R Documentation

Fit Warp LDA models for varying n in parallel

Description

Fits LDA models with text2vec in parallel, via the foreach package, for a set of candidate numbers of topics. The function is not yet failsafe: it is straightforward parallel fitting without safety nets, e.g., existing files in the model directory may be overwritten.
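
Because the function performs no checks of its own, a caller may want to validate the target directory before fitting. The following is a minimal sketch, not part of the package: `preflight_model_dir()` is a hypothetical helper, and it assumes the output files follow the "nX_Warp_LDA_model" naming described under Value.

```r
# Hypothetical helper (not part of textility): check a model directory
# before calling warp_lda_vary_n_parallel().
preflight_model_dir <- function(model_dir, n_topics) {
  # The documented requirement: the path has to end with "/".
  if (!endsWith(model_dir, "/")) {
    stop("model_dir must end with '/': ", model_dir)
  }
  if (!dir.exists(model_dir)) {
    stop("model_dir does not exist: ", model_dir)
  }
  # Assumed file naming: "n<topics>_Warp_LDA_model_<elapsed time>.rds".
  existing <- list.files(model_dir,
                         pattern = "^n[0-9]+_Warp_LDA_model.*\\.rds$")
  existing_n <- as.integer(sub("^n([0-9]+)_.*$", "\\1", existing))
  # Refuse to continue if a file for any requested n is already present.
  clash <- intersect(n_topics, existing_n)
  if (length(clash) > 0) {
    stop("Model files already exist for n_topics = ",
         paste(clash, collapse = ", "))
  }
  invisible(TRUE)
}
```

Calling `preflight_model_dir(model_dir, n_topics)` immediately before the fitting call stops with an error instead of silently overwriting earlier results.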

Usage

warp_lda_vary_n_parallel(dtm, n_topics, n_cores, model_dir,
  doc_topic_prior = sapply(as.character(n_topics), function(x) NA_real_,
  USE.NAMES = T), topic_word_prior = sapply(as.character(n_topics),
  function(x) NA_real_, USE.NAMES = T), convtol = 0.001, n_iter = 2000,
  seed = 42)

Arguments

`dtm`	The document term matrix to be used in LDA.
`n_topics`	`Integer` vector containing the candidate number of topics, e.g., `seq(5, 500, 5)`.
`n_cores`	Number of cores to be used. To keep the computer usable during model fitting, leave one processor spare via `parallel::detectCores() - 1`.
`model_dir`	The directory in which to save the fitted models. The path must end with "/", e.g., "~/mydir/" (currently no check is implemented).
`doc_topic_prior`	A named vector of prior parameters, passed to `doc_topic_prior` in `text2vec::LDA`. The default is `sapply(as.character(n_topics), function(x) NA_real_, USE.NAMES = T)`; any entry that is `NA_real_` is replaced by `50/n` in the respective parallel run, where n is that run's value from `n_topics`.
`topic_word_prior`	A named vector of prior parameters, passed to `topic_word_prior` in `text2vec::LDA`. The default is `sapply(as.character(n_topics), function(x) NA_real_, USE.NAMES = T)`; any entry that is `NA_real_` is replaced by `1/n` in the respective parallel run, where n is that run's value from `n_topics`.
`convtol`	The convergence tolerance parameter passed to `text2vec::LDA`. By default `1e-3`.
`n_iter`	The number of iterations parameter passed to `text2vec::LDA`. By default `2000`.
`seed`	The seed parameter to ensure reproducibility. By default `42`.
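
The default handling of the two prior arguments can be illustrated in base R. This is a sketch of the documented behaviour, not the package's internal code; `fill_prior()` is a hypothetical helper introduced only for this illustration.

```r
n_topics <- c(3, 5, 7)

# Default argument value: a named vector of NA_real_, one entry per candidate n.
doc_topic_prior <- sapply(as.character(n_topics), function(x) NA_real_,
                          USE.NAMES = TRUE)

# Override the prior for n = 5 only; the other entries stay NA.
doc_topic_prior["5"] <- 0.1

# Hypothetical helper: replace NA entries by numerator / n,
# where n is read from the names of the vector.
fill_prior <- function(prior, numerator) {
  n <- as.numeric(names(prior))
  ifelse(is.na(prior), numerator / n, prior)
}

# Documented defaults: 50 / n for doc_topic_prior, 1 / n for topic_word_prior.
fill_prior(doc_topic_prior, numerator = 50)
```

Passing a partially filled vector like this lets individual runs use a custom prior while the remaining runs fall back to the defaults.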

Value

For each value in n_topics a model is fitted and stored, together with the resulting doc_topic_distr, in a list of the form list(model = ..., doc_topic_distr = ...). Each list is saved via saveRDS to a file in model_dir. The filenames include n and the elapsed time for fitting the individual model, e.g., "n5_Warp_LDA_model_0h_1min.rds". The initial part "nX_Warp_LDA_model" in particular may be used for programmatically accessing model files.
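
Since the elapsed-time suffix is not known in advance, a model file is most easily located by its prefix. The sketch below is self-contained: it writes placeholder files into a temporary directory to stand in for real output, and `find_model_file()` is a hypothetical helper, not a package function.

```r
# Placeholder output directory with dummy files mimicking the naming scheme.
model_dir <- paste0(tempfile("modeldir"), "/")
dir.create(model_dir)
for (f in c("n5_Warp_LDA_model_0h_1min.rds",
            "n50_Warp_LDA_model_0h_9min.rds")) {
  saveRDS(list(model = NULL, doc_topic_distr = NULL), paste0(model_dir, f))
}

# Hypothetical helper: full path of the model file for a given n.
# The "_" after the number keeps n = 5 from also matching n = 50.
find_model_file <- function(model_dir, n) {
  pattern <- paste0("^n", n, "_Warp_LDA_model")
  list.files(model_dir, pattern = pattern, full.names = TRUE)
}

fit <- readRDS(find_model_file(model_dir, 5))
names(fit)
```

Anchoring the pattern at the start of the filename and including the underscore avoids picking up files for other topic counts that share leading digits.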

Examples


# data part of the example is copied from text2vec::LatentDirichletAllocation
library(text2vec)
data("movie_review")
N = 500
tokens = word_tokenizer(tolower(movie_review$review[1:N]))
it = itoken(tokens, ids = movie_review$id[1:N])
v = create_vocabulary(it)
v = prune_vocabulary(v, term_count_min = 5, doc_proportion_max = 0.2)
dtm = create_dtm(it, vocab_vectorizer(v))
# fit one model without parallel processing for comparison
# Note that the seed and the other model parameters have to match
# those used in the parallel fitting for the results to be comparable
set.seed(42)
lda_model = LDA$new(n_topics = 5)
doc_topic_distr = lda_model$fit_transform(dtm, n_iter = 20)

modeldir = paste0(getwd(), "/modeldir/")
if (dir.exists(modeldir)) {
  stop("Standard directory used for this example already exists. Please change.")
} else {
  dir.create(modeldir)
}
library(doParallel)
# you might want to check the work load of your processors with your favorite monitor...
warp_lda_vary_n_parallel( dtm = dtm
                          , n_topics = c(3,5,7)
                          , n_iter = 20
                          , n_cores = detectCores()-1
                          , model_dir = modeldir
                          , seed = 42)
list.files(modeldir)
# [1] "n3_Warp_LDA_model_0h_0min.rds"
# [2] "n5_Warp_LDA_model_0h_0min.rds"
# [3] "n7_Warp_LDA_model_0h_0min.rds"
# we compare the model with 5 topics
lda_model_from_parallel = readRDS(list.files(modeldir, full.names = TRUE)[2])
names(lda_model_from_parallel)
# [1] "model"           "doc_topic_distr"
all.equal(doc_topic_distr, lda_model_from_parallel$doc_topic_distr)
# [1] TRUE

# delete directory
unlink(paste0(getwd(), "/modeldir"), recursive=TRUE)

manuelbickel/textility documentation built on Nov. 25, 2022, 9:07 p.m.