optimal_k_comparison: Optimal Topic Model k Comparison

Description

Calculate the optimal number of topics (k) for an LDA topic model for each dfm, selected by held-out perplexity under cross-validation.
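
A minimal sketch of the selection criterion (not the package's internal code): for a single dfm and a single train/test split, fit an LDA at each candidate k on the training documents and keep the k with the lowest held-out perplexity. The helper name select_k, and the use of quanteda::convert() together with the topicmodels package, are illustrative assumptions.

library(quanteda)
library(topicmodels)

# Hypothetical helper: pick k by held-out perplexity for one dfm and one split.
select_k <- function(dfm_object, train_ind, test_ind, topics = c(2, 5, 10, 20)) {
    # convert the quanteda dfm to the document-term matrix format topicmodels expects
    dtm <- convert(dfm_object, to = "topicmodels")
    # fit an LDA at each candidate k on the training rows, score on the test rows
    perplexities <- sapply(topics, function(k) {
        fit <- LDA(dtm[train_ind, ], k = k)
        perplexity(fit, newdata = dtm[test_ind, ])
    })
    # return the candidate k with the lowest held-out perplexity
    topics[which.min(perplexities)]
}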

Usage

optimal_k_comparison(cross_validation_train_document_indicies,
  cross_validation_test_document_indicies, dfm_object_list = NULL,
  topics = c(2, 5, 10, 20, 30, 40, 50, 60, 70, 80, 90, 100), names = NULL,
  parallel = FALSE, cores = 1, intermediate_file_directory = NULL,
  intermediate_file_names = NULL)

Arguments

cross_validation_train_document_indicies

A list of numeric vectors, one per cross-validation split, where each vector contains the numeric indices of the documents to be used for training in that split.

cross_validation_test_document_indicies

A list of numeric vectors, one per cross-validation split, where each vector contains the numeric indices of the documents to be used for testing in that split.

dfm_object_list

An optional list of quanteda dfm() objects. If none is provided, intermediate files (see intermediate_file_directory and intermediate_file_names) are used instead.

topics

A numeric vector containing the numbers of topics to search over. Defaults to 'c(2,5,10,20,30,40,50,60,70,80,90,100)'.

names

Optional names for each dfm, to make downstream interpretation easier. Defaults to NULL.

parallel

Logical indicating whether model fitting should be performed in parallel. Defaults to FALSE.

cores

The number of cores to use when parallel = TRUE. Defaults to 1; it can be set to any value up to the number of cores available on the machine (a safe choice is sketched after the Arguments section).

intermediate_file_directory

Optional directory containing Rdata files for each of the factorial preprocessing combinations.

intermediate_file_names

Optional vector of file names for the intermediate Rdata files, one per preprocessing combination.
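
A common way to choose a safe value for cores (parallel::detectCores() is from the base R parallel package; leaving one core free is just a convention, not a package requirement):

# leave one core free for the rest of the system
n_cores <- max(1, parallel::detectCores() - 1)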

Value

A numeric vector containing the optimal number of topics (k) for each dfm.
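
If names were supplied, the result can be laid out against the preprocessing labels for inspection (a usage sketch that assumes the returned vector carries those names, with optimal_k as in the Examples below):

# pair each preprocessing specification with its selected number of topics
data.frame(preprocessing = names(optimal_k), k = optimal_k)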

Examples

## Not run: 
set.seed(12345)
# load the package
library(preText)
# load in the data
data("UK_Manifestos")
# preprocess data
preprocessed_documents <- factorial_preprocessing(
    UK_Manifestos,
    use_ngrams = TRUE,
    infrequent_term_threshold = 0.02,
    verbose = TRUE)
cross_validation_splits <- 10
# create 10 test/train splits
train_inds <- vector(mode = "list", length = cross_validation_splits)
test_inds <- vector(mode = "list", length = cross_validation_splits)
# sample CV indices
for (i in 1:cross_validation_splits) {
    # hold out roughly 20 percent of documents for testing
    test <- sample(seq_along(UK_Manifestos),
                   size = round(length(UK_Manifestos) / 5),
                   replace = FALSE)
    # train on every document not in the test split
    train <- setdiff(seq_along(UK_Manifestos), test)
    train_inds[[i]] <- train
    test_inds[[i]] <- test
}
# get the optimal number of topics (this will take a very long time):
optimal_k <- optimal_k_comparison(
     train_inds,
     test_inds,
     preprocessed_documents$dfm_list,
     topics = c(25,50,75,100,125,150,175,200),
     names  = preprocessed_documents$labels)

## End(Not run)
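
Because an LDA is fit for every combination of dfm, cross-validation split, and candidate k, the search can also be distributed across cores. A sketch of the parallel variant of the call above (the value of 4 for cores is illustrative):

optimal_k <- optimal_k_comparison(
     train_inds,
     test_inds,
     preprocessed_documents$dfm_list,
     topics = c(25, 50, 75, 100, 125, 150, 175, 200),
     names  = preprocessed_documents$labels,
     parallel = TRUE,
     cores = 4)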
