mallet_get_sparse_word_weights_matrix | R Documentation |
The beta matrix reporting word weights for topics can grow extremely large. The straightforward ways to get the matrix can be slow and very memory-inefficient. This function uses the 'topicXMLReport()' method of the 'ParallelTopicModel' class, which is the most memory-efficient solution we know of at this stage. The trick is that weights are only reported for the top N words, so the data can be processed as a sparse matrix, which keeps memory use low. See the examples for a demonstration that the result is equivalent to that of the 'getTopicWords()' method. Note, however, that the matrix is neither normalized nor smoothed nor logarithmized.
mallet_get_sparse_word_weights_matrix(x, n_topics = 50L, destfile = tempfile())
x | A 'ParallelTopicModel' class object |
n_topics | A length-one 'integer' vector, the number of topics. |
destfile | Length-one 'character' vector, the filename of the output file. |
## Not run: 
# x is assumed to be any ParallelTopicModel class object
m <- mallet_get_sparse_word_weights_matrix(x)
beta_sparse <- as.matrix(m)
beta_dense <- rJava::.jevalArray(x$getTopicWords(FALSE, FALSE), simplify = TRUE)
rownames(beta_dense) <- as.character(1:nrow(beta_dense))

identical(max(beta_sparse[1,]), as.integer(max(beta_dense[1,])))
identical(
  unname(head(beta_sparse[1,][order(beta_sparse[1,], decreasing = TRUE)], 5)),
  as.integer(head(beta_dense[1,][order(beta_dense[1,], decreasing = TRUE)], 5))
)

.fn <- function(x) as.integer(unname(head(x[order(x, decreasing = TRUE)], 50)))
identical(apply(beta_sparse, 1, .fn), apply(beta_dense, 1, .fn))
## End(Not run)
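The description's point that weights for only the top N words fit naturally into a sparse matrix can be illustrated with a small, self-contained sketch. It is not the package's implementation: the toy vocabulary, the 'top_words' table and the use of 'Matrix::sparseMatrix()' are assumptions made for illustration, and the class actually returned by 'mallet_get_sparse_word_weights_matrix()' may differ.

```r
library(Matrix)

# Hypothetical toy data (not from any real model):
# 3 topics, a 6-word vocabulary, and the top 2 words reported per topic.
vocab <- c("apple", "bank", "court", "judge", "money", "pear")
top_words <- data.frame(
  topic  = c(1L, 1L, 2L, 2L, 3L, 3L),
  word   = c("apple", "pear", "bank", "money", "court", "judge"),
  weight = c(12, 7, 20, 15, 9, 4)
)

# Only the reported (topic, word, weight) triplets are stored;
# every other cell is an implicit zero.
beta_sparse <- sparseMatrix(
  i = top_words$topic,
  j = match(top_words$word, vocab),
  x = top_words$weight,
  dims = c(3L, length(vocab)),
  dimnames = list(as.character(1:3), vocab)
)

# Expanding to a dense matrix materializes all topic-by-word cells.
beta_dense <- as.matrix(beta_sparse)

nnzero(beta_sparse)  # number of explicitly stored weights
length(beta_dense)   # number of cells in the dense matrix
```

With a realistic vocabulary of tens of thousands of words and N in the low hundreds, the stored triplets are a small fraction of the dense matrix, which is where the memory saving comes from.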