as.instance_list
Description

Functionality to support the following workflow (see examples): (a) turn a 'partition_bundle' object into a Mallet instance list, (b) store the resulting 'jobjRef' object, (c) run Mallet topic modelling, and (d) turn the 'ParallelTopicModel' Java object into an 'LDA_Gibbs' object from the package 'topicmodels'.
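A compact sketch of steps (a) to (d), condensed from the Examples below; the function names 'BigTopicModel()', 'as_LDA()' and the argument values are taken from those examples and may differ in your installed version.

library(polmineR)
use("polmineR")

# (a) turn a partition_bundle into a Mallet instance list
speeches <- as.speeches("GERMAPARLMINI", s_attribute_name = "speaker", s_attribute_date = "date")
il <- as.instance_list(speeches)

# (b) store the resulting jobjRef object on disk
fname <- tempfile()
instance_list_save(il, filename = fname)

# (c) run Mallet topic modelling (arguments as in the Examples below)
model <- BigTopicModel(
  instances = il, n_topics = 25, alpha_sum = 5.1, beta = 0.1,
  threads = 1L, iterations = 150L
)
model$estimate()

# (d) turn the ParallelTopicModel Java object into an LDA_Gibbs object
lda <- as_LDA(model)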
Usage

as.instance_list(x, ...)

## S4 method for signature 'partition_bundle'
as.instance_list(x, p_attribute = "word", verbose = TRUE, min_length = 1L, ...)

## S4 method for signature 'DocumentTermMatrix'
as.instance_list(x, verbose = TRUE)

## S4 method for signature 'list'
as.instance_list(x, vocabulary, docnames, verbose = TRUE, progress = TRUE)

## S4 method for signature 'character'
as.instance_list(x, regex = "[\\p{L}]+", tolower = FALSE, stopwords = NULL)

instance_list_save(x, filename = tempfile())

instance_list_load(filename)
Arguments

x: A 'partition_bundle' object.
...: Arguments passed into the 'get_token_stream()' call (e.g. argument 'subset' to apply stopwords).
p_attribute: Length-one 'character' vector, a positional attribute.
verbose: A 'logical' value, whether to be verbose.
min_length: Minimum length of documents after removing stopwords.
vocabulary: A 'character' vector with the vocabulary underlying the input object 'x', in the correct order.
docnames: A 'character' vector with document names. Needs to have the same length as the input 'list' object. If missing, the names of the input 'list' are used as docnames, if present.
progress: A 'logical' value, whether to show a progress bar.
regex: A regular expression (length-one 'character' vector) used by the Mallet Java code to split a 'character' vector into tokens (see the sketch after this list).
tolower: A 'logical' value, whether tokens are lowercased (performed by the Mallet Java code).
stopwords: Either the path of a plain-text file with stopwords (one per line), or a 'character' vector of stopwords.
filename: Where to store the Java object.
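A minimal sketch of the 'character' method, assuming each element of the input vector is treated as one document; the sample texts and the stopword vector are made up for illustration.

docs <- c(
  "Topic models are estimated from token sequences.",
  "Each element of the character vector is one document."
)
il <- as.instance_list(
  docs,
  regex = "[\\p{L}]+",   # tokenizer pattern passed to the Mallet Java code
  tolower = TRUE,        # lowercasing is performed by the Mallet Java code
  stopwords = c("are", "of", "the", "is")
)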
Details

'instance_list_load()' will load a Java 'InstanceList' object that has been saved to disk (e.g. by using the 'instance_list_save()' function). The return value is a 'jobjRef' object.
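A minimal sketch of the save/load round trip, assuming 'il' is an instance list ('jobjRef' object) previously created with 'as.instance_list()'.

fname <- tempfile()
instance_list_save(il, filename = fname)   # serialize the Java InstanceList to disk
il2 <- instance_list_load(fname)           # returns a 'jobjRef' object again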
Author(s)

Andreas Blaette, David Mimno
Examples

# Preparations: Create instance list
if (!mallet_is_installed()) mallet_install()

library(polmineR)
use("polmineR")

speeches <- polmineR::as.speeches(
  "GERMAPARLMINI",
  s_attribute_name = "speaker",
  s_attribute_date = "date"
)
instance_list <- as.instance_list(speeches)

lda <- BigTopicModel(
  instances = instance_list,
  n_topics = 25,
  alpha_sum = 5.1,
  beta = 0.1,
  threads = 1L,
  iterations = 150L
)

destfile <- tempfile()
lda$setSaveSerializedModel(50L, rJava::.jnew("java/lang/String", destfile))
lda$estimate()
lda$write(rJava::.jnew("java/io/File", destfile))

# Load topicmodel and turn it into LDA_Gibbs
lda2 <- mallet_load_topicmodel(destfile)
topicmodels_lda <- as_LDA(lda)

library(polmineR)
use("polmineR")

speeches <- as.speeches(
  "GERMAPARLMINI",
  s_attribute_name = "speaker",
  s_attribute_date = "date"
)
speeches_instance_list <- as.instance_list(speeches, p_attribute = "word")

# Pass argument 'subset' to remove stopwords
terms_to_drop <- tm::stopwords("de")
speeches_instance_list <- as.instance_list(
  speeches,
  p_attribute = "word",
  subset = {!get(p_attribute) %in% bquote(.(terms_to_drop))}
)

data("AssociatedPress", package = "topicmodels")
il <- as.instance_list(AssociatedPress)

use("polmineR", corpus = "GERMAPARLMINI")
vocab <- p_attributes("GERMAPARLMINI", p_attribute = "word")
il <- corpus("GERMAPARLMINI") |>
  as.speeches(s_attribute_name = "speaker", s_attribute_date = "date") |>
  p_attributes(p_attribute = "word", decode = FALSE) |>
  as.instance_list(vocabulary = vocab)

instances <- as.speeches(
  "GERMAPARLMINI",
  s_attribute_name = "speaker",
  s_attribute_date = "date"
) %>%
  get_token_stream(p_attribute = "word", collapse = " ") %>%
  unlist() %>%
  as.instance_list()