as.instance_list: Interface to Mallet topic modelling.
In PolMine/biglda: Fast LDA Topic Modelling for Big Corpora

as.instance_list

R Documentation

Interface to mallet topicmodelling.

Description

Functionality to support the following workflow (see examples): (a) turn a 'partition_bundle' object into a Mallet instance list, (b) store the resulting 'jobjRef' object, (c) run Mallet topic modelling, and (d) turn the 'ParallelTopicModel' Java object into an 'LDA_Gibbs' object from the package topicmodels.

Usage

as.instance_list(x, ...)

## S4 method for signature 'partition_bundle'
as.instance_list(x, p_attribute = "word", verbose = TRUE, min_length = 1L, ...)

## S4 method for signature 'DocumentTermMatrix'
as.instance_list(x, verbose = TRUE)

## S4 method for signature 'list'
as.instance_list(x, vocabulary, docnames, verbose = TRUE, progress = TRUE)

## S4 method for signature 'character'
as.instance_list(x, regex = "[\\p{L}]+", tolower = FALSE, stopwords = NULL)

instance_list_save(x, filename = tempfile())

instance_list_load(filename)

Arguments

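`x`	The object to convert: a 'partition_bundle', 'DocumentTermMatrix', 'list' or 'character' vector, depending on the method (see Usage). For 'instance_list_save()', the instance list ('jobjRef' object) to save.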
`...`	Arguments passed into 'get_token_stream()' call (e.g. argument 'subset' to apply stopwords).
`p_attribute`	Length-one 'character' vector, a positional attribute.
`verbose`	A 'logical' value, whether to be verbose.
`min_length`	Minimum length of documents after removing stopwords.
`vocabulary`	A 'character' vector with the vocabulary underlying input object 'x', in the correct order.
`docnames`	A 'character' vector with document names. Needs to have the same length as the input 'list' object. If missing, the names of the input 'list' are used as docnames, where present.
`progress`	A 'logical' value, whether to show progress bar.
`regex`	A regular expression (length-one 'character' vector) used by Mallet Java code for splitting 'character' vector into tokens.
`tolower`	A 'logical' value, whether to lowercase tokens (performed by Mallet Java code).
`stopwords`	Either a path to a plain text file with stopwords (one per line), or a 'character' vector of stopwords.
`filename`	Path of the file where the Java object is stored (for 'instance_list_load()', the file to load it from).
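
To get a feel for which tokens the default 'regex' value "[\\p{L}]+" would keep, the pattern can be tried in base R. This is only a rough sketch: it assumes that R's PCRE engine and the Java regex engine used by Mallet agree on `\p{L}` for the sample text, it does not call Mallet at all, and the helper name `tokenize_like_regex` is made up for illustration.

```r
# Hypothetical helper (not part of biglda): extract all matches of a
# token regex from each element of a character vector, using PCRE.
tokenize_like_regex <- function(x, regex = "[\\p{L}]+", tolower = FALSE) {
  # Optional lowercasing, mirroring the 'tolower' argument described above
  if (tolower) x <- base::tolower(x)
  # gregexpr() locates every match; regmatches() pulls out the matched text
  regmatches(x, gregexpr(regex, x, perl = TRUE))
}

docs <- c("Topic models, 2023 edition!", "LDA is fast")
tokens <- tokenize_like_regex(docs, tolower = TRUE)
print(tokens)
```

With this pattern, runs of letters become tokens, while digits and punctuation are dropped, so "2023" does not survive in the first document.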

Details

'instance_list_load()' will load a Java 'InstanceList' object that has been saved to disk (e.g. by using the 'instance_list_save()' function). The return value is a 'jobjRef' object.

Author(s)

Andreas Blaette, David Mimno

Examples

 
# Preparations: Create instance list

if (!mallet_is_installed()) mallet_install()
library(polmineR)
use("polmineR")

speeches <- polmineR::as.speeches(
  "GERMAPARLMINI", 
  s_attribute_name = "speaker", 
  s_attribute_date = "date"
)

instance_list <- as.instance_list(speeches)
lda <- BigTopicModel(
  instances = instance_list,
  n_topics = 25,
  alpha_sum = 5.1,
  beta = 0.1,
  threads = 1L,
  iterations = 150L
)

destfile <- tempfile()
lda$setSaveSerializedModel(50L, rJava::.jnew("java/lang/String", destfile))

lda$estimate()
lda$write(rJava::.jnew("java/io/File", destfile))

# Load topicmodel and turn it into LDA_Gibbs

lda2 <- mallet_load_topicmodel(destfile)
topicmodels_lda <- as_LDA(lda2)

# Method for 'partition_bundle': name the positional attribute explicitly
# (reuses the 'speeches' object created above)
speeches_instance_list <- as.instance_list(speeches, p_attribute = "word")

# Pass argument 'subset' to remove stopwords
terms_to_drop <- tm::stopwords("de")
speeches_instance_list <- as.instance_list(
  speeches,
  p_attribute = "word",
  subset = {!get(p_attribute) %in% bquote(.(terms_to_drop))}
)
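
# Method for 'DocumentTermMatrix'
data("AssociatedPress", package = "topicmodels")
il <- as.instance_list(AssociatedPress)

# Method for 'list': requires the vocabulary underlying the input
use("polmineR", corpus = "GERMAPARLMINI")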

vocab <- p_attributes("GERMAPARLMINI", p_attribute = "word")

il <- corpus("GERMAPARLMINI") |>
  as.speeches(s_attribute_name = "speaker", s_attribute_date = "date") |>
  p_attributes(p_attribute = "word", decode = FALSE) |>
  as.instance_list(vocabulary = vocab)

# Method for 'character' (tokenization is done by Mallet, see 'regex')
instances <- as.speeches("GERMAPARLMINI", s_attribute_name = "speaker", s_attribute_date = "date") |>
  get_token_stream(p_attribute = "word", collapse = " ") |>
  unlist() |>
  as.instance_list()

PolMine/biglda documentation built on Feb. 25, 2023, 11:24 p.m.