word_weights: Process large topic word weights matrices


Description

The word weights matrix (weights of words for topics) can become very large when a model has many topics and a sizeable vocabulary. The functions mallet_save_word_weights and mallet_load_word_weights handle this scenario by writing the data to disk as a sparse matrix and loading it back into the R session. To use these functions, the model must be of the ParallelTopicModel class; an RTopicModel will not work.

Usage


Arguments

filename

A file with word weights, as written by mallet_save_word_weights.

model

A topic model (class jobjRef).

destfile

Length-one character vector, the filename of the output file.

Details

The function mallet_save_word_weights writes the word weights to a file (argument destfile) in a format that can be read back as a sparse matrix. Internally, it uses the printTopicWordWeights method of the ParallelTopicModel class. The (parsed) content of the file is equivalent to the matrix that can be obtained directly from the class using the getTopicWords(FALSE, TRUE) method. Thus, values are not normalised, but smoothed (i.e. the coefficient beta is added to the values).
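
To illustrate what "not normalised, but smoothed" means, here is a minimal base R sketch. The count matrix and the value of beta are made up for illustration and are not taken from any fitted model; the sketch only assumes that smoothing adds beta to every cell and that normalising divides each topic row by its sum.

```r
# Hypothetical toy counts: 2 topics (rows) x 4 words (columns).
counts <- matrix(
  c(3, 0, 1, 0,
    0, 2, 0, 5),
  nrow = 2, byrow = TRUE,
  dimnames = list(c("1", "2"), c("a", "b", "c", "d"))
)
beta <- 0.01  # assumed smoothing coefficient, chosen for illustration

# Smoothed but not normalised: beta is added to every cell, so words
# that never occur in a topic still get a small positive weight.
smoothed <- counts + beta

# Normalising afterwards turns each topic row into a probability
# distribution over words (each row sums to 1).
normalised <- smoothed / rowSums(smoothed)
```

Because normalisation divides a whole row by one positive constant, the ranking of words within a topic is the same in both matrices; only the scale differs.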

Examples

## Not run: 
polmineR::use("polmineR")
speeches <- polmineR::as.speeches("GERMAPARLMINI", s_attribute_name = "speaker")

library(rJava)
.jinit()
.jaddClassPath("/opt/mallet-2.0.8/class") # after .jinit(), not before
.jaddClassPath("/opt/mallet-2.0.8/lib/mallet-deps.jar")
instance_list <- topicanalysis::mallet_make_instance_list(speeches)
instancefile <- mallet_instance_list_store(instance_list)

lda <- mallet::MalletLDA(num.topics = 20)
lda$loadDocuments(instance_list)
lda$setAlphaOptimization(20, 50)
lda$train(100)

# This mirrors the call used internally by 'as_LDA()'. The difference
# is that the $getTopicWords() method is called with FALSE
# (argument 'normalized') and TRUE (argument 'smoothed').
beta_1 <- rJava::.jevalArray(lda$getTopicWords(FALSE, TRUE), simplify = TRUE) 
alphabet <- strsplit(lda$getAlphabet()$toString(), "\n")[[1]]
colnames(beta_1) <- alphabet
beta_1 <- beta_1[, alphabet[order(alphabet)] ]
rownames(beta_1) <- as.character(1:nrow(beta_1))

# This approach uses a (temporary) file written to disk. The
# advantage is that the data is passed to R as a sparse matrix
# rather than as a dense one.
fname <- mallet_save_word_weights(lda)
word_weights <- mallet_load_word_weights(fname)
beta_2 <- t(as.matrix(word_weights))

# Demonstrate the equivalence of the two approaches
identical(rownames(beta_1), rownames(beta_2))
identical(colnames(beta_1), colnames(beta_2))
identical(apply(beta_1, 1, order), apply(beta_2, 1, order))
identical(beta_1, beta_2)

## End(Not run)

PolMine/polmineR.topics documentation built on March 6, 2020, 6:03 p.m.