BTM
In oolong: Create Validation Tests for Automated Content Analysis

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
  )
set.seed(46709394)

The package BTM by Jan Wijffels et al. finds "topics in collections of short text". Compared to other topic model packages, BTM requires a special data format for training. Oolong has no problem generating word intrusion tests with BTM. However, that special data format can make creation of topic intrusion tests very tricky.

This guide provides our recommendations on how to use BTM, so that the model can be used for generating topic intrusion tests.

Requirement #1: Keep your quanteda corpus

It is because every document has a unique document id.

require(BTM)
require(quanteda)
require(oolong)
trump_corpus <- corpus(trump2k)

And then you can do regular text cleaning, stemming procedure with quanteda. Instead of making the product a DFM object, make it a token object. You may read this issue by Benoit et al.

tokens(trump_corpus, remove_punct = TRUE, remove_numbers = TRUE, remove_symbols = TRUE, split_hyphens = TRUE, remove_url = TRUE) %>% tokens_tolower() %>% tokens_remove(stopwords("en")) %>% tokens_remove("@*")  -> trump_toks

Requirement #2: Keep your data frame

Use this function to convert the token object to a data frame.

as.data.frame.tokens <- function(x) {
  data.frame(
    doc_id = rep(names(x), lengths(x)),
    tokens = unlist(x, use.names = FALSE)
  )
}

trump_dat <- as.data.frame.tokens(trump_toks)

Train a BTM model

trump_btm <- BTM(trump_dat, k = 8, iter = 500, trace = 10)

Pecularities of BTM

This is how you should generate $\theta_{t}$ . However, there are many NaN and there are only 1994 rows (trump2k has 2000 tweets) due to empty documents.

theta <- predict(trump_btm, newdata = trump_dat)
dim(theta)

setdiff(docid(trump_corpus), row.names(theta))

trump_corpus[604]

Also, the row order is messed up.

head(row.names(theta), 100)

Oolong's support for BTM

Oolong has no problem generating word intrusion test for BTM like you do with other topic models.

oolong <- create_oolong(trump_btm)
oolong

For generating topic intrusion tests, however, you must provide the data frame you used for training (in this case trump_dat). Your input_corpus must be a quanteda corpus too.

oolong <- create_oolong(trump_btm, trump_corpus, btm_dataframe = trump_dat)
oolong

btm_dataframe must not be NULL.

oolong <- create_oolong(trump_btm, trump_corpus)

input_corpus must be a quanteda corpus.

oolong <- create_oolong(trump_btm, trump2k, btm_dataframe = trump_dat)

Any scripts or data that you put into this service are public.

oolong documentation built on Aug. 25, 2023, 5:16 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

oolong
Create Validation Tests for Automated Content Analysis

BTM
In oolong: Create Validation Tests for Automated Content Analysis

Requirement #1: Keep your quanteda corpus

Requirement #2: Keep your data frame

Pecularities of BTM

Oolong's support for BTM

Try the oolong package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

oolong Create Validation Tests for Automated Content Analysis

BTM In oolong: Create Validation Tests for Automated Content Analysis

Requirement #1: Keep your quanteda corpus

Requirement #2: Keep your data frame

Pecularities of BTM

Oolong's support for BTM

Try the oolong package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

oolong
Create Validation Tests for Automated Content Analysis

BTM
In oolong: Create Validation Tests for Automated Content Analysis