
In this lab we're going to investigate a different debate on abortion law, this time in the United States. The debate occurred on October 21, 2003, as the last stage of the passage of the 'Partial-Birth Abortion Ban Act of 2003', which banned specific forms of abortion.

There are 23 speakers and we will be interested in simultaneously discovering the main topics of the debate and decomposing each speaker's (and party's) patterns of emphasis over those topics. We'll use a representation of the debate that bundles all speaker contributions together, scraped from the plain-ish text versions offered by congress.gov. The process of extracting this information is detailed in another lab.

However, we will usually need more than 23 documents to get a good topic model going. Accordingly, we'll treat a smaller unit as our document: whatever the Congressional Record thinks should be a paragraph.

Here are the raw materials

library(iqmr)

data("df_pba_debate_by_speaker")
names(df_pba_debate_by_speaker)

We'll make a corpus from them, using the speaker's name as a document identifier, and we'll also explicitly add the speaker names as a document variable because we'll want to refer to them later. The corpus machinery here comes from the quanteda package, so we load that first.

library(quanteda) # provides corpus() and the other text functions below

corp <- corpus(df_pba_debate_by_speaker, 
               text_field = "contributions", 
               docid_field = "speaker")
docvars(corp, "speaker") <- df_pba_debate_by_speaker$speaker # add speaker field
summary(corp)

Now we'll do the switch to paragraphs

para_corp <- corpus_reshape(corp, to = "paragraphs") 
head(summary(para_corp))

Notice that the texts now have a numerical identifier added and the other document variables have been duplicated appropriately.
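
For instance, a quick look at the new document names (just a sanity check, not part of the original pipeline):

head(docnames(para_corp))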

Although whole contributions are probably too big, paragraphs ironically run the risk of being too small (or even disappearing altogether if the paragraph-spotting algorithm is imperfect). Let's check:

table(ntoken(para_corp)) # 15 documents with, err, no words

So we'll trim these tiny documents. Then we'll make a document-term matrix for the topic model to work with:

para_corp <- corpus_subset(para_corp, ntoken(para_corp) > 2)

para_dfm <- dfm(para_corp,
  remove_punct = TRUE,
  remove = stopwords(), 
  remove_numbers = TRUE)

To start with, we'll fit a 10-topic model using the stm package.

library(stm)

mod <- stm(para_dfm, K = 10, seed = 12345)

Don't forget to set the seed so others can replicate your work.

Now let's take a look at the topic word associations we've estimated.

labelTopics(mod)

For putting in online appendices and showing to your co-authors, a plot is often more compelling

plot(mod, type = "labels", labeltype = "prob") # or frex, lift, score

You may need to pop out the window to make this big enough to view.
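
If you'd also like a quick overview of how prevalent each topic is overall, the same plotting function offers a 'summary' type:

plot(mod, type = "summary")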

To get a better sense of a topic, it's sometimes helpful to look at a sample of documents in which a lot of the tokens were assigned to that topic. Here we find documents (for us, paragraphs) that contain a lot of topic 1.

findThoughts(mod, texts = texts(para_corp), topics = 1)

This should be mostly about issues of constitutionality. Try a few others (but be aware that topic 8 is deliberately gruesome).
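
For example, to pull out the top two paragraphs for another topic (topic 3 here is an arbitrary choice), use the n argument:

findThoughts(mod, texts = texts(para_corp), topics = 3, n = 2)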

Model checking

We would be remiss not to run some basic model diagnostics. Here are three, conveniently built into the package. Outside the lab, we'd spend a lot more time on this part!

First, we'd prefer it if words were not exclusively generated by one topic (we're trying to abstract up from the words, after all).

checkBeta(mod)$problemWords # we'll look at the 'problemWords' field

On the other hand, we'd like a certain amount of exclusivity to the topics.
Exclusivity here means how much each topic has its own vocabulary, not used by the other topics.

dotchart(exclusivity(mod), labels = 1:10)

We'd also like the topics to be coherent. If the topics are in fact semantically coherent, one thing we'd expect is that the words a topic generates should often co-occur in the same document. One measure of this is:

cohere <- semanticCoherence(mod, para_dfm)
dotchart(cohere, labels = 1:10)

and it looks like we might want to check topics 2 and 7. Typically, semantically 'incoherent' vocabulary consists of stop words (which we removed) and procedural vocabulary. Here it seems these topics simply contain more general terms from the debate that everyone is happy to use.
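
To take a closer look at those two topics' vocabularies, labelTopics accepts a topics argument:

labelTopics(mod, topics = c(2, 7))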

Predicting topic prevalences

We may reasonably expect that different parties and speakers would prefer to emphasise different sets of topics. We have two options for connecting speaker and party identifiers to topics.

First, we can manually extract the topic proportions from the model and use parts of them as a dependent variable. The topic proportions for each document are inside the model object

head(mod$theta)
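
Each row of theta is one document's estimated distribution over the 10 topics, so rows should sum to one. A quick sanity check:

rowSums(head(mod$theta))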

We can then tie these to the document variables and get a high level view of who talks about what. Who tends to use topic 1?

df <- data.frame(topic1 = mod$theta[,1], docvars(para_dfm))
head(df)

Fit a simple random effects model in which speakers are draws from a population of legislators:

library(lme4)
library(lattice)

lmer_topic1 <- lmer(topic1 ~ (1 | speaker), data = df)
dotplot(ranef(lmer_topic1, condVar = TRUE))

Apparently Cantwell and Feinstein are the topic 1 users here, possibly also Mikulski and Nelson, but we're fairly unsure about that.
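
If you'd rather rank the speakers than eyeball the plot, you can extract the random effects and sort them directly (re is just an illustrative name):

re <- ranef(lmer_topic1)$speaker # one estimated intercept per speaker
re[order(re[["(Intercept)"]], decreasing = TRUE), , drop = FALSE]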

Let STM do both steps

We can also fit the topic model taking into account that we think we know what factors might predict different topic prevalence. This time, let's look at how party relates to topic prevalence.

We need to add two bits of information to the model specification: the document variables we think are predictive, and the functional form of the relationship between them and the topics. The document variables come in as a data argument and the functional form is specified as the right-hand side of a formula. To say that party affiliation predicts topic prevalence we say prevalence = ~ party, and fit the model as before:

mod2 <- stm(para_dfm, K = 10, 
           prevalence = ~ party, data = docvars(para_dfm),
           seed = 12345)

If we are interested in topics (constitutional) 1 and (gruesome) 8, we can focus in on their relationship to party like this:

gg <- estimateEffect(c(1,8) ~ party, mod2, docvars(para_dfm))
summary(gg)

Note that the coefficients here are reported relative to a baseline category (the first level of the party variable). By changing the topic numbers in the 'dependent variable' part of the formula in the first argument we can look at as many of the topics as we like.
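
For example, to estimate party effects for all ten topics at once (gg_all is just an illustrative name):

gg_all <- estimateEffect(1:10 ~ party, mod2, docvars(para_dfm))
summary(gg_all)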

As always, a plot is nicer

plot(gg, "party", labeltype = "prob", model = mod2)

From this we can see that Republican speakers use topic 1 somewhat less than Democrats, but topic 8 much more.

If you'd like to know whether this is a general Republican thing or just a quirk of one or two speakers, adjust the model to use speaker as the predictor and repeat the steps above.
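
A sketch of that adjustment (mod3 and gg_speaker are illustrative names; with only 23 speakers and short documents, expect noisy estimates):

mod3 <- stm(para_dfm, K = 10,
            prevalence = ~ speaker, data = docvars(para_dfm),
            seed = 12345)
gg_speaker <- estimateEffect(c(1, 8) ~ speaker, mod3, docvars(para_dfm))
summary(gg_speaker)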

In general, you can use any specification you like in the formula, so you're not restricted to single covariates as we used here. In particular, it's often nice to 'control' for time in a smooth fashion. For example, if you have a year variable and enough years, you might use party + s(year) as the formula, which would allow topic prevalences to move smoothly over time and the party 'effect' to be estimated against this background.
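
Purely as a hypothetical (this debate took place on a single day, so there is no year variable in our data), such a model would look like:

# hypothetical: assumes the docvars contained a numeric 'year' variable
# mod_time <- stm(para_dfm, K = 10,
#                 prevalence = ~ party + s(year),
#                 data = docvars(para_dfm), seed = 12345)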

Note also that formulae need to know about nesting in the data, so party/speaker would be the way to express that each speaker belongs to exactly one party.
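
A sketch of that nested specification with our data (mod_nested is an illustrative name, not part of the original lab):

mod_nested <- stm(para_dfm, K = 10,
                  prevalence = ~ party/speaker,
                  data = docvars(para_dfm), seed = 12345)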


