knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction to quanteda

Building a corpus

Let's start by reading in the data from the abortion debate analyzed by Bara et al.
I've concatenated each speaker's contributions into a single file. (This is certainly not the only way to think about analyzing this data, but it's what Bara et al. did.)

First load the quanteda package and some other useful stuff

library(quanteda) # for general text analysis
library(dplyr)    # for data manipulation

library(iqmr)     # for the data sets used below

then load the pre-assembled corpus of speaker contributions that comes bundled with the iqmr package

data("corpus_bara_speaker")

Corpus objects are somewhat complex, designed for holding very large amounts of text. Here are the basics:

To get a basic summary of the texts and document variables

summary(corpus_bara_speaker)

Those word and sentence counts come from the ntype() (vocabulary size), ntoken() (word count), and nsentence() (sentence count) functions.
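You can also call these directly on the corpus; each returns a named vector with one entry per document. For example, for the first few speakers

ntoken(corpus_bara_speaker)[1:5]    # word counts
ntype(corpus_bara_speaker)[1:5]     # vocabulary sizes
nsentence(corpus_bara_speaker)[1:5] # sentence counts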

To extract the 12th speaker's contribution

corpus_bara_speaker[12]

Notice that this is a named character vector. (corpus_bara_speaker[13:14] would extract the 13th and 14th contributions in the same way.)
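The names here are the document names, which docnames() will list for you

docnames(corpus_bara_speaker)[12:14]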

To get all the texts

texts(corpus_bara_speaker)

but maybe don't, or at least assign the result to something, otherwise it will spill into your console for quite a while.

To pull out one of the docvars, you can index the corpus as if the document variables were its columns. Here are just the speakers' voting decisions

corpus_bara_speaker[,'vote']

(corpus_bara_speaker[['vote']] also works, if you prefer to get a one-column data frame.)
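You can also ask docvars() for a single field, which reads a little more explicitly

docvars(corpus_bara_speaker, "vote")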

To get the lot of them as a data frame

dvars <- docvars(corpus_bara_speaker)

To keep just some of the documents there is corpus_subset(), which uses the document variables to decide which documents should be kept. Here are all the speakers who voted 'no' at the end of the debate and said more than 100 words during it

no_corp <- corpus_subset(corpus_bara_speaker, vote == "no" & ntoken(corpus_bara_speaker) > 100)
no_corp

(Goodbye Mr Mahon)
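A quick count with ndoc() confirms how many documents remain

ndoc(corpus_bara_speaker) # all speakers
ndoc(no_corp)             # 'no' voters who said more than 100 words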

Finally, it's sometimes convenient to be able to switch from thinking in terms of sets of documents to sets of paragraphs, or even sentences.

para_corp <- corpus_reshape(corpus_bara_speaker, to = "paragraphs") # or "sentences"
head(summary(para_corp)) # Just the top few lines

Happily, we can always reverse this process by reshaping back to "documents".
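For example, to check that we get back one document per speaker

doc_corp <- corpus_reshape(para_corp, to = "documents")
ndoc(doc_corp)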

Let's explore a little more by looking for the key terms in play.
One way to do this is to look for collocations. The collocation finder function operates on the tokens of the corpus, so we extract those first

toks <- tokens(corpus_bara_speaker)
colls <- textstat_collocations(toks)
head(colls, 20)

This is disappointingly unsubstantive, but if we work a bit harder we can get better results. First we'll remove the stopwords (leaving gaps where they were).

toks2 <- tokens_remove(toks, stopwords(), padding = TRUE)
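
If you peek at the first document's tokens you can see the empty strings left where the stopwords used to be

head(toks2[[1]], 20)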

Now rerun the function, maintaining the capitalization

coll2 <- textstat_collocations(toks2, tolower = FALSE, size = 2)
head(coll2, 20)

We can also ask for three word collocations

coll3 <- textstat_collocations(toks2, tolower = FALSE, size = 3)
head(coll3, 30)

If we're really serious about collocation hunting, it's probably best to use a dedicated package.

Let's get a bit more confirmatory in our text analysis

Keywords in context

Since this is an abortion debate, let's see the honorable folk talk about mothers and babies. We'll use the 'keyword in context' function, kwic()

kw_mother <- kwic(corpus_bara_speaker, "mother*", window = 10)  # 10 words either side of every instance
head(kw_mother)

KWICs can get quite large, but if you want to see it all

kw_mother

will print the whole thing to your console. (In RStudio, View(kw_mother) will open it in the data viewer instead.)

There is much less talk of babies than of mothers. In this debate, the other major actors are doctors and their professional association, which you can investigate the same way.
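For example, the same move for doctors

kw_doctor <- kwic(corpus_bara_speaker, "doctor*", window = 10)
head(kw_doctor)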

The output of kwic is simply a data frame, so one thing that's often useful is to treat the left and right sides of the kwic as a document (about babies), e.g. like this:

babes <- kwic(corpus_bara_speaker, "babi*", window = 10)
txt <- paste(babes$pre, babes$post, collapse = " ") # make one big string

This constructed document contains, by definition, all the ways the term was used in the corpus, so you can then examine what sorts of words tend to be used around it, e.g. by using a content analysis dictionary.
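Here is a minimal sketch of that idea, using a toy two-category dictionary whose categories and patterns are invented purely for illustration (they are not part of Bara et al.'s scheme)

toy_dict <- dictionary(list(medics = c("doctor*", "medical", "surg*"),
                            legal = c("law*", "legal*", "crim*")))
dfm(txt, dictionary = toy_dict) # category counts for the constructed 'babies' document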

As far as I know this was pioneered by the Yoshikoder software, and bears some obvious similarities to the word-embedding analyses that are currently fashionable.

Now let's get even more confirmatory, and apply the mapping between words and topics described by Bara et al.'s dictionary to this corpus, in order to replicate their first analysis.

Our first step will be to create a document feature matrix (dfm), after which dictionary application is straightforward. But since lots of models require a dfm, we'll linger a little on the steps of the process.

Constructing a document feature matrix

quanteda makes a basic dfm quite straightforward

corpdfm <- dfm(corpus_bara_speaker) # lowercases by default, but not much more
dim(corpdfm)
featnames(corpdfm)[1:40] # really just colnames

But let's remove some things that aren't (currently) of interest to us

corpdfm <- dfm(corpus_bara_speaker, 
               remove = stopwords(), 
               remove_punct = TRUE,
               remove_numbers = TRUE)
dim(corpdfm) # a bit smaller
featnames(corpdfm)[1:40]
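
A glance at the most frequent remaining features, just as a sanity check

topfeatures(corpdfm, 20)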

We could also stem

stemdfm <- dfm(corpus_bara_speaker, 
               remove = stopwords(), 
               remove_punct = TRUE,
               remove_numbers = TRUE, 
               stem = TRUE)
dim(stemdfm) # about 1000 fewer 'word's
featnames(stemdfm)[1:40]

but our dictionary entries aren't stemmed, so let's save stemdfm for later.

For modeling, we'll often want to remove the low frequency and idiosyncratic words

smallcorpdfm <- dfm_trim(corpdfm, min_termfreq = 5, min_docfreq = 5)
dim(smallcorpdfm)

where min_termfreq removes any word that occurs fewer than 5 times in total and min_docfreq removes any word that occurs in fewer than 5 documents, however often it occurs overall. That makes things a fair bit smaller. But again, we don't want to miss dictionary entries.

There's also a wordcloud function for viewing the document feature matrix, but we won't use it because wordclouds are silly.

Answering questions with text

In the debate the Speaker, Mr Horace King, said he would try to give equal time to both sides of the debate. Did it happen this way?

It's hard to know whether the debate was persuasive, since we do not know the speakers' prior beliefs (though we could find these out from their previous debates), so let us assume that there was no substantial persuasion, and also that no speaker spoke particularly slowly.
These assumptions imply that we can proxy speaking time with the number of words said, and side with the final vote.

df <- docvars(corpus_bara_speaker)
df$word_count <- ntoken(corpus_bara_speaker)
df %>%
  group_by(vote) %>%
  summarise(floortime = sum(word_count),
            speakers = length(vote))

Although three quarters of the speakers voted "yes", it seems that floor time was split only about two to one between "yes" and "no" voters.
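If you'd rather see explicit proportions, a quick follow-up computation does it

df %>%
  group_by(vote) %>%
  summarise(floortime = sum(word_count)) %>%
  mutate(share = floortime / sum(floortime))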

Applying a content analysis dictionary

Let's turn to the content analysis dictionary that Bara et al. used. A content analysis dictionary in quanteda terms can be made out of a named list of word vectors, like this:

dictionary(list(medics = c("doctor", "medical", "hospital"),
                mothers = c("mother", "parents")))

or imported in the format of some other content analysis program (quanteda can read dictionaries from Wordstat, Yoshikoder, and LIWC). We'll use a version of the dictionary that's bundled with the iqmr package

data(dict_baraetal)
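
To see which categories it contains

names(dict_baraetal)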

Replicating a little bit of Bara

With the dictionary in hand we can now do category counting rather than word counting

baradfm <- dfm(corpus_bara_speaker, dictionary = dict_baraetal)

Since this output is not absolutely massive

dim(baradfm)

let's force it into a regular R matrix to take a look at the whole thing without being swamped in elements

dictout <- as.matrix(baradfm)
dictout

And recreate some of Bara et al.'s Table 3, repeated below for comparison, finishing with a bar plot.

tab_baraetal <- data.frame(Mean = c(13.59, 7.82, 21.71, 4.61, 32.17, 20.09),
                           SD = c(2.98, 3.36, 4.73, 2.51, 6.94, 4.86),
                           row.names = c("advocacy", "legal", "medical", 
                                         "moral", "procedural", "social" ),
                           stringsAsFactors = FALSE)
knitr::kable(t(tab_baraetal))
prop_emph <- (dictout / rowSums(dictout))
tab <- data.frame(Mean = 100 * apply(prop_emph, 2, mean),
                  SD = 100 * apply(prop_emph, 2, sd),
                  row.names = c("advocacy", "legal", "medical", "moral", "procedural", "social" ),
                  stringsAsFactors = FALSE)
knitr::kable(t(tab))
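
And the promised bar plot. Here is a minimal base R sketch; the layout and labels are just one choice, and it assumes the category ordering used for the table rows above.

ests <- rbind(`Bara et al.` = tab_baraetal$Mean, Replication = tab$Mean)
colnames(ests) <- rownames(tab) # category names
barplot(ests, beside = TRUE, legend.text = TRUE,
        ylab = "Mean % emphasis", las = 2)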

Finally, let's revisit the floortime question but this time counting only vocabulary that Bara et al. thought was substantively relevant.

df$relevant_word_count <- rowSums(dictout)
df %>%
  group_by(vote) %>%
  summarise(floortime = sum(relevant_word_count),
            speakers = length(vote))

Not much change there.


