knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
quanteda
Let's start by reading in the data from the abortion debate analyzed by Bara et al.
I've concatenated each speaker's contributions into a single file.
(This is certainly not the only way to think about analyzing this data,
but it's what Bara et al. did.)
First load the quanteda package and some other useful stuff
library(quanteda) # for general text analysis
library(dplyr)
library(iqmr)
then read in some text files and make a corpus
from them
data("corpus_bara_speaker")
Corpora are somewhat complex objects, designed for holding very large amounts of text. Here are the basics:
To get a basic summary of the texts and document variables
summary(corpus_bara_speaker)
Those word and sentence counts come from the ntype (vocabulary size), ntoken (word count), and nsentence (sentence count) functions.
To extract the 12th speaker's contribution
corpus_bara_speaker[12]
Notice that this is a named character vector. (corpus_bara_speaker[13:14] would also work.)
To get all the texts
texts(corpus_bara_speaker)
but maybe don't, or assign it to something, else it'll splurge into your console for a while.
To pull out one of the docvars, you can treat the corpus as if it has document variables as columns. Here are just the speaker voting decisions
corpus_bara_speaker[,'vote']
(corpus_bara_speaker[['vote']] also works, if you prefer to get a one column data frame).
To get the lot of them as a data frame
dvars <- docvars(corpus_bara_speaker)
To get just some of the documents there is a subset command that looks to the document variables to say which documents should be kept. Here are all the speakers that voted 'no' at the end of the debate and said more than 100 words during it
no_corp <- corpus_subset(corpus_bara_speaker, vote == "no" & ntoken(corpus_bara_speaker) > 100)
no_corp
(Goodbye Mr Mahon)
Finally, it's sometimes convenient to be able to switch between thinking of the corpus as a set of documents and as a set of paragraphs, or even sentences.
para_corp <- corpus_reshape(corpus_bara_speaker, to = "paragraphs") # or "sentences"
head(summary(para_corp)) # Just the top few lines
Happily we can always reverse this process by setting the to argument back to "documents".
Let's explore a little more by looking for the key terms in play.
One way to do this is to look for collocations. The collocation finder function operates on the tokens of the corpus, so we extract them first
toks <- tokens(corpus_bara_speaker)
colls <- textstat_collocations(toks)
head(colls, 20)
This is disappointingly unsubstantive, but if we work a bit harder we can get better results. First we'll remove the stopwords (leaving gaps where they were).
toks2 <- tokens_remove(toks, stopwords(), padding = TRUE)
Now rerun the function, maintaining the capitalization
coll2 <- textstat_collocations(toks2, tolower = FALSE, size = 2)
head(coll2, 20)
We can also ask for three word collocations
coll3 <- textstat_collocations(toks2, tolower = FALSE, size = 3)
head(coll3, 30)
If we're really serious about collocation hunting, it's probably best to use a dedicated package.
Let's get a bit more confirmatory in our text analysis
Since this is an abortion debate, let's see how the honorable folk talk about mothers and babies. We'll use the 'keyword in context' function
kw_mother <- kwic(corpus_bara_speaker, "mother*", window = 10) # 10 words either side of every instance
head(kw_mother)
KWICs can get quite large, but if you want to see it all
kw_mother
will open a browser with the whole thing.
There is much less talk of babies than of mothers. In this debate, the other major actors are doctors and their professional association, which you can investigate the same way.
The output of kwic is simply a data frame, so one thing that's often useful is to treat the left and right sides of the kwic as a document (about babies), e.g. like this:
babes <- kwic(corpus_bara_speaker, "babi*", window = 10)
txt <- paste(babes$pre, babes$post, collapse = " ") # make one big string
This constructed document contains, by definition, all the ways the term was used in the corpus, so you can then examine what sorts of words tend to be used around it, e.g. by using a content analysis dictionary.
As far as I know this was pioneered by the Yoshikoder software, and bears some obvious similarities to the word-embedding analyses that are currently fashionable.
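To make the idea of a constructed context document concrete, here is a minimal base R sketch that does not use quanteda at all. The sentence, the window size, and the variable names are invented for illustration, and the prefix match only loosely imitates the "babi*" pattern used above.

```r
# Toy token sequence (invented, not taken from the debate)
words <- c("the", "unborn", "babies", "have", "rights", "too")

# Positions of tokens starting with "babi" (a rough stand-in for "babi*")
hits <- which(startsWith(words, "babi"))

# Collect up to `window` tokens on each side of every hit, dropping the hit itself
window <- 2
context <- unlist(lapply(hits, function(i) {
  idx <- setdiff(max(1, i - window):min(length(words), i + window), i)
  words[idx]
}))

# Paste the context tokens into one "document" about the keyword
context_doc <- paste(context, collapse = " ")
context_doc
```

The resulting string can then be tokenized and counted like any other document.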
Now let's get even more confirmatory, and apply the mapping between words and topics described by Bara et al.'s dictionary to this corpus, in order to replicate their first analysis.
Our first step will be to create a document feature matrix (dfm), after which dictionary application is straightforward. But since lots of models require a dfm, we'll linger a little on the steps of the process.
quanteda makes a basic dfm quite straightforward
corpdfm <- dfm(corpus_bara_speaker) # lowercases by default, but not much more
dim(corpdfm)
featnames(corpdfm)[1:40] # really just colnames
But let's remove some things that aren't (currently) of interest to us
corpdfm <- dfm(corpus_bara_speaker, remove = stopwords(), remove_punct = TRUE, remove_numbers = TRUE)
dim(corpdfm) # a bit smaller
featnames(corpdfm)[1:40]
We could also stem
stemdfm <- dfm(corpus_bara_speaker, remove = stopwords(), remove_punct = TRUE, remove_numbers = TRUE, stem = TRUE)
dim(stemdfm) # about 1000 fewer 'word's
featnames(stemdfm)[1:40]
but our dictionary entries aren't stemmed, so let's save stemdfm for later.
For modeling, we'll often want to remove the low frequency and idiosyncratic words
smallcorpdfm <- dfm_trim(corpdfm, min_termfreq = 5, min_docfreq = 5)
dim(smallcorpdfm)
where min_termfreq removes any word that occurs fewer than 5 times and min_docfreq removes any word that occurs in fewer than 5 documents, however often it occurs. That makes things a fair bit smaller. But again, we don't want to miss dictionary entries.
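The difference between the two thresholds is easiest to see on a tiny count matrix. The following is a base R sketch of the same kind of filtering, not quanteda's own implementation; the counts and the thresholds (5 occurrences, 2 documents) are made up for the example.

```r
# Toy document-feature counts: rows are documents, columns are words.
# (Made-up numbers, not taken from the debate corpus.)
counts <- matrix(
  c(3, 0, 1,
    2, 0, 0,
    4, 6, 0),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("bill", "clause", "baby"))
)

term_freq <- colSums(counts)      # total occurrences of each word
doc_freq  <- colSums(counts > 0)  # number of documents containing each word

# Keep words that occur at least 5 times overall AND in at least 2 documents
keep <- term_freq >= 5 & doc_freq >= 2
trimmed <- counts[, keep, drop = FALSE]
colnames(trimmed)
```

Here "clause" is frequent enough overall but appears in only one document, so the document frequency threshold removes it, while "baby" fails both thresholds.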
There's also a wordcloud function for viewing the document feature matrix, but we won't use it because wordclouds are silly.
In the debate the Speaker, Mr Horace King, said he would try to give equal time to both sides of the debate. Did it happen this way?
It's hard to know whether the debate was persuasive, since we do not know the speakers' prior beliefs (though we could find out from their previous debates). So let us assume that there was no substantial persuasion, and that no speaker spoke particularly slowly.
These imply that we can proxy speaking time with number of words said,
and side with final vote.
df <- docvars(corpus_bara_speaker)
df$word_count <- ntoken(corpus_bara_speaker)
df %>%
  group_by(vote) %>%
  summarise(floortime = sum(word_count), speakers = length(vote))
Although three quarters of the speakers voted "yes", it seems that floor time was about two to one "yes" to "no" voters.
Let's turn to the content analysis dictionary that Bara used.
A content analysis dictionary in quanteda terms can be made out of a list of vectors of words, like this:
dictionary(list(medics = c("doctor", "medical", "hospital"), mothers = c("mother", "parents")))
or imported in the format of some other content analysis program.
(Quanteda can deal with dictionaries from Wordstat, Yoshikoder, and LIWC).
We'll use a version of the dictionary that's bundled with the iqmr package
data(dict_baraetal)
With dictionary in hand we can now go category counting rather than word counting
baradfm <- dfm(corpus_bara_speaker, dictionary = dict_baraetal)
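As a rough picture of what category counting means, the base R sketch below counts how many tokens of a toy sentence fall under each category of the small example dictionary shown earlier. It is only an illustration under simplifying assumptions: matching is exact, with none of the wildcard handling that a real content analysis dictionary may use, and the sentence is invented.

```r
# The small example dictionary from above, as a plain named list
dict <- list(medics  = c("doctor", "medical", "hospital"),
             mothers = c("mother", "parents"))

# A toy, already tokenized and lowercased sentence (invented)
words <- c("the", "doctor", "told", "the", "mother",
           "that", "the", "hospital", "was", "full")

# For each category, count the tokens that exactly match one of its entries
category_counts <- sapply(dict, function(entries) sum(words %in% entries))
category_counts
```

Doing this for every speaker gives one row of category counts per document, which is the shape of the dictionary-based dfm.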
Since this output is not absolutely massive
dim(baradfm)
let's force it into a regular R matrix to take a look at the whole thing without being swamped in elements
dictout <- as.matrix(baradfm)
dictout
And recreate some of Bara et al.'s Table 3, repeated below, as a bar plot.
tab_baraetal <- data.frame(
  Mean = c(13.59, 7.82, 21.71, 4.61, 32.17, 20.09),
  SD = c(2.98, 3.36, 4.73, 2.51, 6.94, 4.86),
  row.names = c("advocacy", "legal", "medical", "moral", "procedural", "social"),
  stringsAsFactors = FALSE
)
knitr::kable(t(tab_baraetal))
prop_emph <- (dictout / rowSums(dictout))
tab <- data.frame(
  Mean = 100 * apply(prop_emph, 2, mean),
  SD = 100 * apply(prop_emph, 2, sd),
  row.names = c("advocacy", "legal", "medical", "moral", "procedural", "social"),
  stringsAsFactors = FALSE
)
knitr::kable(t(tab))
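The division dictout / rowSums(dictout) relies on R recycling the vector of row sums down the columns of the matrix, so each speaker's counts are divided by that speaker's own total. A minimal sketch with a made-up two-speaker, two-category matrix (the names and numbers are assumptions for illustration only):

```r
# Toy category counts: two speakers, two categories (invented numbers)
toy <- matrix(c(2, 2,
                1, 4),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("spk1", "spk2"), c("legal", "medical")))

# Dividing by rowSums() turns each row into proportions that sum to 1
prop <- toy / rowSums(toy)
rowSums(prop)

# Mean percentage emphasis per category, averaged over speakers
mean_pct <- 100 * colMeans(prop)
mean_pct
```

Because each speaker is normalized first, a long speech and a short one carry equal weight in the category means.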
Finally, let's revisit the floortime question but this time counting only vocabulary that Bara et al. thought was substantively relevant.
df$relevant_word_count <- rowSums(dictout)
df %>%
  group_by(vote) %>%
  summarise(floortime = sum(relevant_word_count), speakers = length(vote))
Not much change there.