Since quanteda is available on CRAN, you can install it using your GUI's R package installer, or execute:
install.packages("quanteda")
See the instructions at https://github.com/quanteda/quanteda to install the GitHub version.
The following packages work well with or extend quanteda and we recommend that you also install them:
devtools::install_github("quanteda/quanteda.corpora")
devtools::install_github("kbenoit/LIWCalike")
You load the package to access its functions and data.
library(quanteda)
quanteda has a simple and powerful companion package for loading texts: readtext. The main function in this package, readtext(), takes a file or fileset from disk or a URL, and returns a type of data.frame that can be used directly with the corpus() constructor function to create a quanteda corpus object.
readtext() works on:
- text (.txt) files;
- comma-separated-value (.csv) files;
- XML and JSON formatted data (as demonstrated in the examples below).

The corpus constructor command corpus() works directly on:
- a vector of character objects already loaded into memory;
- a VCorpus corpus object from the tm package.

The simplest case is to create a corpus from a vector of texts already in memory in R. This gives the advanced R user complete flexibility with his or her choice of text inputs, as there are almost endless ways to get a vector of texts into R.
If we already have the texts in this form, we can call the corpus constructor function directly. We can demonstrate this on the built-in character object of the texts about immigration policy extracted from the 2010 election manifestos of the UK political parties (called data_char_ukimmig2010).
my_corpus <- corpus(data_char_ukimmig2010)  # build a new corpus from the texts
summary(my_corpus)
If we wanted, we could add some document-level variables -- what quanteda calls docvars -- to this corpus.
We can do this using R's names() function to get the names of the character vector data_char_ukimmig2010, and assign these to a document variable (docvar).
docvars(my_corpus, "Party") <- names(data_char_ukimmig2010) docvars(my_corpus, "Year") <- 2010 summary(my_corpus)
If we wanted to tag each document with additional meta-data not considered a document variable of interest for analysis, but rather something that we need to know as an attribute of the document, we could also add those to our corpus.
metadoc(my_corpus, "language") <- "english" metadoc(my_corpus, "docsource") <- paste("data_char_ukimmig2010", 1:ndoc(my_corpus), sep = "_") summary(my_corpus, showmeta = TRUE)
The last command, metadoc, allows you to define your own document meta-data fields. Note that in assigning just the single value of "english", R has recycled the value until it matches the number of documents in the corpus. In creating a simple tag for our custom metadoc field docsource, we used the quanteda function ndoc() to retrieve the number of documents in our corpus. This function is deliberately designed to work in a way similar to functions you may already use in R, such as nrow() and ncol().
require(readtext)

# Twitter json
mytf1 <- readtext("~/Dropbox/QUANTESS/social media/zombies/tweets.json")
my_corpusTwitter <- corpus(mytf1)
summary(my_corpusTwitter, 5)

# generic json - needs a textfield specifier
mytf2 <- readtext("~/Dropbox/QUANTESS/Manuscripts/collocations/Corpora/sotu/sotu.json",
                  textfield = "text")
summary(corpus(mytf2), 5)

# text file
mytf3 <- readtext("~/Dropbox/QUANTESS/corpora/project_gutenberg/pg2701.txt", cache = FALSE)
summary(corpus(mytf3), 5)

# multiple text files
mytf4 <- readtext("~/Dropbox/QUANTESS/corpora/inaugural/*.txt", cache = FALSE)
summary(corpus(mytf4), 5)

# multiple text files with docvars from filenames
mytf5 <- readtext("~/Dropbox/QUANTESS/corpora/inaugural/*.txt",
                  docvarsfrom = "filenames", sep = "-",
                  docvarnames = c("Year", "President"))
summary(corpus(mytf5), 5)

# XML data
mytf6 <- readtext("~/Dropbox/QUANTESS/quanteda_working_files/xmlData/plant_catalog.xml",
                  textfield = "COMMON")
summary(corpus(mytf6), 5)

# csv file
write.csv(data.frame(inaugSpeech = texts(data_corpus_inaugural),
                     docvars(data_corpus_inaugural)),
          file = "/tmp/inaug_texts.csv", row.names = FALSE)
mytf7 <- readtext("/tmp/inaug_texts.csv", textfield = "inaugSpeech")
summary(corpus(mytf7), 5)
A corpus is designed to be a "library" of original documents that have been converted to plain, UTF-8 encoded text, and stored along with meta-data at the corpus level and at the document-level. We have a special name for document-level meta-data: docvars. These are variables or features that describe attributes of each document.
A corpus is designed to be a more or less static container of texts with respect to processing and analysis. This means that the texts in a corpus are not designed to be changed internally through (for example) cleaning or pre-processing steps, such as stemming or removing punctuation. Rather, texts can be extracted from the corpus as part of processing and assigned to new objects, but the idea is that the corpus will remain as an original reference copy so that other analyses -- for instance those in which stems and punctuation were required, such as analyzing a reading ease index -- can be performed on the same corpus.
To extract texts from a corpus, we use an extractor called texts().
texts(data_corpus_inaugural)[2]
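Returning to the static-container idea above, here is a minimal sketch (added here, not from the original vignette) of extracting the texts into a new object and transforming only the copy, leaving the corpus itself unchanged; char_tolower() is used purely as an example transformation:

inaug_texts <- texts(data_corpus_inaugural)  # extract the texts from the corpus
inaug_lower <- char_tolower(inaug_texts)     # transform the extracted copy; the corpus is untouched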
To summarize the texts from a corpus, we can call a summary() method defined for a corpus.
data(data_corpus_irishbudget2010, package = "quanteda.corpora")
summary(data_corpus_irishbudget2010)
We can save the output from the summary command as a data frame, and plot some basic descriptive statistics with this information:
tokenInfo <- summary(data_corpus_inaugural)
if (require(ggplot2))
    ggplot(data = tokenInfo, aes(x = Year, y = Tokens, group = 1)) +
        geom_line() +
        geom_point() +
        scale_x_continuous(labels = c(seq(1789, 2017, 12)), breaks = seq(1789, 2017, 12)) +
        theme_bw()

# Longest inaugural address: William Henry Harrison
tokenInfo[which.max(tokenInfo$Tokens), ]
The + operator provides a simple method for concatenating two corpus objects. If they contain different sets of document-level variables, these will be stitched together in a fashion that guarantees that no information is lost. Corpus-level metadata is also concatenated.
library(quanteda)
my_corpus1 <- corpus(data_corpus_inaugural[1:5])
my_corpus2 <- corpus(data_corpus_inaugural[53:58])
my_corpus3 <- my_corpus1 + my_corpus2
summary(my_corpus3)
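Because the two sub-corpora above share the same docvars, here is a brief hedged sketch (the tiny corpora and variable names are invented for illustration) of what happens when the document-level variables differ: each variable is kept, and padded with NA where a document lacks it.

corp_a <- corpus(c(d1 = "Text one."), docvars = data.frame(source = "A"))
corp_b <- corpus(c(d2 = "Text two."), docvars = data.frame(lang = "en"))
summary(corp_a + corp_b)  # both the 'source' and 'lang' docvars are retained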
The corpus_subset() function is defined for corpus objects, and allows a new corpus to be extracted based on logical conditions applied to docvars:
summary(corpus_subset(data_corpus_inaugural, Year > 1990))
summary(corpus_subset(data_corpus_inaugural, President == "Adams"))
The kwic function (keywords-in-context) performs a search for a word and allows us to view the contexts in which it occurs:
kwic(data_corpus_inaugural, "terror")
kwic(data_corpus_inaugural, "terror", valuetype = "regex")
kwic(data_corpus_inaugural, "communist*")
In the above summary, Year and President are variables associated with each document. We can access such variables with the docvars() function.
# inspect the document-level variables
head(docvars(data_corpus_inaugural))

# inspect the corpus-level metadata
metacorpus(data_corpus_inaugural)
More corpora are available from the quanteda.corpora package.
In order to perform statistical analysis such as document scaling, we must extract a matrix associating values for certain features with each document. In quanteda, we use the dfm function to produce such a matrix. "dfm" is short for document-feature matrix, and always refers to documents in rows and "features" as columns. We fix this dimensional orientation because it is standard in data analysis to have a unit of analysis as a row, and features or variables pertaining to each unit as columns. We call them "features" rather than terms, because features are more general than terms: they can be defined as raw terms, stemmed terms, the parts of speech of terms, terms after stopwords have been removed, or a dictionary class to which a term belongs. Features can be entirely general, such as ngrams or syntactic dependencies, and we leave this open-ended.
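As a brief illustration of this generality (an example added here, not in the original text), features could be bigrams formed with tokens_ngrams() before constructing the dfm; the use of the built-in inaugural corpus is simply an assumption for the sketch:

toks <- tokens(data_corpus_inaugural, remove_punct = TRUE)
bigram_dfm <- dfm(tokens_ngrams(toks, n = 2))  # documents in rows, bigram features in columns
bigram_dfm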
To simply tokenize a text, quanteda provides a powerful command called tokens(). This produces an intermediate object, consisting of a list of tokens in the form of character vectors, where each element of the list corresponds to an input document. tokens() is deliberately conservative, meaning that it does not remove anything from the text unless told to do so.
txt <- c(text1 = "This is $10 in 999 different ways,\n up and down; left and right!",
         text2 = "@kenbenoit working: on #quanteda 2day\t4ever, http://textasdata.com?page=123.")
tokens(txt)
tokens(txt, remove_numbers = TRUE, remove_punct = TRUE)
tokens(txt, remove_numbers = FALSE, remove_punct = TRUE)
tokens(txt, remove_numbers = TRUE, remove_punct = FALSE)
tokens(txt, remove_numbers = FALSE, remove_punct = FALSE)
We also have the option to tokenize characters:
tokens("Great website: http://textasdata.com?page=123.", what = "character") tokens("Great website: http://textasdata.com?page=123.", what = "character", remove_separators = FALSE)
and sentences:
# sentence level
tokens(c("Kurt Vongeut said; only assholes use semi-colons.",
         "Today is Thursday in Canberra: It is yesterday in London.",
         "En el caso de que no puedas ir con ellos, ¿quieres ir con nosotros?"),
       what = "sentence")
Tokenizing texts is an intermediate option, and most users will want to skip straight to constructing a document-feature matrix. For this, we have a Swiss-army knife function, called dfm(), which performs tokenization and tabulates the extracted features into a matrix of documents by features. Unlike the conservative approach taken by tokens(), the dfm() function applies certain options by default, such as tolower() -- a separate function for lower-casing texts -- and removes punctuation. All of the options to tokens() can be passed to dfm(), however.
my_corpus <- corpus_subset(data_corpus_inaugural, Year > 1990)

# make a dfm
my_dfm <- dfm(my_corpus)
my_dfm[, 1:5]
Other options for a dfm() include removing stopwords, and stemming the tokens.
# make a dfm, removing stopwords and applying stemming
myStemMat <- dfm(my_corpus, remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
myStemMat[, 1:5]
The option remove provides a list of tokens to be ignored. Most users will supply a list of pre-defined "stop words", defined for numerous languages, accessed through the stopwords() function:
head(stopwords("english"), 20) head(stopwords("russian"), 10) head(stopwords("arabic"), 10)
The dfm can be inspected in the Environment pane in RStudio, or by calling R's View function. Calling plot on a dfm will display a wordcloud.
my_dfm <- dfm(data_char_ukimmig2010, remove = stopwords("english"), remove_punct = TRUE)
my_dfm
To access a list of the most frequently occurring features, we can use topfeatures():
topfeatures(my_dfm, 20) # 20 top words
Plotting a word cloud is done using textplot_wordcloud(), for a dfm class object. This function passes arguments through to wordcloud() from the wordcloud package, and can prettify the plot using the same arguments:
set.seed(100)
textplot_wordcloud(my_dfm, min_count = 6, random_order = FALSE,
                   rotation = .25,
                   colors = RColorBrewer::brewer.pal(8, "Dark2"))
Often, we are interested in analysing how texts differ according to substantive factors which may be encoded in the document variables, rather than simply by the boundaries of the document files. We can group documents which share the same value for a document variable when creating a dfm:
by_party_dfm <- dfm(data_corpus_irishbudget2010, groups = "party", remove = stopwords("english"), remove_punct = TRUE)
We can sort this dfm, and inspect it:
dfm_sort(by_party_dfm)[, 1:10]
For some applications we have prior knowledge of sets of words that are indicative of traits we would like to measure from the text. For example, a general list of positive words might indicate positive sentiment in a movie review, or we might have a dictionary of political terms which are associated with a particular ideological stance. In these cases, it is sometimes useful to treat these groups of words as equivalent for the purposes of analysis, and sum their counts into classes.
For example, let's look at how words associated with terrorism and words associated with the economy vary by President in the inaugural speeches corpus. From the original corpus, we select Presidents since Clinton:
recent_corpus <- corpus_subset(data_corpus_inaugural, Year > 1991)
Now we define a demonstration dictionary:
my_dict <- dictionary(list(terror = c("terrorism", "terrorists", "threat"), economy = c("jobs", "business", "grow", "work")))
We can use the dictionary when making the dfm:
by_pres_mat <- dfm(recent_corpus, dictionary = my_dict)
by_pres_mat
The constructor function dictionary() also works with two common "foreign" dictionary formats: the LIWC and Provalis Research's Wordstat format. For instance, we can load the LIWC dictionary and apply it to the Presidential inaugural speech corpus:
liwcdict <- dictionary(file = "~/Dropbox/QUANTESS/dictionaries/LIWC/LIWC2001_English.dic",
                       format = "LIWC")
liwcdfm <- dfm(data_corpus_inaugural[52:58], dictionary = liwcdict)
liwcdfm[, 1:10]
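Wordstat dictionaries are loaded in the same way by setting format = "wordstat"; the file path below is hypothetical and only illustrates the call:

# hypothetical path to a Provalis Research Wordstat .cat dictionary file
wsdict <- dictionary(file = "~/dictionaries/mydict.cat", format = "wordstat")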
We can also compute similarities between documents, for example comparing recent inaugural addresses to the two Obama addresses:

pres_dfm <- dfm(corpus_subset(data_corpus_inaugural, Year > 1980),
                remove = stopwords("english"), stem = TRUE, remove_punct = TRUE)
obama_simil <- textstat_simil(pres_dfm, pres_dfm[c("2009-Obama", "2013-Obama"), ],
                              margin = "documents", method = "cosine")
obama_simil
# dotchart(as.list(obama_simil)$"2009-Obama", xlab = "Cosine similarity")
We can use these distances to plot a dendrogram, clustering presidents:
data(data_corpus_sotu, package = "quanteda.corpora")

pres_dfm <- dfm(corpus_subset(data_corpus_sotu, Date > as.Date("1980-01-01")),
                stem = TRUE, remove_punct = TRUE, remove = stopwords("english"))
pres_dfm <- dfm_trim(pres_dfm, min_termfreq = 5, min_docfreq = 3)

# hierarchical clustering - get distances on normalized dfm
pres_dist_mat <- textstat_dist(dfm_weight(pres_dfm, "prop"))
# hierarchical clustering of the distance object
pres_cluster <- hclust(pres_dist_mat)
# label with document names
pres_cluster$labels <- docnames(pres_dfm)
# plot as a dendrogram
plot(pres_cluster, xlab = "", sub = "",
     main = "Euclidean Distance on Normalized Token Frequency")
We can also look at term similarities:
sim <- textstat_simil(pres_dfm, pres_dfm[, c("fair", "health", "terror")],
                      method = "cosine", margin = "features")
lapply(as.list(sim), head, 10)
Here is a demonstration of unsupervised document scaling using the "wordfish" model:
library("quanteda.textmodels") # make prettier document names ie_dfm <- dfm(data_corpus_irishbudget2010) textmodel_wordfish(ie_dfm, dir = c(2, 1))
quanteda makes it very easy to fit topic models as well, e.g.:
quant_dfm <- dfm(data_corpus_irishbudget2010,
                 remove_punct = TRUE, remove_numbers = TRUE, remove = stopwords("english"))
quant_dfm <- dfm_trim(quant_dfm, min_termfreq = 4, max_docfreq = 10)
quant_dfm

set.seed(100)
if (require("stm")) {
    my_lda_fit20 <- stm(quant_dfm, K = 20, verbose = FALSE)
    plot(my_lda_fit20)
}