set.seed(42)
In computer-assisted text analysis, a number of algorithms require a so-called term-document-matrix as input. This is the case, for example, for topic models, but also for many other machine learning approaches and, of particular importance in political science, for scaling algorithms such as Wordscores or Wordfish.
Term-document-matrices are based on the so-called 'bag-of-words' representation of text: by representing a text merely as its word counts, its (grammatical) structure is dissolved. A term-document-matrix gathers the vector representations of all documents of a corpus, with the terms as rows and the documents as columns. The value in each cell then indicates how often term i occurs in document j.
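To make the bag-of-words idea concrete, the following sketch builds a tiny term-document-matrix in base R. The three toy documents and all object names are made up for illustration, and splitting on single spaces is a simplifying assumption; a real corpus would be tokenized more carefully.

```r
# Three toy documents (hypothetical content, for illustration only)
docs <- c(
  d1 = "peace and security",
  d2 = "security council security",
  d3 = "peace peace peace"
)

# Naive tokenization: split on single spaces (an assumption of this sketch)
tokens <- lapply(docs, function(x) strsplit(x, " ", fixed = TRUE)[[1]])

# The vocabulary defines the rows of the matrix
vocab <- sort(unique(unlist(tokens)))

# Count each term per document; terms become rows, documents become columns
tdm <- sapply(tokens, function(x) table(factor(x, levels = vocab)))
print(tdm)
```

Word order within each document is lost in `tdm`: only the counts per term and document remain.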
From a technical perspective, term-document-matrices have to be implemented as sparse matrices: given the vast variety of vocabulary, any given term is absent from most documents. The polmineR package uses the TermDocumentMatrix class of the tm package, which itself is a slight modification of the simple_triplet_matrix of the slam package.
Please note that part of the functionality for the following workflow is only available with polmineR version 0.7.9.9006. If required, install the current development version of polmineR.
In the examples, the UNGA corpus is used. The corpus has to be activated after loading polmineR.
if (packageVersion("polmineR") < package_version("0.7.10.9006"))
  devtools::install_github("PolMine/polmineR", ref = "dev")
library(polmineR)
use("UNGA")
for (pkg in c("magrittr", "slam", "tm", "quanteda", "Matrix")){
  if (!pkg %in% rownames(installed.packages())) install.packages(pkg)
  library(package = pkg, character.only = TRUE)
}
To create objects of the class TermDocumentMatrix or DocumentTermMatrix, the polmineR package provides the methods as.TermDocumentMatrix() and as.DocumentTermMatrix().
Depending on which package is to be used for the subsequent analysis, the classes of the Matrix package (sparseMatrix) or a document-feature matrix (dfm) of the quanteda package may be required. Both can easily be obtained by type conversion.
It is important to understand that the TermDocumentMatrix class of the tm package and the simple_triplet_matrix of the slam package are largely identical. The only addition is a weighting attribute, which here is the term frequency.
The simple_triplet_matrix is defined by three vectors i, j and v. The first vector i indicates the row of a value, the second vector j indicates its column, and the vector v holds the value itself. Because only the cells that actually contain a value are stored, memory requirements are kept to a minimum. Stored densely, a matrix with many documents and a large vocabulary would be huge and could not be handled even by capable machines.
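The triplet idea can be sketched in base R as follows. The toy matrix and its values are invented for illustration. The final step assumes the slam package is installed and is skipped otherwise.

```r
# Toy term-document-matrix (terms as rows, documents as columns); values are made up
dense <- matrix(
  c(2, 0, 0,   # column d1
    0, 4, 0,   # column d2
    1, 0, 3),  # column d3
  nrow = 3,
  dimnames = list(
    Terms = c("peace", "security", "council"),
    Docs = c("d1", "d2", "d3")
  )
)

# Positions of the non-zero cells: first column = row index, second = column index
nz <- which(dense != 0, arr.ind = TRUE)
i <- unname(nz[, 1])
j <- unname(nz[, 2])
v <- dense[nz]

# Rebuild the dense matrix from the triplets to show that nothing is lost
rebuilt <- matrix(0, nrow = nrow(dense), ncol = ncol(dense), dimnames = dimnames(dense))
rebuilt[cbind(i, j)] <- v
print(identical(rebuilt, dense))

# The same triplets as a slam object (only if slam is available)
if (requireNamespace("slam", quietly = TRUE)) {
  stm <- slam::simple_triplet_matrix(
    i = i, j = j, v = v,
    nrow = nrow(dense), ncol = ncol(dense),
    dimnames = dimnames(dense)
  )
  print(stm)
}
```

Only four of the nine cells are stored here. With thousands of documents and tens of thousands of terms, the share of stored cells becomes far smaller.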
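One way to obtain a DocumentTermMatrix is to apply the as.DocumentTermMatrix() method to a corpus. The only parameters which have to be stated explicitly are the p-attribute to be counted (p_attribute) and the s-attribute that defines the documents (s_attribute):
dtm <- polmineR::as.DocumentTermMatrix("UNGA", p_attribute = "word", s_attribute = "year")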
partition_bundle
Both use cases discussed earlier assume that there is an s-attribute which describes the inner structure of the corpus or partition. This is fine if you want to analyse a corpus along the lines of a given s-attribute such as "year" in the example above. However, it is conceivable that documents are defined by combinations of several s-attributes. To accommodate this, the input of the as.DocumentTermMatrix() method can also be a partition_bundle object.
The following scenario illustrates these processing steps. Note that the partition objects in the partition_bundle have to be enriched (i.e. a count has to be performed). The argument col of the as.DocumentTermMatrix() method specifies which column of the enriched partition objects in the partition_bundle (which then is also a count_bundle) contains the count values which are passed to the cells of the Document-Term-Matrix.
unga2016 <- partition("UNGA", year = 2016)
unga2016_speakers <- partition_bundle(unga2016, s_attribute = "speaker", progress = TRUE)
unga2016_speakers <- enrich(unga2016_speakers, p_attribute = "word", progress = TRUE)
dtm <- polmineR::as.DocumentTermMatrix(unga2016_speakers, col = "count")
as.matrix(dtm)[sort(sample(1:913, 10)), c(1:5, 300:304)]
as.speeches() to create a partition_bundle
doit <- !file.exists("~/lab/tmp/unga2000s_speeches_lds.RData")
In plenary protocols, a plausible unit to base the documents of a document-term-matrix on is a single speech of a delegate. The polmineR package provides the as.speeches() method to create a partition_bundle of speeches. It can be applied to both corpora and partition objects.
The segmentation of plenary protocols follows the heuristic assumption that a speech is an utterance of a single speaker on one day which is not interrupted by another speaker's utterance longer than 500 tokens.
Otherwise, shorter interruptions such as interjections or questions would cause one coherent speech of a speaker to be identified as two separate speeches. At the same time, the possibility that one speaker gives multiple speeches on one day is taken into account.
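The heuristic can be illustrated with a small base R sketch. This is not the implementation used by as.speeches(). The function segment_speeches(), its arguments and the toy data are hypothetical, and the sketch assumes the utterances are given in their original order together with their token counts.

```r
# Hypothetical helper: assign a speech id to each utterance.
# Utterances by the same speaker on the same day are merged into one speech
# as long as no more than `gap` tokens lie between them.
segment_speeches <- function(speaker, date, n_tokens, gap = 500L) {
  stopifnot(length(speaker) == length(date), length(date) == length(n_tokens))
  end <- cumsum(n_tokens)   # token offset at which each utterance ends
  start <- end - n_tokens   # token offset at which each utterance starts
  speech_id <- integer(length(speaker))
  last_seen <- list()       # per speaker and day: end offset and id of the last utterance
  counter <- 0L
  for (k in seq_along(speaker)) {
    key <- paste(speaker[k], date[k], sep = "|")
    prev <- last_seen[[key]]
    if (!is.null(prev) && (start[k] - prev$end) <= gap) {
      speech_id[k] <- prev$id      # short interruption: continue the earlier speech
    } else {
      counter <- counter + 1L      # first utterance or long interruption: new speech
      speech_id[k] <- counter
    }
    last_seen[[key]] <- list(end = end[k], id = speech_id[k])
  }
  speech_id
}

# Toy protocol: A speaks, B interjects briefly, A continues,
# B gives a long speech, A takes the floor again.
ids <- segment_speeches(
  speaker = c("A", "B", "A", "B", "A"),
  date = rep("2005-09-14", 5),
  n_tokens = c(300, 50, 200, 700, 100)
)
print(ids)
```

In the toy protocol, the 50-token interjection by B does not split the first speech of A, whereas the 700-token speech by B does, so the last utterance of A is counted as a new speech.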
unga_2000s <- partition("UNGA", year = 2000:2010)
unga_2000s_speeches <- as.speeches(unga_2000s, s_attribute_date = "date", s_attribute_name = "speaker")
unga_2000s_speeches <- enrich(unga_2000s_speeches, p_attribute = "word")
dtm <- polmineR::as.DocumentTermMatrix(unga_2000s_speeches, col = "count")
In most use cases (for example topic modelling), an unfiltered matrix will be unnecessarily large, resulting in longer computation times and noisier results. Hence, it is recommended to prune the matrix by removing rarely occurring words as well as noisy words and stopwords.
In the following, we first remove documents shorter than a certain threshold (here arbitrarily set to 100 words). The length of a document equals the sum of the token counts in its row (row_sums).
short_docs <- which(slam::row_sums(dtm) < 100)
if (length(short_docs) > 0) dtm <- dtm[-short_docs,]
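In a second step, words which occur fewer than five times in total are identified via the column sums (col_sums). These words are removed from the Document-Term-Matrix (dtm).
rare_words <- which(slam::col_sums(dtm) < 5)
if (length(rare_words) > 0) dtm <- dtm[,-rare_words]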
The noise() method of the polmineR package supports the identification of noisy words within the vocabulary (tokens with special characters, stopwords). These are removed as well.
noisy_tokens <- noise(colnames(dtm), specialChars = NULL, stopwordsLanguage = "en")
# Column positions of the noisy tokens (the index has to refer to the columns of dtm)
noisy_tokens_where <- which(colnames(dtm) %in% unique(unlist(noisy_tokens)))
if (length(noisy_tokens_where) > 0) dtm <- dtm[,-noisy_tokens_where]
# Also remove stopwords with a capitalized first letter (e.g. "The")
stopit <- tm::stopwords("en")
stopit_upper <- paste(toupper(substr(stopit, 1, 1)), substr(stopit, 2, nchar(stopit)), sep = "")
# Column positions of the capitalized stopwords within dtm
stopit_upper_where <- which(colnames(dtm) %in% stopit_upper)
if (length(stopit_upper_where) > 0) dtm <- dtm[, -stopit_upper_where]
# Pruning may leave documents without any tokens; drop them
empty_docs <- which(slam::row_sums(dtm) == 0)
if (length(empty_docs) > 0) dtm <- dtm[-empty_docs,]
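The pruning steps can be tried out on a small dense matrix. The sketch below uses base R (rowSums() and colSums() stand in for slam::row_sums() and slam::col_sums()). The helper prune_dtm(), the thresholds and the toy counts are hypothetical.

```r
# Toy document-term-matrix (documents as rows, terms as columns); counts are made up
toy <- matrix(
  c(3, 0, 1,
    2, 1, 0,
    0, 0, 1,
    4, 2, 0),
  nrow = 4, byrow = TRUE,
  dimnames = list(
    Docs = paste0("doc", 1:4),
    Terms = c("peace", "rare", "council")
  )
)

# Hypothetical helper combining the three pruning steps
prune_dtm <- function(m, min_doc_len, min_term_freq) {
  m <- m[rowSums(m) >= min_doc_len, , drop = FALSE]    # drop short documents
  m <- m[, colSums(m) >= min_term_freq, drop = FALSE]  # drop rare terms
  m[rowSums(m) > 0, , drop = FALSE]                    # drop documents left empty
}

pruned <- prune_dtm(toy, min_doc_len = 3, min_term_freq = 2)
print(pruned)
```

Logical indexing keeps the rows and columns that pass a threshold, so no guard against an empty index vector is needed. With negative integer indices, an empty index would select nothing at all, which is why the `length(...) > 0` checks are used in the workflow.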
# Fit the model only if no cached result exists (see doit); otherwise load the cached model
if (doit){
  lda <- topicmodels::LDA(
    dtm, k = 150, method = "Gibbs",
    control = list(burnin = 1000, iter = 3L, keep = 50, verbose = TRUE)
  )
  saveRDS(lda, file = "~/lab/tmp/unga2000s_speeches_lds.RData")
} else {
  lda <- readRDS(file = "~/lab/tmp/unga2000s_speeches_lds.RData")
}
lda_terms <- terms(lda, 10)
n_terms <- 5L
lda_terms <- terms(lda, n_terms)
y <- t(lda_terms)
colnames(y) <- paste("Term", 1:n_terms, sep = " ")
DT::datatable(y)
# Count word/pos combinations per speech and keep nouns only
pb <- partition("UNGA", year = 2000:2010) %>%
  as.speeches(s_attribute_date = "date", s_attribute_name = "speaker") %>%
  enrich(p_attribute = c("word", "pos"), progress = TRUE) %>%
  subset(pos == "NN")

# Drop the pos column so that each object only carries word counts
pb@objects <- lapply(
  pb@objects,
  function(x){
    x@stat[, "pos" := NULL]
    x@p_attribute <- "word"
    x
  }
)
dtm <- polmineR::as.DocumentTermMatrix(pb, col = "count")

short_docs <- which(slam::row_sums(dtm) < 100)
if (length(short_docs) > 0) dtm <- dtm[-short_docs,]

rare_words <- which(slam::col_sums(dtm) < 5)
if (length(rare_words) > 0) dtm <- dtm[,-rare_words]

empty_docs <- which(slam::row_sums(dtm) == 0)
if (length(empty_docs) > 0) dtm <- dtm[-empty_docs,]

# Fit the model only if this model has not been cached yet; otherwise load it
lda_pos_file <- "~/lab/tmp/lda_unga2000s_speeches_pos.RData"
if (!file.exists(lda_pos_file)){
  lda <- topicmodels::LDA(
    dtm, k = 150, method = "Gibbs",
    control = list(burnin = 1000, iter = 3L, keep = 50, verbose = TRUE)
  )
  saveRDS(lda, file = lda_pos_file)
} else {
  lda <- readRDS(file = lda_pos_file)
}
n_terms <- 5L
lda_terms <- terms(lda, n_terms)
y <- t(lda_terms)
colnames(y) <- paste("Term", 1:n_terms, sep = " ")
DT::datatable(y)
A number of R packages can be used for computer-assisted text analysis. While many of these packages require some version of a Term-Document-Matrix as input, the specific type of matrix can vary. Methods of the popular quanteda package use a Document-Feature-Matrix. By type conversion, we can cast the matrices polmineR provides to such a dfm.
First, we create a partition_bundle as illustrated before. In a second step, we cast this partition_bundle to a sparse matrix.
pb <- partition("UNGA", speaker = "Clinton|Bush|Obama|Trump", regex = TRUE) %>%
  partition_bundle(s_attribute = "speaker")
pb <- enrich(pb, p_attribute = "lemma")
dtm <- polmineR::as.sparseMatrix(pb, col = "count")
# Transpose so that documents are rows and terms are columns
dtm <- Matrix::t(dtm)
pg_dfm <- new(
  "dfm",
  i = dtm@i,
  p = dtm@p,
  x = dtm@x,
  Dim = dtm@Dim,
  Dimnames = list(
    docs = dtm@Dimnames$Docs,
    features = dtm@Dimnames$Terms
  )
)
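The slots i, p, x, Dim and Dimnames copied above belong to the column-compressed sparse format of the Matrix package. The following sketch shows what they contain for a toy matrix. The toy values and names are made up. The last step uses quanteda::as.dfm() as a possible alternative way to obtain a dfm from a sparse matrix; it only runs if quanteda is installed and has not been checked against the matrices produced in this workflow.

```r
library(Matrix)

# Toy document-feature counts (documents as rows, features as columns)
toy <- Matrix::sparseMatrix(
  i = c(1, 1, 2, 3),   # row (document) of each non-zero cell
  j = c(1, 3, 2, 3),   # column (feature) of each non-zero cell
  x = c(2, 1, 4, 3),   # the counts
  dims = c(3, 3),
  dimnames = list(
    docs = c("d1", "d2", "d3"),
    features = c("peace", "security", "council")
  )
)

# Column-compressed storage: values are ordered column by column
print(toy@i)    # zero-based row index of each stored value
print(toy@p)    # offsets marking where each column starts in i and x
print(toy@x)    # the stored values
print(toy@Dim)  # number of rows and columns

# Possible alternative to calling new("dfm", ...) directly (assumes quanteda is installed)
if (requireNamespace("quanteda", quietly = TRUE)) {
  toy_dfm <- quanteda::as.dfm(toy)
  print(quanteda::featnames(toy_dfm))
}
```

Unlike the triplet format, the column index is not stored per value here. It is implied by the offsets in `p`, which is why the rows and columns must already be in the right orientation before the slots are copied.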
Among other things, quanteda offers an easy-to-use implementation of Wordfish. Wordfish is a popular model for scaling the ideological positions of political actors. These positions are derived from word frequencies. For an overview of the underlying algorithm and a selection of publications which use Wordfish, see here.
We want to fit a Wordfish model. Since this requires a comparatively large amount of memory and we did not manually reduce the size of the matrix before, we now use the dfm_trim() method of quanteda to do so. We remove words which occur fewer than ten times.
pg_dfm_red <- dfm_trim(pg_dfm, min_termfreq = 10)
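# dir = c(3, 4) fixes the direction of the scale: document 3 is placed to the left of document 4
wfm_1 <- textmodel_wordfish(pg_dfm_red, dir = c(3, 4))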
The summary() method gives a first glimpse at the results.
wordfish_summary <- summary(wfm_1)
wordfish_summary$estimated.document.positions
head(wordfish_summary$estimated.feature.scores, 4) # show first 4 terms
textplot_scale1d(wfm_1, doclabels = pg_dfm_red@Dimnames$docs)
This shows that the results of the Wordfish model must be interpreted carefully. Which scale is represented here?
It might be helpful to have a look at the beta values per term.
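To see what a beta value expresses, recall that Wordfish models the count of term j in document i as Poisson distributed with rate exp(alpha_i + psi_j + beta_j * theta_i), where theta_i is the document position, alpha_i a document effect, psi_j a term effect and beta_j the weight of the term on the scale. The sketch below evaluates this rate for invented parameter values. None of the numbers are estimates from the UNGA data, and the document and term names are placeholders.

```r
# Invented parameters for two documents and two terms (illustration only)
theta <- c(doc_low = -1, doc_high = 1)           # document positions
alpha <- c(doc_low = 0, doc_high = 0)            # document fixed effects
psi   <- c(border = 1, development = 1)          # term fixed effects
beta  <- c(border = 1.5, development = -1.5)     # term weights on the scale

# Expected counts: lambda[i, j] = exp(alpha[i] + psi[j] + beta[j] * theta[i])
lambda <- exp(outer(alpha, psi, "+") + outer(theta, beta))
print(round(lambda, 2))
```

A term with a large positive beta is expected much more often in documents with a high position, a term with a large negative beta in documents with a low position. Terms with beta close to zero barely discriminate between positions.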
betaterm <- data.frame(terms = wfm_1$features, beta = wfm_1$beta)
head(betaterm[order(betaterm$beta),], 10)
head(betaterm[order(betaterm$beta, decreasing = TRUE),], 10)
textplot_scale1d(wfm_1, margin = "features", highlighted = c("America", "liberty", "development", "terror", "border", "inequality"))
Whether Term-Document-Matrix, Document-Term-Matrix or Document-Feature-Matrix, whether sparse or not: the representation of text in matrices with terms on one side and documents on the other is of utmost importance for many applications in computer-assisted text analysis.
The polmineR package offers the possibility to cast corpora into these different kinds of matrices to facilitate all kinds of analyses.
Note that this representation dissolves the actual structure of the underlying texts. The bag-of-words approach thus runs contrary to hermeneutic-interpretative approaches such as keyword-in-context analyses. A triangulated procedure for validation is recommended.