The possibilities for performing analyses go far beyond counting. Counting terms (and more complex lexical units) is, however, the fundamental operation for more complex algorithmic analyses, and counts can also constitute meaningful research results by themselves.
Counting can be understood as a measurement procedure. As with all other steps of analysis, each counting operation should be assessed in terms of validity: Do I actually measure what I intend to measure? Because of the natural variation of language use in particular, this is not trivial.
It is necessary to differentiate between absolute frequencies (counts) and relative frequencies (counts normalized by division by the corpus or subcorpus size, usually abbreviated as freq). In analyses, the choice between counts and frequencies has to be justified.
The fundamental methods explained here are count(), dispersion() and as.TermDocumentMatrix(). Like all other basic methods of the polmineR package, these methods are available for corpora as well as for partition objects.
Among others, the creation of time series and dictionary-based classification are used as examples in the following.
library(polmineR)
use("UNGA")
The packages magrittr, data.table and xts are used in the following. If necessary, they have to be installed and loaded first.

for (pkg in c("magrittr", "data.table", "xts", "lubridate"))
  if (!pkg %in% rownames(installed.packages())) install.packages(pkg)
library(magrittr)
library(data.table)
library(xts)

The lubridate package is also needed and installed, but it is not loaded yet in order to avoid namespace conflicts with some functions of the data.table package.

The occurrences of a single query term (argument query) are counted with the count() method.

count("UNGA", query = "refugee")
The column count indicates the absolute frequency of the term; the column freq indicates the relative frequency, which results from dividing the absolute frequency by the corpus size.

count("UNGA", query = "refugee")[["count"]] / size("UNGA")
Via the argument query, a character vector containing multiple queries can be passed.

count("UNGA", query = c("refugee", "migrant"))
queries <- c(
  "alien", "emigrant", "evacuee", "expatriate",
  "foreigner", "immigrant", "migrant", "refugee"
)
dt <- count("UNGA", query = queries)
The return value of the count() method is a data.table. This can be cast to a data.frame without loss. Afterwards, we sort it by count.

df <- as.data.frame(dt)
df <- df[order(df$count, decreasing = TRUE), ]  # sort in decreasing order
par(mar = c(8, 4, 2, 2))  # enlarge margins to leave more room for labels
barplot(height = df$count, names.arg = df$query, las = 2)
The count() method can be applied to corpora as well as to partition objects.

unga_2015 <- partition("UNGA", year = 2015)
count(unga_2015, query = "refugee")
The partition() method and the count() method can be combined in a pipe. For this, the magrittr package has to be installed and loaded. Using a pipe means chaining methods and functions via the pipe operator (%>%) and using the return value of the left-hand expression as the input of the right-hand expression.

partition("UNGA", year = 2015) %>% count(query = "refugee")
queries <- c("America", "borders", "crisis", "development", "economy", "freedom", "liberty", "markets", "wealth") par(mar = c(6,5,2,2), mfrow = c(2,2), cex = 0.6) for (us_president in c("Clinton", "Bush", "Obama", "Trump")){ dt <- partition("UNGA", speaker = us_president) %>% count(query = queries) barplot( height = dt$freq * 100000, names.arg = dt$query, # labels with query terms las = 2, # rotate labels to improve visuals main = us_president, ylab = "Count of Terms (per 100.000 Tokens)", ylim = c(0, 350) # shared scale for comparison ) }
queries <- c("America", "borders", "crisis", "development", "economy", "freedom", "liberty", "markets", "wealth") par(mar = c(6,5,2,2), mfrow = c(2,2), cex = 0.6) for (us_president in c("Clinton", "Bush", "Obama", "Trump")){ dt <- partition("UNGA", speaker = us_president) %>% count(query = queries) barplot( height = dt$freq * 100000, names.arg = dt$query, # labels with query terms las = 2, # rotate labels to improve visuals main = us_president, ylab = "Count of Terms (per 100.000 Tokens)", ylim = c(0, 350) # shared scale for comparison ) }
The syntax of the Corpus Query Processor (CQP) can also be used with the query argument of the count() method. This syntax is explained in another collection of slides. In its most basic form, CQP can be used to pass regular expressions as the query. The query term is enclosed in single quotation marks and the argument cqp is set to TRUE.

count("UNGA", query = "'refugee.*'", cqp = TRUE)  # using CQP syntax
dt <- count("UNGA", query = "'refugee.*'", cqp = TRUE, breakdown = TRUE)
DT::datatable(dt)
There are two solutions to the problem that words can occur in different inflected forms. It is possible to work with lemmatization, which can be activated via the positional attribute 'lemma'. Another possibility is the development of accurate regular expressions.

As a reminder: 'lemmatization' describes the process of reducing a word to its base form. CWB-indexed corpora - the data format used by polmineR - can contain the positional attribute 'lemma'. With the count() method, lemmatized forms are counted by assigning the value 'lemma' to the argument p_attribute.
count("UNGA", query = "refugee", p_attribute = "lemma")
queries <- c(
  asylum = "'.*asylum.*'",
  border = '"border.*"',
  migrant = '"(|e|im)migrant(|s)"',
  migration = "'.*migration.*'",
  refugee = '"refugee.*"',
  visa = "'visa'"
)
par(mar = c(6, 5, 2, 2), mfrow = c(2, 2), cex = 0.6)
for (us_president in c("Clinton", "Bush", "Obama", "Trump")) {
  partition("UNGA", speaker = us_president) %>%
    count(query = unname(queries), cqp = TRUE, p_attribute = "word") -> dt
  barplot(
    height = dt$freq * 100000,
    names.arg = names(queries),
    las = 2,
    main = us_president,
    ylim = c(0, 100)
  )
}
Counting words is a fast approach and makes it easy to produce nice-looking visualizations. Valid conclusions about seemingly basic relationships (such as linguistic variation between parties or over time) nevertheless require scientific rigour, as the examples above show.

Using the lemmatized forms of the words in a corpus can be an efficient way to capture inflected word forms as well. One problem here is that neologisms cannot necessarily be lemmatized.

One possible alternative is the diligent development of regular expressions to capture different linguistic variations. The potential of the CQP syntax was only hinted at and is explained in more detail later. Of particular interest is the possibility to capture multi-word expressions with this approach.
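That potential can be sketched with a minimal example. Note that the query below is an illustrative assumption, not taken from the examples above; within CQP, each token of a multi-word query is quoted individually.

# illustrative multi-word CQP query (assumed example, not from the slides above)
count("UNGA", query = '"refugee" "crisis"', cqp = TRUE)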
Diachronic and synchronic analyses of language use are central use cases when working with large corpora. They can be used to analyse variation in language over time (diachronic) or across other structural characteristics at a given point in time (synchronic).

The dispersion() method facilitates the efficient counting of frequencies across one or two dimensions (here: s-attributes).
dt <- dispersion("UNGA", query = "refugee", s_attribute = "year") head(dt) # just looking at the top of the table
Relative frequencies can be obtained with the dispersion() method as well.

par(mfrow = c(1, 1))
Unlike the count() method, the dispersion() method requires the argument freq to be set to TRUE in order to obtain normalized relative frequencies.

dt <- dispersion("UNGA", query = "refugee", s_attribute = "year", freq = TRUE)
Just as with count(), the return value of the dispersion() method is a data.table. A lossless conversion to a data.frame is possible.
The result of the dispersion analysis can easily be visualized as a bar plot.
barplot(
  height = dt[["freq"]] * 100000,
  names.arg = dt[["year"]],
  las = 2,
  ylab = "Hits per 100.000 Terms"
)
dt <- dispersion("UNGA", query = '"[Rr]efugee(|s)"', cqp = TRUE, s_attribute = c("year", "state_organization")) # creating the index for columns with a sum greater than 200 idx <- which(colSums(dt[,2:ncol(dt)], na.rm = TRUE) > 200) + 1 # subsetting the dt before by this index (as well as the year column) dt_min <- dt[,c(1, idx), with = FALSE] # removing the column NA and the rows for 1993 and 2018 which are only partly in the corpus dt_min <- dt_min[2:(nrow(dt_min)-1),-"NA"]
For the time series analysis, we use the xts package. We create an xts object on the basis of the table created in the previous step.

ts <- xts(
  x = dt_min[, c(2:ncol(dt_min)), with = FALSE],
  order.by = as.Date(sprintf("%s-01-01", dt_min[["year"]]))
)
head(ts)
The time series can now be visualized with plot.xts().

plot.xts(
  ts,
  multi.panel = TRUE,
  col = RColorBrewer::brewer.pal(12, "Set3"),
  lwd = 2,
  yaxs = "r"
)
par(mar = c(4, 2, 2, 2))
dt <- dispersion("UNGA", query = '"[Rr]efugee(s|)"', cqp = TRUE, s_attribute = "date")
dt <- dt[!is.na(as.Date(dt[["date"]]))]
ts <- xts(x = dt[["count"]], order.by = as.Date(dt[["date"]]))
plot(ts)
As time units larger than a single day, we want to use week, month, quarter and year. To calculate weeks, we use the lubridate package.
Now we create aggregated time series objects. The code below is deliberately condensed and not necessarily easy to understand at first glance. When in doubt, run it via copy & paste to see its effects.
ts_week <- aggregate(
  ts,
  {
    a <- lubridate::ymd(paste(lubridate::year(index(ts)), 1, 1, sep = "-"))
    lubridate::week(a) <- lubridate::week(index(ts))
    a
  }
)
ts_month <- aggregate(ts, as.Date(as.yearmon(index(ts))))
ts_qtr <- aggregate(ts, as.Date(as.yearqtr(index(ts))))
ts_year <- aggregate(ts, as.Date(sprintf("%s-01-01", gsub("^(\\d{4})-.*?$", "\\1", index(ts)))))
par(mfrow = c(2, 2), mar = c(2, 2, 3, 1))
plot(as.xts(ts_week), main = "Aggregation: Week")
plot(as.xts(ts_month), main = "Aggregation: Month")
plot(as.xts(ts_qtr), main = "Aggregation: Quarter")
plot(as.xts(ts_year), main = "Aggregation: Year")
The analysis of dispersions across different s-attributes constitutes the basis of diachronic and synchronic analyses. The focal point of these examples was time series data. Here, it is recommended to work with specialized packages such as xts or zoo.
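As a brief, hedged illustration of what such packages offer beyond base R (the rolling mean and the window size are assumptions for demonstration, not part of the workflow above), a monthly series like ts_month can be smoothed with zoo:

zoo::rollmean(ts_month, k = 3)  # three-month rolling mean of the aggregated series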
Linguistic time series data are observations which occur irregularly. While temperature can be measured daily, parliaments and assemblies do not meet every day, and newspapers are not published on Sundays or holidays. Thus, it is relevant for the analysis (and the resulting visualizations) to aggregate data in a way that accounts for this, i.e. to aggregate over larger time spans.
In diachronic analyses in particular, the possible change of meaning of terms should be considered: Does a political term mean today what it meant 20 years ago? To create valid results when counting, it will often be necessary to additionally perform some random concordance analyses to ensure that a relevant change of word meaning is not overlooked.
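Such spot checks can be performed with polmineR's kwic() method for concordances, which is covered in the next set of slides; a minimal sketch:

kwic("UNGA", query = "refugee")  # inspect concordances to check word meaning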
The count() method can be applied not only to corpora and partition objects, but also to partition_bundle objects. This might be useful in various use cases. Here, a basic recipe for dictionary-based classifications is provided. The first step is to create a partition_bundle object. We do this based on the state or organization of a speaker on a single day, for the year 2016.

unga_2016 <- partition("UNGA", year = 2016)
pb <- partition_bundle(unga_2016, s_attribute = "date")
nested <- lapply(
  pb@objects,
  function(x) partition_bundle(x, s_attribute = "state_organization", verbose = FALSE)
)
debates <- flatten(nested)
names(debates) <- paste(
  blapply(debates, function(x) s_attributes(x, "date")),
  blapply(debates, function(x) name(x)),
  sep = "_"
)
Next, a simple dictionary of search terms is defined.

dict <- c("asylum", "escaping", "fleeing", "migration", "refugee")
We count the dictionary terms in each partition of the partition_bundle and sort the resulting data.table in descending order. The partition_bundle with all debates can then be indexed by the names of those partitions whose dictionary score exceeds a certain threshold (here: 10).

dt <- count(debates, query = dict) %>%
  data.table::setorderv(cols = "TOTAL", order = -1L)
debates_mig <- debates[[ subset(dt, TOTAL >= 10)[["partition"]] ]]
The debates identified in this way can be displayed with read(), with the dictionary terms highlighted.

debates_mig[[1]] %>% read() %>% highlight(yellow = dict)
If no query is passed via the query argument of the count() method, the count is performed over the entire corpus or partition object. The argument p_attribute determines which p-attribute is counted. The return value of this operation is a count object.

p <- partition("UNGA", year = 2008)
cnt <- count(p, p_attribute = "word")
sum(cnt[["count"]]) == size(p)
Counting can also be carried out over several p-attributes at once (here: word and part-of-speech), and the result can be filtered with the subset() method.

unga_2008 <- partition("UNGA", year = 2008)
dt <- count(unga_2008, p_attribute = c("word", "pos")) %>%
  subset(pos %in% c("NN", "JJ")) %>%
  data.table::as.data.table(.) %>%
  data.table::setorderv(., cols = "count", order = -1L) %>%
  head()
The count of all tokens of a partition is the basis for several more advanced approaches, for example term extraction or the creation of term-document matrices, which can serve as input for many algorithmic text mining approaches such as topic modelling.
In the polmineR package, the as.TermDocumentMatrix() method is the default routine for preparing term-document matrices. The method can be applied to count_bundle or partition_bundle objects, as well as to a character vector identifying a corpus. For further information, see the documentation of these methods.
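The following is a minimal sketch of this workflow for a partition_bundle rather than a definitive recipe: the split by year and the enrichment with word counts via enrich() are assumptions made for illustration.

pb_years <- partition_bundle("UNGA", s_attribute = "year")  # one partition per year (assumed split)
pb_years <- enrich(pb_years, p_attribute = "word")          # add word counts to each partition
tdm <- as.TermDocumentMatrix(pb_years, col = "count")       # term-document matrix from the counts
dim(tdm)                                                    # number of terms and documents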
Counting is of fundamental importance for the analysis of corpora. These slides have shown how this is done with polmineR. The most important message is that even this seemingly basic operation can lead to invalid research if it is not carried out with scientific rigour.
To validate the results of counting, the use of concordances (which are explained in the next set of slides) can be important. The CQP syntax that has been used here is explained in more depth in a later set of slides.
# Which word forms does the lemma "refugee" subsume?
word <- get_token_stream("UNGA", p_attribute = "word")
Encoding(word) <- registry_get_encoding("UNGA")
lemma <- get_token_stream("UNGA", p_attribute = "lemma")
Encoding(lemma) <- registry_get_encoding("UNGA")
dt <- data.table::data.table(word = word, lemma = lemma)
token <- "refugee"
q <- iconv(token, from = "UTF-8", to = "latin1")  # adjust the encoding of the query to the corpus
dt2 <- dt[lemma == q]
dt2[, .N, by = .(word)]  # frequency of each word form