The Art of Counting

Counting is a very basic task when working with text. Counting is essentially simple: It is nothing but making statements how often a word, or words, or linguistic patterns occurr in a corpus, or a subcorpus. However, a social scientist counting words or linguistic patterns should see counting is measuring. Counts are stable and objective, the reliability and intersubjectivity of the measurement are minor problems. Words and linguistic patterns can be observed directly, many issues associated with measuring latent variables do not arise. But counting is not trivial. Without conscious reflection, what is actually measured by counting lexical items, without an awareness that counting is measuring and needs to meet standards of validity, a simple move "from words to numbers" (...) can violate fundamental concerns of social science. Counting without methodological reflection will lead to flawed and bad science.

If done well, counting can be powerful instruments. Corpora offer efficient ways to extract information from vast amounts of text efficiently. In particular, when we move from counts of patterns in one subcorpus to the dispersion of counts in one or two dimensions, the basis for comparative statements that involve a synchronic or diachronic dimension can be generated quickly. Sometimes, simply counting single words may do. Often, we may want to count more complex linguistic patterns. The following sections will discuss the move from counts to dispersions.

The corpus used for the examples is the GermaParl corpus. Before moving on, load the polmineR library, and activate the corpus.

library(polmineR)
use("GermaParl")

The Basics of Counting

Counting Words

The method in the polmineR-package to perform counts, it is not necessarily a surprise, is called 'count'. It can be used for a variety of scenarios. To learn about the uses of the count-method, call the documentation (i.e. the help page) for the method.

?count

Count can be applied to different objects. We start by

count("GERMAPARL", query = "Zuwanderer")
auslaender <- count("GERMAPARL", query = "Ausländer")[["count"]]
zuwanderer <- count("GERMAPARL", query = "Zuwanderer")[["count"]]
einwanderer <- count("GERMAPARL", query = "Einwanderer")[["count"]]
migranten <- count("GERMAPARL", query = "Migranten")[["count"]]
gastarbeiter <- count("GERMAPARL", query = "Gastarbeiter")[["count"]]
barplot(
  c(auslaender, zuwanderer, einwanderer, migranten),
  names.arg = c("Ausländer", "Zuwanderer", "Einwanderer", "Migranten"),
  las = 2, cex.axis = 1, 
)
y <- dispersion("GERMAPARL", query = "Flüchtlinge", sAttribute = "year")
cdu <- partition("GERMAPARL", party = "CDU")
gruene <- partition("GERMAPARL", party = "GRUENE")
cnt_cdu <- count(
  cdu,
  query = c("Ausländer", "Zuwanderer", "Einwanderer", "Migranten", "Flüchtlinge")
  )
cnt_gruene <- count(
  gruene,
  query = c("Ausländer", "Zuwanderer", "Einwanderer", "Migranten", "Flüchtlinge")
  )

par(mfrow = c(1,2))
barplot(
  cnt_cdu[["count"]], names.arg = cnt_cdu[["query"]], las = 2,
  ylim = c(0,2000), main = "CDU"
  )
barplot(
  cnt_gruene[["count"]], names.arg = cnt_gruene[["query"]], las = 2,
  ylim = c(0,2000), main = "Grüne"
  )

Now we move to regular expressions.

count("GERMAPARL", '"Multikult.*"', cqp = TRUE)
count("GERMAPARL", '"Multikult.*"', cqp = TRUE, breakdown = TRUE)
sAttributes("GERMAPARL", "party")
csu <- partition("GERMAPARL", party = "CSU")
cdu <- partition("GERMAPARL", party = "CDU")
spd <- partition("GERMAPARL", party = "SPD")
gru <- partition("GERMAPARL", party = "GRUENE")

Multi-word expressions and complex queries

count("GERMAPARL", query = '"Menschen" "mit" "Migrationshintergrund"')
cnt <- count("GERMAPARL", query = '[pos = "ADJA"] "Integration"', breakdown = TRUE)
head(cnt, 10)

From Counts to Dispersion

Visualising Dispersions

Comparing Subcorpora

Time-Series Analysis



PolMine/UCSSR documentation built on June 13, 2022, 10:23 p.m.