options("kableExtra.html.bsTable" = TRUE) # provides nicer looking tables
if (! "icon" %in% rownames(installed.packages()) ) devtools::install_github("ropenscilabs/icon")
library(kableExtra)
In the analysis of corpora we are often interested in the synchronic or the diachronic variation of language. Hence, the creation of subcorpora is central for the meaningful use of corpora in research.
in the terminology of the polmineR
package, instead of subcorpus the term partition is being used. On the one hand, this follows the conventional use of the French lexicometric tradition. On the other hand and more importantly, partition does occur both as a verb as well as a noun. In programming it is good practice to use verbs when naming methods. The partition()
method partitions a corpus and creates an object of the class partition
which contains all relevant information to describe a partition.
the creation of the partition is based on s-attributes (structural attributes). These attributes are not limited to the level of text but can also encompass passages of text (i.e. annotations, named entities, interjections in plenary protocols) underneath the text level.
which s-attributes are available in a corpus and how they are characterized can be determined with the s_attributes
method.
The collection of slides is using the polmineR
package and the UNGA
corpus. The installation was explained in more detail in an earlier set of slides.
library(polmineR) use("UNGA")
Please note: The functionalities explained here are only available in polmineR version r as.package_version("0.7.10")
or above. Install the correct version of the package accordingly.
For the following examples we additionally use the magrittr package.
library(magrittr)
partition
: Basics {.smaller}the objective of the first example is the creation of a partition which contains all speeches in the United Nations General Assembly during the year of the financial crisis in 2008.
First we determine which values for the s-attribute year
are available for the corpus.
s_attributes("UNGA", s_attribute = "year")
unga_2008 <- partition("UNGA", year = "2008")
partition
: Next Steps {.smaller}unga_2008 <- partition("UNGA", year = 2008) # identical
unga_2009_ff <- partition("UNGA", year = 2009:2013)
partition
: Next Steps {.smaller}partition()
method or the s_attributes()
method instead of functions Why? The specific characteristic of a method is that they adjust their behaviour based on the kind of object they receive as their input. To see for which classes of objects a method is defined you can access the documentation of the method via the help functionality (i.e. ?partition()
or ?s_attributes()
). s_attributes()
method we used on corpora also on partition
objects. s_attributes(unga_2008, s_attribute = "role")
unga_2008_min <- partition("UNGA", year = 2008, role = "head of state")
partition
: "Zooming" {.smaller}partition()
method, you see that the method can be used on partition
objects as well.a <- partition("UNGA", year = 2008) b <- partition(a, role = "head of state")
size()
method we used for corpora, works for partition
objects as well.size(unga_2008_min) == size(b)
partition()
method in a Pipe {.smaller}partition
above also like this:unga_2008_min <- "UNGA" %>% partition(year = 2008) %>% partition(role = "head of state")
"UNGA" %>% partition(year = 2008) %>% count(query = '"financial" "crisis"', cqp = TRUE)
partition
: "Regular Expressions" {.smaller}uk <- partition("UNGA", state_organization = "United Kingdom", regex = TRUE) s_attributes(uk, "state_organization")
grep()
functionthe combination of the methods s_attributes()
and partition()
can be used to identify relevant texts step by step. A validity check with s_attributes()
should be performed regularly during the research process to exclude unwanted hits from the analysis.
example: on which dates did the representative of the United Kingdom speak during and after the financial crisis 2008 and actually address the topic?
uk_2008_ff <- partition("UNGA", state_organization = "United Kingdom", year = 2008:2009) for (day in s_attributes(uk_2008_ff, s_attribute = "date")){ dt <- partition(uk_2008_ff, date = day) %>% count(query = '"financial" "crisis"', cqp = TRUE) cat(sprintf("%s -> N Crisis = %s\n", day, dt[["count"]])) }
read()
method.uk_fcrisis <- partition(uk_2008_ff, date = "2008-09-26")
uk_fcrisis <- partition(uk_2008_ff, date = "2008-09-26") read(uk_fcrisis)
read()
method can be converted to a character vector which can be stored.writeLines(text = as.character(read(uk_fcrisis)), con = "uk_fcrisis.html")
this file can be viewed with any regular browser. An important purpose to create such files could be the eventual import into programmes for qualitative data analysis such as MAXQDA.
note: the created html is a 'technical html' which consists of the word representation as well as the corpus positions for each individual token. These corpus positions are invisible in the regular browser but can be used to reimport annotations from programmes such as MAXQDA back into polmineR.
often it can be necessary to create a date specific partition
, for instance if the effect of events on the discourse should be investigated.
the most efficient way to achieve this is to cast the date information of the corpus into objects of the class Date
. With these, operations such as greater or smaller then can be performed or sequences can be created. In the following example we count the frequency of the term "terrorism" in the United Nations General Assembly in the year before and after nine-eleven.
pre_nineeleven <- as.Date("2000-09-11") nineten <- as.Date("2001-09-10") nineeleven <- as.Date("2001-09-11") post_nineeleven <- as.Date("2002-09-11") pre <- partition("UNGA", date = seq.Date(from = pre_nineeleven, to = nineten, by = "day")) post <- partition("UNGA", date = seq.Date(from = nineeleven, to = post_nineeleven, by = "day")) count(pre, "terrorism") count(post, "terrorism")
partition_bundle
{.smaller}multiple partition
objects can be consolidated to one partition_bundle
object. It is possible to create a partition_bundle
object using the partition_bundle
method by defining an s-attribute.
the partition_bundle()
method can be applied to both a corpus or partition
objects. For partition_bundle
objects regular R methods are available, for example summary()
in the following chunk
uk_2009 <- partition(uk_2008_ff, year = 2009) uk_2009_days <- partition_bundle(uk_2009, s_attribute = "date") summary(uk_2009_days)
partition_bundle
{.smaller}Many methods of the polmineR package are also available on partition_bundle
objects. This is also true for the count()
method we learned about earlier.
a character
vector can be used as the argument query
of the count()
method. This way, a count based on all partition
objects of a partition_bundle
can be performed with multiple query terms
count(uk_2009_days, query = c("crisis", "bank", "financial")) %>% show()
as.speeches()
to partition_bundle
{.smaller}an assumption of all examples above was, that all words in one protocol on one day are part of one single speech. However, especially prominent speakers in an assembly sometimes give multiple speeches on one day. In addition, text artefacts which are not part of the speech itself such as interjections or insertions of the chair of the chamber, should not be regarded as part of the speech.
accordingly, there is a as.speeches()
method in the polmineR package. The central heuristic of the method is that passages of speech by one single speaker can be subsumed into one speech if this speech is not interrupted by the number of tokens by another speaker which is defined by the value of the argument gap
. Per default this value is 500. This ensures that an interjection does not cause one speech to be counted as two. The return value of as.speeches()
is a partition_bundle
object.
partition
and partition_bundle
objects provide flexibility for the analysis. Keep in mind that all central methods of the polmineR package are implemented for both corpora and partition
as well as partition_bundle
objects.
partition
and partition_bundle
objects are objects of the so-called S4 system of object oriented programming in R. For objects of this class methods are defined which can be looked up in the according class documentation (?"partition-class"
and ?"partition_bundle-class"
respectively).
the S4 system which is used by polmineR for this implementation is assumed to be more complex and difficult to manage than the older S3 system. However, it ensures greater control over error messaging and thus, quality control. Further information about object oriented programming in R can be found in the online book "Advanced R" by Hadley Wickham.
after understanding the mechanisms of partition
and partition_bundle
objects we now have the basics for all further, substantially interesting approaches of analysis. In the next slides we have a look at the art of counting.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.