knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

Introduction

CollocateR is an R package that provides significance testing for words or phrases that are collocated with a keyword or phrase. Employing the 'Keywords in Context' (kwic) and tokenising functions from the Quanteda package[^fn1] and the tidy workflows (especially the unnest_tokens function) of the Tidytext package,[^fn2]

The package was built originally to support research towards O'Kelly's 'Human Rights and the Grammar of Social Responsibility', Social & Legal Studies in 2019 (early online here) but has been expanded to cover a broader range of corpus linguistics tests for collocates.

What is collocation?

Keywords in context are a sequence of words that sit in the 'neighbourhood' of a keyword or node. That is, the word or phrase sits within a window of x words of the keyword. Analysing these words and phrases -- how words and phrases 'attract and repel' each other (Barnbrook et al., 2013: 164ff)[^fn3] -- can help us understand the keywords more significantly. 'You shall', in other words, 'know a word by the company it keeps' (Firth, 1957: 11).[^fn4]

Collocation testing in essence involves the construction of two corpora: a corpus constructed of keywords in context (kwics) and the docment or corpus beyond those words. The question to answer is: is a word's presence in the corpus of kwics significant in the context of its frequency elsewhere in the corpus? This question can only be answered by employing significance measures to exclude words that recur in context because they recur everywhere.

Following Piao,[^fn6] collocations occur when keyword $x$ and collocate $y$ coincide within a certain 'neighbourhood' - generally some number of words - of each other. They may be right next to each other (so 'bigrams' like 'Prime Minister' or 'human rights') or 'skipgrams' like 'falling' and 'love' (separated by 'in').

Simply counting words will not suffice as word-frequencies are not a good guide to significant relationships between words. Some words are very common and so are likely to appear near more or less every other word.

Other words are common depending on their context. For example, a frequency count of 'environmental' in corporate reports will likely show that 'risk' is a frequent collocate. But 'risk' is the lingua franca of corporate reports and so is likely to occur near many key terms ("business risk", "health and safety risk", "human rights risk", etc etc). The frequency of a word like 'risk' may not provide us with much information about how the word 'environmental' is being put to work in the corpus: it recurs everywhere.

Signficance testing seeks to draw out the more informative relationships between words, largely by seeing whether and how a word's frequency as a collocate relate to its frequency in the corpus as a whole. How this is to be done is a matter of some debate.

The tests

The collocateR package will test significance in the following ways:

Where:

Log-Likelihood tests for the significance of differences in frequency between two corpora. Following Church and Hanks,[^fn5] log-likelhood is calculated as:

$$ \begin{aligned} Log-Likelihood{}={}xy{}\log{xy} + b{} \log{b} + c{}\log{c} + d{}\log{d} \ - (xy+b){}\log{(xy+b)} - (xy+c){}\log{(xy+c)} \ - (b+d){}\log{(b+d)} + (a+b+c+d){}\log{(a+b+c+d)} \end{aligned} $$ Pointwise Mutual information:

$$ \begin{aligned} pmi{}={}{\log{}{{prob(xy)}\over{prob(x){}prob(y)}}} \end{aligned} $$

Normalised Pointwise Mutual information:

$$ \begin{aligned} npmi{}={}{\log{}{{prob(xy)}\over{prob(x){}prob(y)}} \over{-\log_2{}{(prob(xy))}}} \end{aligned} $$

Mutual information cubed

$$ \begin{aligned} pmi{}={}\log{{prob(xy)^3}\over{prob(x)prob(y)}} \end{aligned} $$

Z score

$$ \begin{aligned} {z score{}={}}{{xy}-(({x\over{w-y}})ys)\over{}\sqrt{({x\over{w-y}})(1-{x\over{w-y}})}} \end{aligned} $$

[^fn1]: Benoit K, Watanabe K, Wang H, Nulty P, Obeng A, Müller S, Matsuo A (2018). "quanteda: An R package for the quantitative analysis of textual data." Journal of Open Source Software, 3(30), 774. doi: 10.21105/joss.00774 (URL: http://doi.org/10.21105/joss.00774), .

[^fn2]: Silge J, Robinson D (2016). "tidytext: Text Mining and Analysis Using Tidy Data Principles in R." JOSS, 1(3). doi: 10.21105/joss.00037 (URL: http://doi.org/10.21105/joss.00037), .

[^fn3]: Barnbrook G, Mason O and Krishnamurth R (2013). Collocation: Applications and Implications. London: Palgrave Macmillan.

[^fn4]: Firth JR (1957). "A synopsis of linguistic theory 1930-1955." Studies in linguistic analysis. Oxford: Blackwell.

[^fn5]: Church KW and Hanks P (1989). "Word association norms, mutual information, and lexicography." Association for Computational Linguistics, pp. 76-83. DOI: 10.3115/981623.981633.

[^fn6]: Piao SS (2002). "Word Alignment in English–Chinese Parallel Corpora" 17 Literary and Linguistic Computing 17(2), 207-230. DOI: 10.1093/llc/17.2.207.



cokelly/collocateR documentation built on May 13, 2019, 8:49 p.m.