In corpus linguistics, the notions of keyness and keyness analysis are tied to the **key word, defined as "a word which occurs with unusual frequency in a given text [...] by comparison with a reference corpus of some kind"** (Scott, 1997). Keyness is often used in research on corpus similarity and "aboutness" (the main concepts present in a text) (Gabrielatos, 2018).
Measures that help locate key words in a corpus by comparing it to another corpus are often shared with those used to explore collocations (co-occurrence of two or more words). Corpus linguistics research on these topics relies on a number of statistical tests. Conventionally, Chi-square or log-likelihood statistics are used (Dunning, 1993; Gabrielatos, 2018).
In this framework, the occurrence of a word is compared across two corpora: a target corpus and a reference corpus.
Then, the null hypothesis that there is no difference in the distribution of the occurrences of this word between the target and reference corpora is tested. The alternative hypothesis states that there is a difference in occurrence between the corpora. The obtained test statistic is then compared to the critical value for the desired level of statistical significance, and the words above this threshold are selected as the key words of the target corpus.
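The procedure described above can be sketched in a few lines of Python. This is a minimal illustration, not the package's implementation: the function names, the example counts, and the choice of a 2x2 table (word versus all other words, target versus reference) are assumptions made for the example. The critical value 3.841 corresponds to p < 0.05 with one degree of freedom.

```python
import math

# Chi-square critical value for 1 degree of freedom at p < 0.05.
CRITICAL_VALUE_P05 = 3.841


def contingency_table(freq_target, size_target, freq_ref, size_ref):
    """Rows: the word / all other words. Columns: target / reference corpus."""
    return [
        [freq_target, freq_ref],
        [size_target - freq_target, size_ref - freq_ref],
    ]


def expected_counts(table):
    """Cell counts expected under the null hypothesis of no difference."""
    row_totals = [sum(row) for row in table]
    col_totals = [sum(col) for col in zip(*table)]
    total = sum(row_totals)
    return [[r * c / total for c in col_totals] for r in row_totals]


def chi_square(table):
    """Pearson chi-square statistic over all four cells."""
    expected = expected_counts(table)
    return sum(
        (o - e) ** 2 / e
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
    )


def log_likelihood(table):
    """Log-likelihood (G2) statistic; empty cells contribute nothing."""
    expected = expected_counts(table)
    return 2 * sum(
        o * math.log(o / e)
        for obs_row, exp_row in zip(table, expected)
        for o, e in zip(obs_row, exp_row)
        if o > 0
    )


# Hypothetical counts: 120 hits in a 100,000-token target corpus,
# 40 hits in a 200,000-token reference corpus.
table = contingency_table(120, 100_000, 40, 200_000)
chi2 = chi_square(table)
ll = log_likelihood(table)
is_key = ll > CRITICAL_VALUE_P05
print(f"chi-square = {chi2:.2f}, log-likelihood = {ll:.2f}, key word: {is_key}")
```

Some keyness tools compute the log-likelihood from the two word-frequency cells only. The four-cell version is used here so that both statistics are calculated from the same table.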
Conventionally, corpus linguistics has relied on Chi-square and log-likelihood to explore key words in corpora. However, a body of research has been re-evaluating the use of these measures as the only tool for key word identification.
While Chi-square/log-likelihood statistics do flag words which occur more frequently in the target corpus than in the reference corpus, they do not measure the "effect size", that is, the size of the observed difference in frequencies (Gabrielatos, 2018). This implies that we only have evidence to claim that a word occurs with a (statistically significantly) different frequency in the target corpus compared to the reference corpus, but we cannot say anything about the size of this difference. Inquiring about the effect size would also enable us to quantify the extent of the keyness of the word. Not only corpus linguistics but also research in the social sciences could benefit from this addition to keyness exploration.
Large inconsistencies between rankings of important words by frequency differences and by statistical significance measures have been observed (Gabrielatos and Marchi, 2011; Gabrielatos, 2018), implying that significance measures might not be effective in highlighting the most characteristic key word differences between the corpora. Moreover, while significance values are affected by the size of the corpora, effect size statistics are not, which allows results to be compared across different studies (Pojanapunya and Watson Todd, 2016).
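The dependence on corpus size can be illustrated with a short sketch. It computes the log-likelihood statistic for a hypothetical word, then repeats the calculation with every count multiplied by ten, so that the relative frequencies stay identical. The ratio of relative frequencies stands in here for an effect size measure. That choice, the function names, and the counts are assumptions made for the example.

```python
import math


def log_likelihood(freq_target, size_target, freq_ref, size_ref):
    """Log-likelihood (G2) statistic for a 2x2 word / other-words table."""
    observed = [
        [freq_target, freq_ref],
        [size_target - freq_target, size_ref - freq_ref],
    ]
    row_totals = [sum(row) for row in observed]
    col_totals = [sum(col) for col in zip(*observed)]
    total = sum(row_totals)
    stat = 0.0
    for i, row in enumerate(observed):
        for j, o in enumerate(row):
            e = row_totals[i] * col_totals[j] / total
            if o > 0:
                stat += o * math.log(o / e)
    return 2 * stat


def frequency_ratio(freq_target, size_target, freq_ref, size_ref):
    """Ratio of relative frequencies (target over reference)."""
    return (freq_target / size_target) / (freq_ref / size_ref)


# Hypothetical counts, then the same proportions in corpora ten times larger.
small = (120, 100_000, 40, 200_000)
large = tuple(10 * x for x in small)

ll_small, ll_large = log_likelihood(*small), log_likelihood(*large)
ratio_small, ratio_large = frequency_ratio(*small), frequency_ratio(*large)

print(f"log-likelihood: {ll_small:.2f} (small) vs {ll_large:.2f} (large)")
print(f"frequency ratio: {ratio_small:.2f} (small) vs {ratio_large:.2f} (large)")
```

In this sketch the significance statistic grows in proportion to corpus size while the frequency ratio stays the same.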
The conventionally used measures that test the significance of the difference in a word's occurrence between two corpora are the Chi-square and log-likelihood statistics.
Additionally, researchers have proposed using the Bayes Factor as an alternative approach to null hypothesis testing that avoids the limitations of choosing a p-value threshold.
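One way a Bayes Factor is sometimes approximated in keyness work is through the Bayesian Information Criterion (BIC), by subtracting a penalty based on the combined corpus size from the log-likelihood statistic. The sketch below assumes that approximation. Whether it matches the Bayes Factor measure meant here is an assumption, and the function name and input values are hypothetical.

```python
import math


def bic_approximation(log_likelihood, total_tokens, df=1):
    """Approximate Bayes Factor on the BIC scale: LL - df * ln(N).

    N is the combined size of the target and reference corpora.
    Positive values count as evidence against the null hypothesis,
    negative values as evidence in its favour.
    """
    return log_likelihood - df * math.log(total_tokens)


# Hypothetical inputs: two words from corpora totalling 300,000 tokens.
total_tokens = 300_000
strong_word = bic_approximation(116.2, total_tokens)
borderline_word = bic_approximation(5.0, total_tokens)

print(f"strong word: {strong_word:.2f}")
print(f"borderline word: {borderline_word:.2f}")
```

The second word has a log-likelihood above the p < 0.05 critical value of 3.841, yet its approximated value is negative. This shows how the approach can disagree with a fixed significance threshold in large corpora.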
Researchers have proposed several effect size measures which, combined with statistical significance testing, can improve our understanding of keyness.
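As a rough illustration of what effect size measures look like, the sketch below computes three quantities based on relative (normalised) frequencies: the ratio of relative frequencies, the percentage difference, and the binary logarithm of the ratio. These three measures and the function name are chosen here as examples. They are assumptions, not a record of what the package computes.

```python
import math


def effect_sizes(freq_target, size_target, freq_ref, size_ref):
    """Frequency-based effect sizes for one word (illustrative only)."""
    if freq_target == 0 or freq_ref == 0:
        # Zero frequencies need a smoothing decision that this sketch omits.
        raise ValueError("both frequencies must be greater than zero")
    rel_target = freq_target / size_target
    rel_ref = freq_ref / size_ref
    ratio = rel_target / rel_ref
    return {
        "ratio": ratio,
        "percent_diff": (rel_target - rel_ref) * 100 / rel_ref,
        "log_ratio": math.log2(ratio),
    }


# Hypothetical counts: 120 hits in 100,000 tokens vs 40 hits in 200,000 tokens.
sizes = effect_sizes(120, 100_000, 40, 200_000)
for name, value in sizes.items():
    print(f"{name}: {value:.3f}")
```

All three values depend only on relative frequencies, so multiplying every count by the same factor leaves them unchanged.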
The KeynessMeasures package provides a function that calculates these measures.