keyness | R Documentation

Description:

Compute best-practice keyness measures (according to Evert 2022) for the frequency comparison of lexical items in two corpora. The function is fully vectorised and should be applied to a complete data set of candidate items (so statistical analysis can be adjusted to control the family-wise error rate).
Usage:

keyness(f1, n1, f2, n2,
        measure=c("LRC", "PositiveLRC", "G2", "LogRatio", "SimpleMaths", "Lockwords"),
        conf.level=.95, alpha=NULL, p.adjust=TRUE, lambda=1)
Arguments:

f1
    a numeric vector specifying the frequencies of candidate items in corpus A (target corpus)

n1
    sample size of target corpus, i.e. the total number of tokens in corpus A (usually a scalar, but can also be a vector parallel to f1)

f2
    a numeric vector parallel to f1, specifying the frequencies of the candidate items in corpus B (reference corpus)

n2
    sample size of reference corpus, i.e. the total number of tokens in corpus B (usually a scalar, but can also be a vector parallel to f2)

measure
    the keyness measure to be computed (see “Details” below)

conf.level
    the desired confidence level for the confidence intervals underlying the LRC, PositiveLRC and Lockwords measures (defaults to 95%)

alpha
    if specified, filter out candidate items whose frequency difference between corpus A and corpus B is not significant at level alpha, setting their keyness scores to 0 (see “Details” below)

p.adjust
    if TRUE, apply a Bonferroni correction for multiple testing to the significance filter and confidence intervals; the family size m is determined automatically from the number of candidate items, or can be specified explicitly as a numeric value; p.adjust=FALSE disables the correction (see “Details” below)

lambda
    parameter \lambda of the SimpleMaths measure
Details:

This function computes a range of best-practice keyness measures comparing the relative frequencies \pi_1 and \pi_2 of lexical items in populations (i.e. sublanguages) A and B, based on the observed sample frequencies f_1, f_2 and the corresponding sample sizes n_1, n_2.
The function is fully vectorised with respect to the arguments f1, f2, n1 and n2, but only a single keyness measure can be selected for each function call. All implemented measures are robust for the corner cases f_1 = 0 and f_2 = 0, but f_1 = f_2 = 0 is not allowed.
Most of the keyness measures are directional, i.e. positive scores indicate positive keyness in A (\pi_1 > \pi_2) and negative scores indicate negative keyness in A (\pi_1 < \pi_2). By contrast, the one-sided measures PositiveLRC and SimpleMaths only detect positive keyness in A, returning small (and possibly negative) scores otherwise, i.e. both in case of insufficient evidence for \pi_1 > \pi_2 and in case of strong evidence for \pi_1 < \pi_2. One-sided measures can be useful for ranking the entire data set as positive keyword candidates, as illustrated below.
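As a usage sketch based on the documented interface, the calls below contrast the directional and one-sided behaviour for an item that is more frequent in corpus B (exact scores depend on the package version):

## directional vs. one-sided scores for an item with negative keyness in A
keyness(2, 1000, 7, 1000, measure="LogRatio")     # negative score (directional)
keyness(2, 1000, 7, 1000, measure="PositiveLRC")  # small or negative score (one-sided)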
Hardie (2014) and other authors recommend combining effect-size measures (in particular LogRatio) with a significance filter in order to weed out candidate items for which there is no significant evidence against the null hypothesis H_0: \pi_1 = \pi_2. Such a filter is activated by specifying the desired significance level alpha and can be combined with all keyness measures. In this case, the scores of all non-significant candidate items are set to 0. The decision is based on the likelihood-ratio test implemented by the G2 measure and its asymptotic \chi^2_1 distribution under H_0. Note that the significance filter can also be applied to the G2 measure itself, setting all scores below the critical value of the significance test to 0. For the one-sided measures (PositiveLRC and SimpleMaths), candidates with significant evidence for negative keyness are also filtered out (i.e. their scores are set to 0) in order to ensure a consistent ranking.
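For illustration, the filter decision can be reproduced by hand via the chi-squared critical value; the sketch below assumes a family of a single candidate item, so the default multiple-testing correction has no effect:

## significance filter: the LogRatio score of a non-significant item is set to 0
keyness(7, 1000, 2, 1000, measure="LogRatio", alpha=0.05)
## equivalent decision by hand, via the G2 statistic and its critical value
g2 <- keyness(7, 1000, 2, 1000, measure="G2", p.adjust=FALSE)
abs(g2) >= qchisq(1 - 0.05, df=1)  # FALSE here, so the filtered score above is 0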
By default, statistical inference corrects for multiple testing in order to control family-wise error rates. This applies to the significance filter as well as to the confidence intervals underlying LRC and PositiveLRC. Note that the G2 scores themselves are never adjusted (only the critical value for the significance filter is modified). The family size m is automatically determined from the number of candidate items processed in a single function call. Alternatively, the family size can be specified explicitly in the p.adjust argument, e.g. if a large data set is processed in multiple batches, or p.adjust=FALSE can be used to disable the correction. For the adjustment, a highly conservative Bonferroni correction \alpha' = \alpha / m is applied to significance levels. Since the large candidate sets and sample sizes often found in corpus linguistics tend to produce large numbers of false positives, this conservative approach is considered useful.
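For example, with alpha = 0.05 and a (hypothetical) family of m = 10000 candidate items, the Bonferroni correction raises the critical value of the G2 filter considerably:

alpha <- 0.05; m <- 10000    # hypothetical family size
alpha.adj <- alpha / m       # Bonferroni-adjusted level alpha' = alpha / m
qchisq(1 - alpha.adj, df=1)  # critical value approx. 20.8 instead of 3.84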
See Evert (2022) and its supplementary materials for a more detailed discussion of the implemented best-practice measures and some alternatives.
G2

The log-likelihood measure (Rayson & Garside 2000: 3) computes the score G^2 of a likelihood-ratio test for H_0: \pi_1 = \pi_2. This test is two-sided and always returns positive values, so the sign of its score is inverted for f_1 / n_1 < f_2 / n_2 in order to obtain a directional keyness measure. As a pure significance measure, it tends to prefer high-frequency candidates with large f_1.
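For illustration, the signed statistic can be computed directly from the 2x2 contingency table of the frequency comparison; the helper g2.score below is a minimal sketch of this textbook formula, not the package's internal implementation:

g2.score <- function(f1, n1, f2, n2) {
  e1 <- n1 * (f1 + f2) / (n1 + n2)  # expected frequency in corpus A under H0
  e2 <- n2 * (f1 + f2) / (n1 + n2)  # expected frequency in corpus B under H0
  ll <- function(o, e) ifelse(o > 0, o * log(o / e), 0)  # 0 * log 0 := 0
  g2 <- 2 * (ll(f1, e1) + ll(f2, e2) +
             ll(n1 - f1, n1 - e1) + ll(n2 - f2, n2 - e2))
  ifelse(f1 / n1 < f2 / n2, -g2, g2)  # invert sign for negative keyness
}
g2.score(7, 1000, 2, 1000)  # close to keyness(7, 1000, 2, 1000, measure="G2")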
LogRatio

A point estimate of the log relative risk \log_2 (\pi_1 / \pi_2), which has been suggested as an intuitive keyness measure under the name LogRatio by Hardie (2014: 45). The implementation uses Walter's (1975) adjusted estimator

    \log_2 \dfrac{f_1 + \frac12}{n_1 + \frac12} - \log_2 \dfrac{f_2 + \frac12}{n_2 + \frac12}

which is less biased and robust against f_i = 0. As a pure effect-size measure, LogRatio tends to assign spuriously high scores to low-frequency candidates with small f_1 and f_2 (due to sampling variation). Combination with a significance filter is strongly recommended.
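The adjusted estimator translates directly into R (a hypothetical helper for illustration only):

log.ratio <- function(f1, n1, f2, n2) {
  log2((f1 + 0.5) / (n1 + 0.5)) - log2((f2 + 0.5) / (n2 + 0.5))
}
log.ratio(7, 1000, 2, 1000)  # log2(7.5 / 2.5) = log2(3), about 1.58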
LRC (default)

A conservative estimate of LogRatio, recommended by Evert (2022) in order to combine and balance the advantages of effect-size and significance measures. A confidence interval (according to the specified conf.level) for the relative risk r = \pi_1 / \pi_2 is obtained from an exact conditional Poisson test (Fay 2010: 55), adjusted for multiple testing by default. If a candidate is not significant (i.e. the confidence interval includes H_0: r = 1), its score is set to 0. Otherwise, the boundary of the confidence interval closer to 1 is taken as a conservative directional estimate of r, and its \log_2 is returned.
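The construction can be sketched with stats::poisson.test, which also implements an exact conditional test for the ratio of two Poisson rates; note that the package derives its intervals via prop.cint and applies the multiple-testing adjustment described above, so this sketch is illustrative rather than an exact reimplementation:

lrc.sketch <- function(f1, n1, f2, n2, conf.level=0.95) {
  ci <- poisson.test(c(f1, f2), c(n1, n2), conf.level=conf.level)$conf.int
  if (ci[1] > 1) log2(ci[1])       # significant positive keyness: lower bound
  else if (ci[2] < 1) log2(ci[2])  # significant negative keyness: upper bound
  else 0                           # confidence interval includes r = 1
}
lrc.sketch(7, 1000, 2, 1000)  # 0: the confidence interval for r includes 1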
PositiveLRC

A one-sided variant of LRC, which returns the lower boundary of a one-sided confidence interval for \log_2 r. Scores \leq 0 indicate that there is no significant evidence for positive keyness. The directional version of LRC is recommended for general use, but PositiveLRC may be preferred if the hermeneutic interpretation should also consider non-significant candidates (especially with small data sets).
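In the same illustrative vein, the one-sided variant corresponds to the lower bound of a one-sided confidence interval (again a sketch built on stats::poisson.test, not the package's own code):

pos.lrc.sketch <- function(f1, n1, f2, n2, conf.level=0.95) {
  ci <- poisson.test(c(f1, f2), c(n1, n2), alternative="greater",
                     conf.level=conf.level)$conf.int
  log2(ci[1])  # <= 0 if there is no significant evidence for positive keyness
}
pos.lrc.sketch(7, 1000, 2, 1000)  # slightly negative here: not significant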
SimpleMaths

The simple maths keyness measure (Kilgarriff 2009) used by the commercial corpus analysis platform Sketch Engine:

    \dfrac{10^6 \cdot \frac{f_1}{n_1} + \lambda}{10^6 \cdot \frac{f_2}{n_2} + \lambda}

Its frequency bias can be adjusted with the user parameter \lambda > 0. The scaling factor 10^6 was chosen so that \lambda = 1 is a practical default value. There does not appear to be a convincing mathematical justification behind this measure; it is included here only because of the popularity of the Sketch Engine platform.
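The formula translates directly into R (hypothetical helper; the built-in measure should be preferred):

simple.maths <- function(f1, n1, f2, n2, lambda=1) {
  (1e6 * f1 / n1 + lambda) / (1e6 * f2 / n2 + lambda)  # per-million frequencies
}
simple.maths(7, 1000, 2, 1000)       # (7000 + 1) / (2000 + 1), about 3.5
simple.maths(7, 1000, 2, 1000, 100)  # larger lambda dampens low-frequency items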
Lockwords

This measure is designed for identifying so-called lockwords, i.e. words whose frequencies are remarkably stable across different corpora (Baker 2011: 73). It is based on the same confidence intervals for \log_2 r as the LRC measure. Here, the maximum of |\log_2 r| over all values inside the confidence interval is taken as a conservative upper bound for the true log ratio of the relative frequencies (which satisfies all the requirements on such a measure discussed by Hardie 2014). In line with its purpose, the Lockwords measure does not distinguish between positive and negative differences and always returns a positive value. For example, a Lockwords value of 2 means that we have good evidence that the candidate item is at most 4 times as frequent in one population as in the other; in other words, the true relative risk r = \pi_1 / \pi_2 falls into the range [\frac14, 4]. Note that the Lockwords value will be +\infty if f_1 = 0 or f_2 = 0.
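Continuing the illustrative sketches above, the Lockwords bound is simply the largest |\log_2 r| compatible with the confidence interval:

lockwords.sketch <- function(f1, n1, f2, n2, conf.level=0.95) {
  ci <- poisson.test(c(f1, f2), c(n1, n2), conf.level=conf.level)$conf.int
  max(abs(log2(ci)))  # conservative upper bound for |log2 r|
}
lockwords.sketch(7, 1000, 2, 1000)  # large value: data too sparse for a lockword
lockwords.sketch(0, 1000, 2, 1000)  # Inf, as noted above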
Value:

A numeric vector of the same length as f1 and f2, containing keyness scores for all candidate lexical items. For most measures, positive scores indicate positive keywords (i.e. higher frequency in the population underlying corpus A) and negative scores indicate negative keywords (i.e. higher frequency in the population underlying corpus B). If alpha is specified, non-significant candidates always have a score of 0.
Author(s):

Stephanie Evert (https://purl.org/stephanie.evert)
References:

Baker, P. (2011). Times may change, but we will always have money: Diachronic variation in recent British English. Journal of English Linguistics, 39(1): 65-88.

Evert, S. (2022). Measuring keyness. In Digital Humanities 2022: Conference Abstracts, pages 202-205, Tokyo, Japan / online. https://osf.io/cy6mw/

Fay, M. P. (2010). Two-sided exact tests and matching confidence intervals for discrete data. The R Journal, 2(1): 53-58.

Hardie, A. (2014). A single statistical technique for keywords, lockwords, and collocations. Internal CASS working paper no. 1, unpublished.

Kilgarriff, A. (2009). Simple maths for keywords. In Proceedings of the Corpus Linguistics 2009 Conference, Liverpool, UK.

Rayson, P. and Garside, R. (2000). Comparing corpora using frequency profiling. In Proceedings of the ACL Workshop on Comparing Corpora, pages 1-6, Hong Kong.

Walter, S. D. (1975). The distribution of Levin’s measure of attributable risk. Biometrika, 62(2): 371-374.
See Also:

prop.cint, which is used by the exact conditional Poisson test of the LRC measure.

A gentle and detailed introduction to keyness measures can be found in Unit 4 of the SIGIL course at https://SIGIL.R-Forge.R-Project.org/, which explains the intuition behind LRC visually. The unit also includes a worked example carrying out several keyword analyses with real-life corpus data, as well as visualisation in the form of “scattertext” or semantic maps.
Examples:

# compute all keyness measures for a single candidate item with f1=7, f2=2 and n1=n2=1000
keyness(7, 1000, 2, 1000, measure="G2") # log-likelihood
keyness(7, 1000, 2, 1000, measure="LogRatio")
keyness(7, 1000, 2, 1000, measure="LogRatio", alpha=0.05) # with significance filter
keyness(7, 1000, 2, 1000, measure="LRC") # the default measure
keyness(7, 1000, 2, 1000, measure="PositiveLRC")
keyness(7, 1000, 2, 1000, measure="SimpleMaths")
# a practical example: keywords of spoken British English (from BNC corpus)
n1 <- sum(BNCcomparison$spoken) # sample sizes
n2 <- sum(BNCcomparison$written)
kw <- transform(BNCcomparison,
G2 = keyness(spoken, n1, written, n2, measure="G2"),
LogRatio = keyness(spoken, n1, written, n2, measure="LogRatio"),
LRC = keyness(spoken, n1, written, n2))
kw <- kw[order(-kw$LogRatio), ]
head(kw, 20) # top LogRatio keywords
kw <- transform(kw,
Lock = keyness(spoken, n1, written, n2, measure="Lockwords"))
kw <- kw[order(kw$Lock), ] # note increasing sort
head(kw, 20) # top lockwords
# collocations of "in charge of" with LRC as an association measure
colloc <- transform(BNCInChargeOf,
PosLRC = keyness(f.in, N.in, f.out, N.out, measure="PositiveLRC"))
colloc <- colloc[order(-colloc$PosLRC), ]
head(colloc, 30)