am.score — R Documentation
Compute a wide range of established association measures (Evert 2004, 2008) for collocation analysis based on the co-occurrence frequency of two lexical items as well as their individual “marginal” frequencies. Frequency data can be provided in a number of different ways for convenience, and user-defined measures can be specified as mathematical formulae using the flexible notation introduced by Evert (2004).
am.score(w1, w2, f, f1, f2, N, measure,
span.size=1, log=FALSE, labels=FALSE,
conf.level=.95, p.adjust=TRUE,
param=list())
builtin.am
w1
a character vector specifying the first lexical item of each co-occurrence pair. If of length 1, it is assumed to be the same for all pairs. May be omitted if f1 is an unlabelled vector (see “Details” below).
w2
a character vector specifying the second lexical item of each co-occurrence pair. If of length 1, it is assumed to be the same for all pairs. May be omitted if f2 is an unlabelled vector (see “Details” below).
f
a numeric vector specifying the co-occurrence frequency of each pair of lexical items
f1
a numeric vector of first marginal frequencies, either labelled with the distinct strings from w1 or of the same length as f (see “Details” below)
f2
a numeric vector of second marginal frequencies, either labelled with the distinct strings from w2 or of the same length as f (see “Details” below)
N
a numeric vector specifying the sample size, which is usually the same for all co-occurrence pairs
measure
either a string specifying one of the built-in association measures or a function implementing a user-defined measure (see “Details” below)
span.size
for surface co-occurrence, the total size of each collocational span in tokens (e.g. span.size=6 for an L3/R3 span); see “Details” below
log
if TRUE, apply a signed log transformation to the association scores in order to deskew their distribution
labels
if TRUE, add labels to the returned vector of association scores, constructed from w1 and w2
conf.level
the desired confidence level for association measures based on confidence intervals such as LRC
p.adjust
if TRUE, apply a Bonferroni correction to control family-wise error rates for measures based on p-values or confidence intervals; the family size can also be specified explicitly as a number (see “Details” below)
param |
a list of additional parameters passed to the 'param' argument of the selected association measure, with 'conf.level' and 'p.adjust' added automatically. |
This function computes a range of standard association measures for collocation analysis, as well as user-defined measures specified by mathematical formulae. It assumes familiarity with the basic approaches of collocation analysis as explained in Evert (2008) or in Unit 4 of the SIGIL course at https://SIGIL.R-Forge.R-Project.org/, as well as with the notation for contingency tables introduced there.
While the main purpose of the function is to determine the most strongly associated items in a large data set of co-occurrence pairs, it can also be applied to the collocates of a single node word. In this case, w1 and f1 consist of only a single item (the node word and its marginal frequency). Keep in mind that the marginal frequencies f2 still need to be obtained from the entire corpus, not just from co-occurrences with the node.
Frequency signatures for all co-occurrence pairs (w_1, w_2) must be provided by the user in terms of a vector of co-occurrence frequencies f, as well as vectors of marginal frequencies f1 (for the lexical items w_1) and f2 (for the lexical items w_2). The underlying sample size N of the contingency tables also has to be specified; it is usually the same for all co-occurrence pairs.
Marginal frequencies are normally given as numeric vectors labelled with the distinct strings from w1 (for f1) and the distinct strings from w2 (for f2), respectively. These vectors are used to look up the appropriate marginal frequencies for each co-occurrence item based on w1 and w2. Alternatively, f1 and f2 can be unlabelled vectors of the same length as f (or of length 1), in which case w1 and w2 may be omitted. Because of these different invocations, it is recommended to always use named arguments to am.score(), as illustrated in the sketch below.
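As a minimal illustration of these calling conventions, consider the following sketch; all lexical items and frequency counts are invented toy data, not taken from a real corpus:

## toy data: all words and counts below are hypothetical
w1 <- c("black", "black", "white")
w2 <- c("coffee", "box", "coffee")
f  <- c(23, 7, 2)                  # co-occurrence frequencies
f1 <- c(black=420, white=337)      # labelled first marginals, looked up via w1
f2 <- c(coffee=880, box=150)       # labelled second marginals, looked up via w2
am.score(w1=w1, w2=w2, f=f, f1=f1, f2=f2, N=1e6, measure="MI", labels=TRUE)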
Contingency tables are automatically derived from the frequency signatures internally, and can be accessed by user-defined measures using the flexible and convenient notation of Evert (2004, 2008).
Evert (2008: Sec. 3) distinguishes three types of co-occurrence, which have implications for how frequency signatures and contingency tables have to be constructed. While reading the full description in Evert (2008) is strongly encouraged, concise summaries are provided here for reference.
Syntactic co-occurrence only considers pairs of lexical items in a specific syntactic relation to each other, such as adjective-noun or verb-object. Frequency data are obtained by extracting all instances of this syntactic relation from a corpus, resulting in a list of pair tokens. The sample size N is the total number of such pair tokens; the co-occurrence frequency f is obtained by counting the pairs, and the marginal frequencies f_1, f_2 by counting lexical items in the first or second element of the pairs, respectively. For a “word sketch” (or “word profile”) collecting multiple syntactic relations, the collocation analysis for each relation has to be carried out separately.
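A minimal sketch of this counting procedure, assuming a hypothetical data frame pairs of adjective-noun pair tokens (all words and counts are invented):

pairs <- data.frame(adj  = c("black", "black", "strong", "black"),
                    noun = c("coffee", "coffee", "coffee", "box"))
N  <- nrow(pairs)                       # sample size = number of pair tokens
f1 <- c(table(pairs$adj))               # marginals of the first elements
f2 <- c(table(pairs$noun))              # marginals of the second elements
cooc <- subset(as.data.frame(table(pairs)), Freq > 0)  # pair type counts
am.score(w1=as.character(cooc$adj), w2=as.character(cooc$noun), f=cooc$Freq,
         f1=f1, f2=f2, N=N, measure="simple.ll")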
Textual co-occurrence refers to co-occurrence within text segments such as sentences, paragraphs, tweets, or chapters. In this case, the sample size N corresponds to the total number of relevant text segments, and f_1, f_2, f are “document frequencies”, i.e. the number of text segments in which w_1, w_2, or both lexical items occur.
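For a single pair of items, the document frequencies can be obtained along the following lines; the tokenised segments in sents are invented toy data:

sents <- list(c("black", "coffee", "please"), c("black", "box"), "coffee")
N  <- length(sents)                             # number of text segments
has1 <- sapply(sents, function(s) "black" %in% s)
has2 <- sapply(sents, function(s) "coffee" %in% s)
am.score(f=sum(has1 & has2), f1=sum(has1), f2=sum(has2), N=N, measure="G2")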
Surface co-occurrence usually considers w_1 as the “node” of the analysis and looks for collocates w_2 that occur within a certain number of tokens around the node (the “collocational span”). The span can be symmetric, e.g. L3/R3 (three tokens to the left and right of the node), or asymmetric, e.g. L0/R5 (five tokens to the right of the node only). The span size k is the total number of tokens in each span (6 for L3/R3, and 5 for L0/R5). Evert (2008) recommends obtaining frequency signatures as follows: f from the co-occurrence counts within collocational spans, f_1 as the overall corpus frequency of each distinct node, f_2 as the overall corpus frequency of each distinct collocate, and the sample size N as the total number of tokens in the corpus. Contingency tables and expected frequencies then need to be adjusted for the span size (Evert 2008: Sec. 4.1 + 5.1), which can easily be achieved with the argument span.size=k.
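The resulting call then follows the usual pattern; this is only a sketch of the calling convention, with placeholder vectors named as in the descriptions above:

## L3/R3 span, i.e. span.size=6 (w1, w2, f, f1, f2, N as described above)
am.score(w1=w1, w2=w2, f=f, f1=f1, f2=f2, N=N, measure="G2.pv", span.size=6)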
The am.score() function includes a wide selection of built-in association measures, which can be selected via their name in the measure argument. Please refer to Evert (2004, 2008) or http://www.collocations.de/AM/ for mathematical details on these measures and their properties, as well as for full equations. Formulae shown below use the flexible notation for contingency tables introduced there.
Some measures take additional parameters specified in the param argument. Measures that return (log) p-values apply a Bonferroni correction if p.adjust=TRUE (the default). Measures based on confidence intervals use the confidence level specified in conf.level, which is also adjusted by the Bonferroni correction.
MI
(Pointwise) mutual information, the binary logarithm of the ratio between observed and expected co-occurrence frequency:
\log_2 \frac{O_{11}}{E_{11}}
Pointwise MI has a very strong bias towards pairs with low expected co-occurrence frequency (because of E_{11} in the denominator). It should usually be combined with frequency thresholds on f_1 and f_2.
MI.k
A heuristic variant of pointwise mutual information intended to counteract the low-frequency bias by raising O_{11} to the power of k:
\log_2 \frac{(O_{11})^k}{E_{11}}
The exponent k can be specified as a user parameter (e.g. param=list(k=3) for the popular MI3 measure) and defaults to k=2 otherwise.
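For instance, MI3 scores for the hypothetical toy data from the first sketch above could be obtained with:

am.score(w1=w1, w2=w2, f=f, f1=f1, f2=f2, N=1e6,
         measure="MI.k", param=list(k=3))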
G2
The G^2 statistic of a likelihood ratio test for independence of rows and columns in a contingency table, which is very popular in computational linguistics under the name log-likelihood:
\pm 2 \left( \sum_{ij} O_{ij}\cdot \log \frac{O_{ij}}{E_{ij}} \right)
This implementation computes signed association scores, which are negative iff O_{11} < E_{11}. Log-likelihood has a strong bias towards high co-occurrence frequency and often produces a highly skewed distribution of scores. It may therefore be advisable to combine it with an additional log transformation (log=TRUE).
G2.pv
The p-values corresponding to the G^2 scores of the likelihood ratio test. In order to achieve sensible scaling and to ensure that larger values correspond to higher collocability, the negative base-10 logarithm -\log_{10} p is returned, with a negative sign iff O_{11} < E_{11}. Family-wise error rates are controlled with a Bonferroni correction to the p-values if p.adjust=TRUE (the default). Note that scores above 2 correspond to a significance level of p < .01.
Fisher.pv
The p-values of a one-sided Fisher's exact test for independence of rows and columns in a contingency table, conditioned on the marginal frequencies. As usual for p-values, the negative base-10 logarithm -\log_{10} p is returned, which is always non-negative (keep in mind that the one-sided test only detects positive association). Family-wise error rates are controlled with a Bonferroni correction to the p-values if p.adjust=TRUE (the default).
simple.ll
Simple log-likelihood (Evert 2008: 1225):
\pm 2 \left( O_{11}\cdot \log \frac{O_{11}}{E_{11}} - (O_{11} - E_{11}) \right)
This measure provides a good approximation to the full log-likelihood measure (Evert 2008: 1235), but can be computed much more efficiently. Like G2, this measure computes signed association scores and has a strong bias towards high co-occurrence frequency.
t
The t-score association measure, which is popular for collocation identification in computational lexicography:
\frac{O_{11} - E_{11}}{\sqrt{O_{11}}}
T-score is known to filter out low-frequency data effectively.
X2
The X^2 statistic of Pearson's chi-squared test for independence of rows and columns in a contingency table, with Yates's correction applied:
\pm \frac{N \bigl( |O_{11}O_{22} - O_{12}O_{21}| - N/2 \bigr)^2}{R_1 R_2 C_1 C_2}
This implementation computes signed association scores, which are negative iff O_{11} < E_{11}. The formula above gives a more compact form of Yates's correction than the familiar sum over the four cells of the contingency table. See also chisq.
z
The z-score association measure, based on a normal approximation to the binomial distribution of co-occurrence by chance:
\frac{O_{11} - E_{11}}{\sqrt{E_{11}}}
Z-score has a strong bias towards pairs with low expected co-occurrence frequency (because of E_{11} in the denominator). Like pointwise MI, it should usually be combined with frequency thresholds on f_1 and f_2.
Dice
The Dice coefficient of association, which corresponds to the harmonic mean of the conditional probabilities P(w_2|w_1) and P(w_1|w_2):
\frac{2 O_{11}}{R_1 + C_1}
odds.ratio
Discounted log odds ratio, an effect-size measure that is sensitive to small marginal frequencies:
\log \frac{(O_{11} + \frac12)(O_{22} + \frac12)}{(O_{12} + \frac12)(O_{21} + \frac12)}
DP
The asymmetric Delta P measure \Delta P_{2|1} = P(w_2|w_1) - P(w_2|\neg w_1) proposed by Gries (2013: 143-144):
\frac{O_{11}}{R_1} - \frac{O_{21}}{R_2}
LRC
The conservative LogRatio (LRC) keyness measure (see keyness) can also be applied as an association measure. The implementation here always computes PositiveLRC, i.e. the lower boundary of a one-sided confidence interval for \log_2 r, in order to ensure a sensible and consistent scaling of the association score. Please keep in mind that negative scores do not necessarily indicate negative association (O_{11} < E_{11}), but rather a lack of significant evidence for a positive association. Confidence intervals are determined at confidence level conf.level and adjusted with a Bonferroni correction if p.adjust=TRUE (the default). Note that LRC is applied to the rows of the contingency table rather than its columns, in order to obtain the relative risk for w_2 given w_1.
User-defined association measures can be applied by passing a suitable function in the measure argument. This function needs to be fully vectorised and will be applied to all co-occurrence pairs in the data set. It can use any of the following arguments to access the contingency tables of observed and expected frequencies, following the notation of Evert (2008):
O, E
observed and expected co-occurrence frequency
R1, R2, C1, C2
the row and column marginals of the contingency table
N
sample size
f, f1, f2
the frequency signature of each co-occurrence pair, a different notation for f = O, f_1 = R_1 and f_2 = C_1
O11, O12, O21, O22
the contingency table of observed frequencies
E11, E12, E21, E22
the contingency table of expected frequencies
param
a list with additional user-specified parameters, always including conf.level and p.adjust
...
must be specified to ignore unused arguments
Except for param, all these arguments will be numeric vectors of the same length, and the function must return a numeric vector of the same length.
For example, the built-in MI measure could also be implemented with the user function
my.MI <- function (O, E, ...) log2(O / E)
As a matter of fact, all built-in association measures are implemented in this way. The corresponding formulae can be obtained from the list builtin.am.
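A user-defined measure is applied just like a built-in one; continuing with the hypothetical toy data from the first sketch above (and assuming that entries of builtin.am are named by measure):

am.score(w1=w1, w2=w2, f=f, f1=f1, f2=f2, N=1e6, measure=my.MI)
builtin.am$MI    # inspect the built-in implementation it replicates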
If p.adjust=TRUE, statistical inference is corrected for multiple testing in order to control family-wise error rates. This applies in particular to association measures based on confidence intervals (such as LRC) and measures that return (log) p-values (such as G2.pv). Note that the G2 scores themselves are never adjusted.
The family size m is automatically determined from the number of co-occurrence pairs processed in a single function call. Alternatively, the family size can be specified explicitly in the p.adjust argument, e.g. if a large data set is processed in multiple batches, or p.adjust=FALSE can be used to disable the correction.
For the adjustment, a highly conservative Bonferroni correction \alpha' = \alpha / m is applied to significance levels. Since the large candidate sets and sample sizes often found in corpus linguistics tend to produce large numbers of false positives, this conservative approach is considered useful.
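For example, when a candidate set is scored in several batches, the family size can be fixed across all calls. In this sketch, batch1 and batch2 are hypothetical data frames with columns f, f1, f2, and the sample size of 10^8 tokens is invented:

m  <- nrow(batch1) + nrow(batch2)   # total family size across both batches
s1 <- with(batch1, am.score(f=f, f1=f1, f2=f2, N=1e8,
                            measure="G2.pv", p.adjust=m))
s2 <- with(batch2, am.score(f=f, f1=f1, f2=f2, N=1e8,
                            measure="G2.pv", p.adjust=m))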
A numeric vector of the same length as f containing the desired association scores for all co-occurrence pairs. The vector is labelled with the corresponding co-occurrence pairs if labels=TRUE.
For all standard association measures, larger values indicate higher collocability. There is usually no meaningful scale for interpretation of the scores, though many measures return positive scores for higher than expected co-occurrence frequency and negative scores for lower than expected co-occurrence frequency.
P-values are returned on a negative log-10 scale (-\log_{10} p), so a score of 2 corresponds to p = .01 and a score of 3 to p = .001. The score for p = .05 is approximately 1.3. These cutoffs can be used to implement a significance filter for collocation candidates.
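The cutoff scores for common significance levels can be computed directly:

-log10(c(.05, .01, .001))   # cutoff scores: approx. 1.30, 2, 3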
Stephanie Evert (https://purl.org/stephanie.evert)
http://www.collocations.de/AM/
Evert, S. (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Dissertation, Institut für maschinelle Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714. Available from http://www.collocations.de/phd.html.
Evert, S. (2008). Corpora and collocations. In Lüdeling, A. and Kytö, M., editors, Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York. (extended manuscript (PDF))
Evert, S. (2022). Measuring keyness. In Digital Humanities 2022: Conference Abstracts, pages 202-205, Tokyo, Japan / online. https://osf.io/cy6mw/
Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next …. International Journal of Corpus Linguistics, 18(1):137–165.
prop.cint, which is used by some of the association measures based on confidence intervals; keyness for details on the LRC association measure (known as PositiveLRC there).
For a gentle introduction to association measures, contingency tables, and different notions of co-occurrence, see Evert (2008) or Unit 4 of the SIGIL course at https://SIGIL.R-Forge.R-Project.org/. The unit also includes a worked example carrying out several collocation analyses with real-life corpus data.
## surface collocations with L2/R2 window
head(SurfaceColloc$cooc, 10) # table of word pairs with co-occurrence frequencies
head(SurfaceColloc$f1) # tables of marginal frequencies
head(SurfaceColloc$f2)
# add association scores (MI and p-values from log-likelihood test) to data frame
# keep in mind that we need to adjust expected frequencies for the total span size of 4 tokens
colloc <- transform(
SurfaceColloc$cooc,
MI = am.score(w1, w2, f, SurfaceColloc$f1, SurfaceColloc$f2, SurfaceColloc$N,
"MI", span.size=4),
pval = am.score(w1, w2, f, SurfaceColloc$f1, SurfaceColloc$f2, SurfaceColloc$N,
"G2.pv", span.size=4)
)
colloc <- colloc[order(colloc$MI, decreasing=TRUE), ] # ordered by MI scores
colloc
# apply significance filter
subset(colloc, pval > 3) # significant positive association at p < .001
# can also obtain labelled vector of scores
with(SurfaceColloc$cooc,
am.score(w1, w2, f, SurfaceColloc$f1, SurfaceColloc$f2, SurfaceColloc$N,
"z", span.size=4, labels=TRUE))
## syntactic collocations from pre-computed frequency signatures (must specify named arguments!)
head(KrennPPV)
colloc <- transform(
KrennPPV[, 1:9], # omit pre-computed association scores
t = am.score(f=freq, f1=f.PP, f2=f.verb, N=N, measure="t"),
OR = am.score(f=freq, f1=f.PP, f2=f.verb, N=N, measure="odds.ratio"))
all.equal(colloc$t, KrennPPV$t.score) # compare with pre-computed scores
colloc <- colloc[order(colloc$t, decreasing=TRUE), ]
head(colloc, 20) # top collocates (among all pairs)
subset(colloc, verb == "legen") # top collocates of verb "legen"
## collocations of "in charge of" with LRC as an association measure
colloc <- transform(
BNCInChargeOf, # reconstruct frequency signature from O11, O12, C1, C2
f = f.in,
f1 = N.in,
f2 = f.in + f.out,
N = N.in + N.out
)
colloc <- transform(
colloc,
LRC = am.score("in charge of", collocate, f, f1, f2, N, "LRC"))
colloc <- colloc[order(colloc$LRC, decreasing=TRUE), ]
head(colloc, 20)