am_score: Compute association scores for collocation analysis (corpora)

am.score    R Documentation

Compute association scores for collocation analysis (corpora)

Description

Compute a wide range of established association measures (Evert 2004, 2008) for collocation analysis based on the co-occurrence frequency of two lexical items as well as their individual “marginal” frequencies. Frequency data can be provided in a number of different ways for convenience, and user-defined measures can be specified as mathematical formulae using the flexible notation introduced by Evert (2004).

Usage


am.score(w1, w2, f, f1, f2, N, measure,
         span.size=1, log=FALSE, labels=FALSE,
         conf.level=.95, p.adjust=TRUE,
         param=list())

builtin.am

Arguments

w1

a character vector specifying the first lexical item of each co-occurrence pair. If of length 1, it is assumed to be the same for all pairs. May be omitted if f1 is an unnamed vector parallel to f.

w2

a character vector specifying the second lexical item of each co-occurrence pair. If of length 1, it is assumed to be the same for all pairs. May be omitted if f2 is an unnamed vector parallel to f.

f

a numeric vector specifying the co-occurrence frequency of each pair of lexical items (w_1, w_2).

f1

a numeric vector of first marginal frequencies f_1, labelled with the unique strings occurring in w1. Alternatively, an unlabelled numeric vector of the same length as f or of length 1 (implying that all pairs have the same first marginal f_1).

f2

a numeric vector of second marginal frequencies f_2, labelled with the unique strings occurring in w2. Alternatively, an unlabelled numeric vector of the same length as f or of length 1 (implying that all pairs have the same second marginal f_2).

N

a numeric vector specifying the sample size N underlying the contingency tables. It is normally a single value, which is the same for all co-occurrence pairs, but a vector of the same length as f may also be specified.

measure

either a string specifying one of the built-in association measures or a function implementing a user-defined measure (see “Details” below)

span.size

for surface co-occurrence, specify the total size of each collocational span in tokens (e.g. span.size=6 for a L3/R3 span) in order to adjust first marginal frequencies as explained by Evert (2008: Sec. 4.1 + 5.1). No adjustment should be specified if contingency tables have already been obtained according to Evert (2004: 68-70).

log

if TRUE, apply a signed log transformation h(x) = \mathop{\text{sgn}}(x)\cdot \log_2(|x| + 1) to the computed association scores in order to deskew their distribution.

labels

if TRUE, add labels to the returned vector of association scores, constructed from w1 and w2

conf.level

the desired confidence level for association measures based on confidence intervals such as LRC (defaults to 95%)

p.adjust

if TRUE, apply a Bonferroni correction in order to control the family-wise error rate across all statistical tests carried out on the data set. This parameter is used by measures based on confidence intervals and measures that return (log) p-values.

param

a list of additional parameters passed to the 'param' argument of the selected association measure, with 'conf.level' and 'p.adjust' added automatically.

Details

This function computes a range of standard association measures for collocation analysis, as well as user-defined measures specified by mathematical formulae. It assumes familiarity with the basic approaches of collocation analysis as explained in Evert (2008) or in Unit 4 of the SIGIL course at https://SIGIL.R-Forge.R-Project.org/, as well as with the notation for contingency tables introduced there.

While the main purpose of the function is to determine the most strongly associated items in a large data set of co-occurrence pairs, it can also be applied to the collocates of a single node word. In this case, w1 and f1 only consist of a single item (the node word and its marginal frequency). Keep in mind that marginal frequencies f2 still need to be obtained from the entire corpus, not just from co-occurrences with the node.
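
For illustration, a minimal call for a single node word might look as follows (a sketch with invented toy frequencies):

    ## collocates of the single node "kick"; f2 are whole-corpus frequencies
    am.score(w1="kick", w2=c("bucket", "ball"), f=c(25, 120),
             f1=c(kick=3000), f2=c(bucket=800, ball=5000),
             N=1e6, measure="G2")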

Frequency data

Frequency signatures for all co-occurrence pairs (w_1, w_2) must be provided by the user in terms of a vector of co-occurrence frequencies f, as well as vectors of marginal frequencies f1 (for the lexical items w_1) and f2 (for the lexical items w_2). The underlying sample size N of the contingency tables also has to be specified, which is usually the same for all co-occurrence pairs.

Marginal frequencies are normally given as numeric vectors labelled with the distinct strings from w1 (for f1) and the distinct strings from w2 (for f2), respectively. These vectors are used to look up the appropriate marginal frequencies for each co-occurrence item based on w1 and w2. Alternatively, f1 and f2 can be unlabelled vectors of the same length as f (or of length 1), in which case w1 and w2 may be omitted.

Because of these different invocations, it is recommended to always use named arguments to am.score().
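
The following sketch (with invented frequencies) shows the same scores computed from labelled marginal tables and from unlabelled parallel vectors:

    ## labelled marginals, looked up via w1 and w2
    am.score(w1=c("black", "black"), w2=c("coffee", "tie"), f=c(12, 4),
             f1=c(black=100), f2=c(coffee=40, tie=20), N=1e4, measure="MI")
    ## unlabelled vectors parallel to f (w1, w2 omitted)
    am.score(f=c(12, 4), f1=100, f2=c(40, 20), N=1e4, measure="MI")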

Contingency tables are automatically derived from the frequency signatures internally, and can be accessed by user-defined measures using the flexible and convenient notation of Evert (2004, 2008).

Types of co-occurrence

Evert (2008: Sec. 3) distinguishes three types of co-occurrence, which have implications for how frequency signatures and contingency tables have to be constructed. While reading the full description in Evert (2008) is strongly encouraged, concise summaries are provided here for reference.

Syntactic co-occurrence only considers pairs of lexical items in a specific syntactic relation to each other, such as adjective-noun or verb-object. Frequency data are obtained by extracting all instances of this syntactic relation from a corpus, resulting in a list of pair tokens. The sample size N is the total number of such pair tokens, while co-occurrence frequency f is obtained by counting the pairs, and marginal frequencies f_1, f_2 by counting lexical items in the first or second element of the pairs, respectively. For a “word sketch” (or “word profile”) collecting multiple syntactic relations, the collocation analysis for each relation has to be carried out separately.
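
As a sketch, frequency signatures for syntactic co-occurrence can be derived from a toy list of pair tokens as follows:

    ## toy pair tokens extracted from a (hypothetical) parsed corpus
    pairs <- data.frame(w1=c("take", "take", "run", "take"),
                        w2=c("risk", "risk", "risk", "chance"))
    N <- nrow(pairs)                         # total number of pair tokens
    cooc <- subset(as.data.frame(table(pairs), stringsAsFactors=FALSE),
                   Freq > 0)                 # co-occurrence counts f
    f1 <- c(table(pairs$w1))                 # first marginals, labelled
    f2 <- c(table(pairs$w2))                 # second marginals, labelled
    am.score(cooc$w1, cooc$w2, cooc$Freq, f1, f2, N, measure="simple.ll")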

Textual co-occurrence refers to co-occurrence within text segments such as sentences, paragraphs, tweets, or chapters. In this case, the sample size N corresponds to the total number of relevant text segments, and f_1, f_2, f are “document frequencies”, i.e. the number of text segments in which w_1, w_2, or both lexical items occur.
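
A corresponding sketch for textual co-occurrence computes document frequencies from toy tokenised segments (assuming that extra labels in the marginal vectors are simply ignored by the lookup):

    segs <- list(c("a", "b", "c"), c("a", "b"), c("b", "c"))
    N  <- length(segs)                            # number of text segments
    df <- c(table(unlist(lapply(segs, unique))))  # document frequencies
    f  <- sum(sapply(segs, function (s) all(c("a", "b") %in% s)))
    am.score("a", "b", f, f1=df, f2=df, N=N, measure="Dice")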

Surface co-occurrence usually considers w_1 as the “node” of the analysis and looks for collocates w_2 that occur within a certain number of tokens around the node (the “collocational span”). The span can be symmetric, e.g. L3/R3 (three tokens to the left and right of the node), or asymmetric, e.g. L0/R5 (five tokens to the right of the node only). The span size k is the total number of tokens in each span (6 for L3/R3, and 5 for L0/R5). Evert (2008) recommends obtaining frequency signatures as follows: f from the co-occurrence counts within collocational spans, f_1 as the overall corpus frequency of each distinct node, f_2 as the overall corpus frequency of each distinct collocate, and sample size N as the total number of tokens in the corpus. Contingency tables and expected frequencies then need to be adjusted for the span size (Evert 2008: Sec. 4.1 + 5.1), which can easily be achieved with the argument span.size=k.
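
A minimal sketch with invented frequencies for an L3/R3 span:

    ## f1, f2 are overall corpus frequencies, N the corpus size in tokens;
    ## span.size=6 adjusts expected frequencies for the L3/R3 span
    am.score("node", c("left", "right"), f=c(10, 7),
             f1=c(node=250), f2=c(left=3000, right=1200),
             N=1e6, measure="z", span.size=6)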

Built-in association measures

The am.score() function includes a wide selection of built-in association measures, which can be selected via their name in the measure argument. Please refer to Evert (2004, 2008) or http://www.collocations.de/AM/ for mathematical details on these measures and their properties as well as full equations. Formulae shown below use the flexible notation for contingency tables introduced there.

Some measures take additional parameters specified in the param argument. Measures that return (log) p-values apply a Bonferroni correction if p.adjust=TRUE (the default). Measures based on confidence intervals use the confidence level specified in conf.level, which is also adjusted by the Bonferroni correction.

MI

(Pointwise) mutual information, the binary logarithm of the ratio between observed and expected co-occurrence frequency:

\log_2 \frac{O_{11}}{E_{11}}

Pointwise MI has a very strong bias towards pairs with low expected co-occurrence frequency (because of E_{11} in the denominator). It should usually be combined with frequency thresholds on f_1 and f_2.
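
Since E_{11} = f_1 f_2 / N, the measure is easily verified by hand (toy numbers):

    f <- 30; f1 <- 500; f2 <- 1000; N <- 1e6
    E11 <- f1 * f2 / N   # expected co-occurrence frequency = 0.5
    log2(f / E11)        # = log2(60), the same as measure="MI"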

MI.k

A heuristic variant of pointwise mutual information intended to counteract the low-frequency bias by raising O_{11} to the power k:

\log_2 \frac{(O_{11})^k}{E_{11}}

The exponent k can be specified as a user parameter (e.g. param=list(k=3) for the popular MI3 measure) and defaults to k=2 otherwise.
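
For example:

    ## the popular MI3 measure (toy frequencies)
    am.score(f=30, f1=500, f2=1000, N=1e6, measure="MI.k", param=list(k=3))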

G2

The G^2 statistic of a likelihood ratio test for independence of rows and columns in a contingency table, which is very popular in computational linguistics under the name log-likelihood:

\pm 2 \left( \sum_{ij} O_{ij}\cdot \log \frac{O_{ij}}{E_{ij}} \right)

This implementation computes signed association scores, which are negative iff O_{11} < E_{11}. Log-likelihood has a strong bias towards high co-occurrence frequency and often produces a highly skewed distribution of scores. It may therefore be advisable to combine it with an additional log transformation (log=TRUE).

G2.pv

The p-values corresponding to the G^2 scores of the likelihood ratio test. In order to achieve sensible scaling and ensure that larger values correspond to higher collocability, the negative base-10 logarithm -\log_{10} p is returned, with negative sign iff O_{11} < E_{11}. Family-wise error rates are controlled with a Bonferroni correction to the p-values if p.adjust=TRUE (the default). Note that scores above 2 correspond to a significance level p < .01.

Fisher.pv

The p-values of a one-sided Fisher's exact test for independence of rows and columns in a contingency table, conditioned on the marginal frequencies. As usual for p-values, the negative base-10 logarithm -\log_{10} p is returned, which is always non-negative (keep in mind that the one-sided test only detects positive association). Family-wise error rates are controlled with a Bonferroni correction to the p-values if p.adjust=TRUE (the default).

simple.ll

Simple log-likelihood (Evert 2008: 1225):

\pm 2 \left( O_{11}\cdot \log \frac{O_{11}}{E_{11}} - (O_{11} - E_{11}) \right)

This measure provides a good approximation to the full log-likelihood measure (Evert 2008: 1235), but can be computed much more efficiently. Like G2, this measure computes signed association scores and has a strong bias towards high co-occurrence frequency.

t

The t-score association measure, which is popular for collocation identification in computational lexicography:

\frac{O_{11} - E_{11}}{\sqrt{O_{11}}}

T-score is known to filter out low-frequency data effectively.

X2

The X^2 statistic of Pearson's chi-squared test for independence of rows and columns in a contingency table, with Yates's correction applied:

\pm \frac{ N \bigl( | O_{11}O_{22} - O_{12} O_{21} | - N/2 \bigr)^2 }{ R_1 R_2 C_1 C_2 }

This implementation computes signed association scores, which are negative iff O_{11} < E_{11}. The formula above gives a more compact form of Yates's correction than the familiar sum over the four cells of the contingency table. See also chisq.

z

The z-score association measure, based on a normal approximation to the binomial distribution of co-occurrence by chance:

\frac{O_{11} - E_{11}}{\sqrt{E_{11}}}

Z-score has a strong bias towards pairs with low expected co-occurrence frequency (because of E_{11} in the denominator). Like pointwise MI, it should usually be combined with frequency thresholds on f_1 and f_2.

Dice

The Dice coefficient of association, which corresponds to the harmonic mean of the conditional probabilities P(w_2|w_1) and P(w_1|w_2):

\frac{2 O_{11}}{R_1 + C_1}

odds.ratio

Discounted log odds ratio, an effect-size measure that is sensitive to small marginal frequencies:

\log \frac{(O_{11} + \frac12)(O_{22} + \frac12)}{(O_{12} + \frac12)(O_{21} + \frac12)}

DP

The asymmetric Delta P measure \Delta P_{2|1} = P(w_2|w_1) - P(w_2|\neg w_1) proposed by Gries (2013: 143-144):

\frac{O_{11}}{R_1} - \frac{O_{21}}{R_2}

LRC

The conservative LogRatio (LRC) keyness measure (see keyness) can also be applied as an association measure. The implementation here always computes PositiveLRC, i.e. the lower boundary of a one-sided confidence interval for \log_2 r, in order to ensure a sensible and consistent scaling of the association score. Please keep in mind that negative scores do not necessarily indicate negative association (O_{11} < E_{11}), but rather lack of significant evidence for a positive association.

Confidence intervals are determined at confidence level conf.level and adjusted with a Bonferroni correction if p.adjust=TRUE (the default). Note that LRC is applied to the rows of the contingency table rather than its columns, in order to obtain relative risk for w_2 given w_1.
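
For example, a stricter confidence level can be requested as follows (toy frequencies):

    am.score(f=30, f1=500, f2=1000, N=1e6, measure="LRC", conf.level=.99)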

User-defined association measures

User-defined association measures can be applied by passing a suitable function in the measure argument. This function needs to be fully vectorised and will be applied to all co-occurrence pairs in the data set.

It can use any of the following arguments to access the contingency tables of observed and expected frequencies, following the notation of Evert (2008):

O, E

observed and expected co-occurrence frequency

R1, R2, C1, C2

the row and column marginals of the contingency table

N

sample size

f, f1, f2

the frequency signature of each co-occurrence pair, a different notation for f = O, f_1 = R_1 and f_2 = C_1

O11, O12, O21, O22

the contingency table of observed frequencies

E11, E12, E21, E22

the contingency table of expected frequencies

param

a list with additional user-specified parameters, always including conf.level and p.adjust

...

must be specified to ignore unused arguments

Except for param, all these arguments will be numeric vectors of the same length, and the function must return a numeric vector of the same length.

For example, the built-in MI measure could also be implemented with the user function

    my.MI <- function (O, E, ...) log2(O / E) 

As a matter of fact, all built-in association measures are implemented in this way. The corresponding formulae can be obtained from the list builtin.am.
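
Such a function is simply passed in the measure argument:

    am.score(f=30, f1=500, f2=1000, N=1e6, measure=my.MI)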

Bonferroni correction

If p.adjust=TRUE, statistical inference is corrected for multiple testing in order to control family-wise error rates. This applies in particular to association measures based on confidence intervals (such as LRC) and measures that return (log) p-values (such as G2.pv). Note that the G2 scores themselves are never adjusted.

Family size m is automatically determined from the number of co-occurrence pairs processed in a single function call. Alternatively, the family size can be specified explicitly in the p.adjust argument, e.g. if a large data set is processed in multiple batches, or p.adjust=FALSE can be used to disable the correction.

For the adjustment, a highly conservative Bonferroni correction \alpha' = \alpha / m is applied to significance levels. Since the large candidate sets and sample sizes typical of corpus linguistics tend to produce many false positives, this conservative approach is considered useful.
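
The batch scenario might be handled as in the following sketch, assuming two hypothetical data frames batch1 and batch2 (with columns w1, w2, f) as well as suitable marginal tables f1, f2 and sample size N:

    m <- nrow(batch1) + nrow(batch2)  # shared family size across batches
    pv1 <- am.score(batch1$w1, batch1$w2, batch1$f, f1, f2, N,
                    measure="G2.pv", p.adjust=m)
    pv2 <- am.score(batch2$w1, batch2$w2, batch2$f, f1, f2, N,
                    measure="G2.pv", p.adjust=m)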

Value

A numeric vector of the same length as f containing the desired association scores for all co-occurrence pairs. The vector is labelled with the corresponding co-occurrence pairs if labels=TRUE.

For all standard association measures, larger values indicate higher collocability. There is usually no meaningful scale for interpretation of the scores, though many measures return positive scores for higher than expected co-occurrence frequency and negative scores for lower than expected co-occurrence frequency.

P-values are returned on a negative log-10 scale (-\log_{10} p), so a score of 2 corresponds to p = .01 and a score of 3 to p = .001. The score for p = .05 is approximately 1.3. These cutoffs can be used to implement a significance filter for collocation candidates.
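
The conversion between the two scales is straightforward:

    10^(-c(1.3, 2, 3))  # approx. .05, .01, .001
    -log10(.05)         # approx. 1.3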

Author(s)

Stephanie Evert (https://purl.org/stephanie.evert)

References

http://www.collocations.de/AM/

Evert, S. (2004). The Statistics of Word Cooccurrences: Word Pairs and Collocations. Dissertation, Institut für maschinelle Sprachverarbeitung, University of Stuttgart. Published in 2005, URN urn:nbn:de:bsz:93-opus-23714. Available from http://www.collocations.de/phd.html.

Evert, S. (2008). Corpora and collocations. In Lüdeling, A. and Kytö, M., editors, Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York. (Extended manuscript available as PDF.)

Evert, S. (2022). Measuring keyness. In Digital Humanities 2022: Conference Abstracts, pages 202-205, Tokyo, Japan / online. https://osf.io/cy6mw/

Gries, S. T. (2013). 50-something years of work on collocations: What is or should be next …. International Journal of Corpus Linguistics, 18(1):137–165.

See Also

prop.cint, which is used by some of the association measures based on confidence intervals; keyness for details on the LRC association measure (known as PositiveLRC there).

For a gentle introduction to association measures, contingency tables, and different notions of co-occurrence see Evert (2008) or Unit 4 of the SIGIL course at https://SIGIL.R-Forge.R-Project.org/. The unit also includes a worked example carrying out several collocation analyses with real-life corpus data.

Examples

## surface collocations with L2/R2 window
head(SurfaceColloc$cooc, 10) # table of word pairs with co-occurrence frequencies
head(SurfaceColloc$f1) # tables of marginal frequencies
head(SurfaceColloc$f2)

# add association scores (MI and p-values from log-likelihood test) to data frame
# keep in mind that we need to adjust expected frequencies for the total span size of 4 tokens
colloc <- transform(
  SurfaceColloc$cooc,
  MI = am.score(w1, w2, f, SurfaceColloc$f1, SurfaceColloc$f2, SurfaceColloc$N, 
                "MI", span.size=4),
  pval = am.score(w1, w2, f, SurfaceColloc$f1, SurfaceColloc$f2, SurfaceColloc$N, 
                  "G2.pv", span.size=4)
)
colloc <- colloc[order(colloc$MI, decreasing=TRUE), ] # ordered by MI scores
colloc

# apply significance filter
subset(colloc, pval > 3) # significant positive association at p < .001

# can also obtain labelled vector of scores
with(SurfaceColloc$cooc,
     am.score(w1, w2, f, SurfaceColloc$f1, SurfaceColloc$f2, SurfaceColloc$N, 
              "z", span.size=4, labels=TRUE))


## syntactic collocations from pre-computed frequency signatures (must specify named arguments!)
head(KrennPPV)
colloc <- transform(
  KrennPPV[, 1:9], # omit pre-computed association scores
  t = am.score(f=freq, f1=f.PP, f2=f.verb, N=N, measure="t"),
  OR = am.score(f=freq, f1=f.PP, f2=f.verb, N=N, measure="odds.ratio"))

all.equal(colloc$t, KrennPPV$t.score) # compare with pre-computed scores

colloc <- colloc[order(colloc$t, decreasing=TRUE), ]
head(colloc, 20) # top collocates (among all pairs)
subset(colloc, verb == "legen") # top collocates of verb "legen"


## collocations of "in charge of" with LRC as an association measure
colloc <- transform(
  BNCInChargeOf, # reconstruct frequency signature from f.in, f.out, N.in, N.out
  f = f.in,
  f1 = N.in,
  f2 = f.in + f.out,
  N = N.in + N.out
)
colloc <- transform(
  colloc,
  LRC = am.score("in charge of", collocate, f, f1, f2, N, "LRC"))
colloc <- colloc[order(colloc$LRC, decreasing=TRUE), ]
head(colloc, 20)
