pmi: Calculate Pointwise Mutual Information (PMI).

Calculate Pointwise Mutual Information (PMI).


Calculate Pointwise Mutual Information as an information-theoretic approach to find collocations.


pmi(.Object, ...)

## S4 method for signature 'context'

## S4 method for signature 'Cooccurrences'

## S4 method for signature 'ngrams'
pmi(.Object, observed, p_attribute = p_attributes(.Object)[1])



.Object: An object.


...: Arguments methods may require.


observed: A count-object with the numbers of the observed occurrences of the tokens in the input ngrams object.


p_attribute: The positional attribute which shall be considered. Relevant only if ngrams have been calculated for more than one p-attribute.


Pointwise mutual information (PMI) is calculated as follows (see Manning/Schuetze 1999):

I(x,y) = \log_{2} \frac{p(x,y)}{p(x)\,p(y)}

The formula is based on maximum likelihood estimates: given the number of observations for token x, o_{x}, the number of observations for token y, o_{y}, and the size of the corpus N, the probabilities for the tokens x and y, and for the co-occurrence of x and y, are as follows:

p(x) = \frac{o_{x}}{N}

p(y) = \frac{o_{y}}{N}

The term p(x,y) is estimated from the number of observed co-occurrences of x and y, divided by N.
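
The calculation can be sketched in base R without any corpus. This is a minimal illustration under stated assumptions: the helper name `pmi_from_counts` and all counts are hypothetical and not part of the package, and the co-occurrence probability is taken to be the co-occurrence count divided by N, in line with the maximum likelihood estimates above.

```r
# Hypothetical helper (not part of polmineR): PMI from raw counts.
# o_xy: observed co-occurrences of x and y
# o_x, o_y: observations of token x and of token y
# N: size of the corpus
pmi_from_counts <- function(o_xy, o_x, o_y, N) {
  p_xy <- o_xy / N  # maximum likelihood estimate of p(x,y)
  p_x  <- o_x / N   # maximum likelihood estimate of p(x)
  p_y  <- o_y / N   # maximum likelihood estimate of p(y)
  log2(p_xy / (p_x * p_y))  # log base 2
}

# Invented counts, for illustration only
pmi_from_counts(o_xy = 32, o_x = 1000, o_y = 500, N = 1e6)
# approximately 6, because the ratio is 64 = 2^6
```

A value above 0 means that x and y co-occur more often than expected if they were independent; a value of 0 corresponds to independence.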

Note that the computation uses log base 2, not the natural logarithm found in some examples.


Manning, Christopher D.; Schuetze, Hinrich (1999): Foundations of Statistical Natural Language Processing. MIT Press: Cambridge, Mass., pp. 178-183.

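library(polmineR)

# Make the REUTERS sample corpus from the RcppCWB package available
use(pkg = "RcppCWB", corpus = "REUTERS")

# PMI for the cooccurrences of a query, then the same statistic computed by hand
y <- cooccurrences("REUTERS", query = "oil", method = "pmi")
N <- size(y)[["partition"]]
I <- log2((y[["count_coi"]]/N) / ((count(y) / N) * (y[["count_partition"]] / N)))

# PMI for bigrams: decode the token stream, build ngrams, count tokens
dt <- decode(
  "REUTERS", # corpus to decode (assumed: same corpus as in the other calls)
  p_attribute = "word",
  s_attribute = character(),
  to = "data.table",
  verbose = FALSE
)
n <- ngrams(dt, n = 2L, p_attribute = "word")
obs <- count("REUTERS", p_attribute = "word")
phrases <- pmi(n, observed = obs)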

