Compute feature scores for a term-document or term-term co-occurrence matrix, using one of several standard association measures. Scores can optionally be rescaled with an isotonic transformation function and centered or standardized. In addition, row vectors can be normalized to unit length wrt. a given norm.
This function has been optimized for efficiency and low memory overhead.
1 2 3 4 5 6 7
dsm.score(model, score = "frequency", sparse = TRUE, negative.ok = NA, transform = c("none", "log", "root", "sigmoid"), scale = c("none", "standardize", "center", "scale"), normalize = FALSE, method = "euclidean", p = 2, tol = 1e-6, matrix.only = FALSE, update.nnzero = FALSE, batchsize = 1e6, gc.iter = Inf)
a DSM model, i.e. an object of class
the association measure to be used for feature weighting; either a character string naming one of the built-in measures or a user-defined function (see “Details” below)
whether operations that introduce negative values into the score matrix (non-sparse association scores, standardization of columns, etc.) are allowed.
The default (
scale transformation to be applied to association scores (see “Details” below)
norm to be used with
whether to return updated DSM model (default) or only the matrix of scores (
how often to run the garbage collector when computing user-defined association scores;
Association measures (AM) for feature scoring are defined in the notation of Evert (2008). The most important symbols are O11 = O for the observed co-occurrence frequency, E11 = E for the co-occurrence frequency expected under a null hypothesis of independence, R1 for the marginal frequency of the target term, C1 for the marginal frequency of the feature term or context, and N for the sample size of the underlying corpus. Evert (2008) explains in detail how these values are computed for different types of co-occurrence; practical examples can be found in the distributional semantics tutorial at http://wordspace.collocations.de/.
Several commonly used AMs are implemented in optimized C++ code for efficiency and minimal memory overhead. They are selected by name, which is passed as a character string in the
score argument. See below for a list of built-in measures and their full equations.
Other AMs can be applied by passing a user-defined function in the
score argument. See “User-defined association measures” at the end of this section for details.
The names of the following measures can be abbreviated to a unique prefix. Equations are given in the notation of Evert (2008).
Use this association measure to operate on raw, unweighted co-occurrence frequency data.
(Pointwise) Mutual Information, a log-transformed version of the ratio between observed and expected co-occurrence frequency:
log2(O11 / E11)
Pointwise MI has a very strong bias towards pairs with low expected co-occurrence frequency (because of E11 in the denominator). It should only be applied if low-frequency targets and features have been removed from the DSM.
The sparse version of MI (with negative scores cut off at 0) is sometimes referred to as "positive pointwise Mutual Information" (PPMI) in the literature.
The G^2 statistic of a likelihood ratio test for independence of rows and columns in a contingency table, which is very popular in computational linguistics under the name log-likelihood:
± 2 * ( SUM[ij] Oij * log(Oij / Eij) )
This implementation computes signed association scores, which are negative iff O11 < E11.
Log-likelihood has a strong bias towards high co-occurrence frequency and often produces a highly skewed distribution of scores. It may therefore be advisable to combine it with an additional
Simple log-likelihood (Evert 2008, p. 1225):
± 2 * ( O11 * log(O11 / E11) - (O11 - E11) )
This measure provides a good approximation to the full log-likelihood measure (Evert 2008, p. 1235), but can be computed much more efficiently. It is also very similar to the local-MI measure used by several popular DSMs.
log-likelihood, this measure computes signed association scores and has a strong bias towards high co-occurrence frequency.
The t-score association measure, which is popular for collocation identification in computational lexicography:
(O11 - E11) / sqrt(O11)
T-score is known to filter out low-frequency data effectively. If used as a non-sparse measure, a “discounted” version with √(O + 1) in the denominator is computed.
The X^2 statistic of Pearson's chi-squared test for independence of rows and columns in a contingency table, with Yates's correction applied:
± N * (|O12 * O22 - O12 * O21| - N/2)^2 / (R1 * R2 * C1 * C2)
This implementation computes signed association scores, which are negative iff O11 < E11.
The formula above gives a more compact form of Yates's correction than the familiar sum over the four cells of the contingency table.
The z-score association measure, based on a normal approximation to the binomial distribution of co-occurrence by chance:
(O11 - E11) / sqrt(E11)
Z-score has a strong bias towards pairs with low expected co-occurrence frequency (because of E11 in the denominator). It should only be applied if low-frequency targets and features have been removed from the DSM.
The Dice coefficient of association, which corresponds to the harmonic mean of the conditional probabilities P(feature | target) and P(target | feature):
2 O11 / (R1 + C1)
Note that Dice is inherently sparse: it preserves zeroes and does not produce negative scores.
The following additional scoring functions can be selected:
The tf-idf weighting scheme popular in Information Retrieval:
O11 * log(1 / df)
where df is the relative document frequency of the corresponding feature term and should be provided as a variable
df in the model's column information. Otherwise, it is approximated by the feature's nonzero count np (variable
nnzero) divided by the number K of rows in the co-occurrence matrix:
df = (np + 1) / (K + 1)
The discounting avoids division-by-zero errors when np = 0.
Apply scale transformation, column scaling and/or row normalization to previously computed feature scores (from
model$S). This is the only
score that can be used with a DSM that does not contain raw co-occurrence frequency data.
sparse=TRUE, negative association scores are cut off at 0 in order to (i) ensure that the scored matrix is non-negative and (ii) preserve sparseness. The implementation assumes that association scores are always ≤ 0 for O11 = 0 in this case and only computes scores for nonzero entries in a sparse matrix. All built-in association measures satisfy this criterion.
Other researchers sometimes refer to such sparse scores as "positive" measures, most notably positive point-wise Mutual Information (PPMI). Since
sparse=TRUE is the default setting,
score="MI" actually computes the PPMI measure.
Non-sparse association scores can only be computed if
negative.ok=TRUE and will force a dense matrix representation. For this reason, the default is
FALSE for a sparse co-occurrence matrix and
TRUE for a dense one. A special setting
negative.ok="nonzero" is provided for those who wish to abuse
dsm.score for collocation analysis. In combination with
sparse=FALSE, it will allow negative score values, but compute them only for the nonzero entries of a sparse co-occurrence matrix. For a dense co-occurrence matrix, this setting is fully equivalent to
Association scores can be re-scaled with an isotonic transformation function that preserves sign and ranking of the scores. This is often done in order to de-skew the distribution of scores or as an approximate binarization (presence vs. absence of features). The following built-in transformations are available:
A linear transformation leaves association scores unchanged.
f(x) = x
The logarithmic transformation has a strong de-skewing effect. In order to preserve sparseness and sign of association scores, a signed and discounted version has been implemented.
f(x) = sgn(x) * log(|x| + 1)
The signed square root transformation has a mild de-skewing effect.
f(x) = sgn(x) * sqrt(|x|)
The sigmoid transformation produces a smooth binarization where negative values saturate at -1, positive values saturate at +1 and zeroes remain unchanged.
f(x) = tanh(x)
Instead of the name of a built-in AM, a function implementing a user-defined measure can be passed in the
score argument. This function will be applied to the co-occurrence matrix in batches of approximately
batchsize elements in order to limit the memory overhead incurred. A user-defined AM can be combined with any of the transformations above, and
sparse=TRUE will cut off all negative scores.
The user function can use any of following arguments to access the contingency tables of observed and expected frequencies, following the notation of Evert (2008):
observed and expected co-occurrence frequency
the row and column marginals of the contingency table
the frequency signature of a target-feature pair, a different notation for f = O, f1 = R1 and f2 = C1
the contingency table of observed frequencies
the contingency table of expected frequencies
a data frame containing information about the target items (from the
rows element of
a data frame containing information about the feature items (from the
cols element of
must be specified to ignore unused arguments
cols, all these arguments will be numeric vectors of the same lengths or scalar values (
N), and the function must return a numeric vector of the same length.
For example, the built-in Mutual Information measure could also be implemented with the user function
and tf.idf scoring could be implemented as follows, provided that the feature information table
model$cols contains a column
df with relative document frequencies:
Warning: User-defined AMs are much less efficient than the built-in measures and should only be used on large data sets if there is a good reason to do so. Increasing
batchsize may speed up the computation to some degree at the expense of bigger memory overhead.
Either an updated DSM model of class
dsm (default) or the matrix of (scaled and normalised) association scores (
Note that updating DSM models may require a substantial amount of temporary memory (because of the way memory management is implemented in R). This can be problematic when running a 32-bit build of R or when dealing with very large DSM models, so it may be better to return only the scored matrix in such cases.
Stefan Evert (http://purl.org/stefan.evert)
More information about assocation measures and the notation for contingency tables can be found at http://www.collocations.de/ and in
Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.
1 2 3 4 5 6 7 8 9 10 11 12 13 14 15 16 17 18 19 20 21 22 23 24 25 26 27 28 29 30 31 32 33 34 35
model <- DSM_TermTerm model$M # raw co-occurrence matrix model <- dsm.score(model, score="MI") round(model$S, 3) # PPMI scores model <- dsm.score(model, score="reweight", transform="sigmoid") round(model$S, 3) # additional sigmoid transformation ## user-defined scoring functions can implement additional measures, ## e.g. the conditional probability Pr(feature | target) as a percentage my.CP <- function (O11, R1, ...) 100 * O11 / R1 # "..." is mandatory model <- dsm.score(model, score=my.CP) round(model$S, 3) ## shifted PPMI (with k = 2) creates all-zero rows and columns model <- dsm.score(model, score=function (O, E, ...) log2(O / E) - 2, normalize=TRUE, update.nnzero=TRUE) round(model$S, 3) # normalization preserves all-zero rows ## use subset to remove such rows and columns m2 <- subset(model, nnzero > 0, nnzero > 0) # must have updated nnzero counts round(m2$S, 3) ## Not run: # visualization of the scale transformations implemented by dsm.score x <- seq(-2, 4, .025) plot(x, x, type="l", lwd=2, xaxs="i", yaxs="i", xlab="x", ylab="f(x)") abline(h=0, lwd=0.5); abline(v=0, lwd=0.5) lines(x, sign(x) * log(abs(x) + 1), lwd=2, col=2) lines(x, sign(x) * sqrt(abs(x)), lwd=2, col=3) lines(x, tanh(x), lwd=2, col=4) legend("topleft", inset=.05, bg="white", lwd=3, col=1:4, legend=c("none", "log", "root", "sigmoid")) ## End(Not run)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.