Compute feature scores for a term-document or term-term co-occurrence matrix, using one of several standard association measures. Scores can optionally be rescaled with an isotonic transformation function and centered or standardized. In addition, row vectors can be normalized to unit length with respect to a given norm.
This function has been optimized for efficiency and low memory overhead.
dsm.score(model, score = "frequency",
          sparse = TRUE, negative.ok = NA,
          transform = c("none", "log", "root", "sigmoid"),
          scale = c("none", "standardize", "center", "scale"),
          normalize = FALSE, method = "euclidean", p = 2, tol = 1e-6,
          matrix.only = FALSE, update.nnzero = FALSE,
          batchsize = 1e6, gc.iter = Inf)

model
a DSM model, i.e. an object of class dsm
score
the association measure to be used for feature weighting; either a character string naming one of the built-in measures or a user-defined function (see "Details" below)
sparse
if TRUE (the default), compute sparse non-negative association scores, cutting all negative scores off at 0 (see "Details" below)
negative.ok
whether operations that introduce negative values into the score matrix (non-sparse association scores, standardization of columns, etc.) are allowed.
The default (NA) is FALSE if the co-occurrence matrix is sparse and TRUE if it is dense; see "Details" below for the special value "nonzero".
transform
scale transformation to be applied to association scores (see "Details" below)
scale
if not "none", apply the selected column scaling: standardize columns to mean 0 and unit variance ("standardize"), center them without rescaling ("center"), or scale them to unit root mean square without centering ("scale")
normalize
if TRUE, normalize the row vectors of the scored matrix to unit length, with respect to the norm selected by method and p
method, p
norm to be used with normalize=TRUE; see rowNorms for admissible values
tol
if normalize=TRUE, row vectors whose norm falls below tol are set to all zeroes instead of being rescaled, in order to avoid numerical problems
matrix.only
whether to return the updated DSM model (default) or only the matrix of scores (matrix.only=TRUE)
update.nnzero
if TRUE and a full DSM model is returned, update the counts of nonzero entries in rows and columns according to the matrix of scores (which may have changed, especially with sparse=TRUE)
batchsize
if score is a user-defined function, the co-occurrence matrix is processed in batches of approximately batchsize entries each, in order to limit memory overhead
gc.iter
how often to run the garbage collector when computing user-defined association scores; gc() is called after every gc.iter batches in order to reclaim temporary data (the default Inf disables these additional garbage collection runs)
Association measures (AMs) for feature scoring are defined in the notation of Evert (2008). The most important symbols are O11 = O for the observed co-occurrence frequency, E11 = E for the co-occurrence frequency expected under a null hypothesis of independence, R1 for the marginal frequency of the target term, C1 for the marginal frequency of the feature term or context, and N for the sample size of the underlying corpus. Evert (2008) explains in detail how these values are computed for different types of co-occurrence; practical examples can be found in the distributional semantics tutorial at http://wordspace.collocations.de/.
Several commonly used AMs are implemented in optimized C++ code for efficiency and minimal memory overhead. They are selected by name, which is passed as a character string in the score argument. See below for a list of built-in measures and their full equations.
Other AMs can be applied by passing a user-defined function in the score argument. See "User-defined association measures" at the end of this section for details.
The names of the following measures can be abbreviated to a unique prefix. Equations are given in the notation of Evert (2008).
frequency
(default) Co-occurrence frequency:
O11
Use this association measure to operate on raw, unweighted co-occurrence frequency data.
MI
(Pointwise) Mutual Information, a log-transformed version of the ratio between observed and expected co-occurrence frequency:
log2(O11 / E11)
Pointwise MI has a very strong bias towards pairs with low expected co-occurrence frequency (because of E11 in the denominator). It should only be applied if low-frequency targets and features have been removed from the DSM.
The sparse version of MI (with negative scores cut off at 0) is sometimes referred to as "positive pointwise Mutual Information" (PPMI) in the literature.
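To make the arithmetic concrete, here is a base-R sketch (with an invented toy matrix, not part of the package) that computes sparse MI scores, i.e. PPMI, by hand from the marginal frequencies:

```r
## toy co-occurrence matrix: 2 targets (rows) x 3 features (columns)
M <- matrix(c(10, 0, 4,
               2, 6, 8), nrow=2, byrow=TRUE,
            dimnames=list(c("cat", "dog"), c("small", "bark", "pet")))
N  <- sum(M)             # sample size
R1 <- rowSums(M)         # target marginals
C1 <- colSums(M)         # feature marginals
E  <- outer(R1, C1) / N  # expected frequencies E11 = R1 * C1 / N
MI <- log2(M / E)        # pointwise MI (-Inf where O11 = 0)
PPMI <- pmax(MI, 0)      # sparse version: cut negative scores off at 0
```

Note that the cutoff also removes the -Inf entries produced by O11 = 0, so the result preserves the zeroes of M.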
log-likelihood
The G^2 statistic of a likelihood ratio test for independence of rows and columns in a contingency table, which is very popular in computational linguistics under the name log-likelihood:
± 2 * ( SUM[ij] Oij * log(Oij / Eij) )
This implementation computes signed association scores, which are negative iff O11 < E11.
Log-likelihood has a strong bias towards high co-occurrence frequency and often produces a highly skewed distribution of scores. It may therefore be advisable to combine it with an additional log transformation.
simple-ll
Simple log-likelihood (Evert 2008, p. 1225):
± 2 * ( O11 * log(O11 / E11) - (O11 - E11) )
This measure provides a good approximation to the full log-likelihood measure (Evert 2008, p. 1235), but can be computed much more efficiently. It is also very similar to the local-MI measure used by several popular DSMs.
Like log-likelihood, this measure computes signed association scores and has a strong bias towards high co-occurrence frequency.
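The quality of the approximation can be gauged on a single made-up contingency table in base R (a sketch, not the package implementation):

```r
## one target-feature pair in the notation of Evert (2008); invented counts
O11 <- 30; R1 <- 100; C1 <- 200; N <- 10000
R2 <- N - R1; C2 <- N - C1
O <- c(O11, R1 - O11, C1 - O11, N - R1 - C1 + O11)  # observed table
E <- c(R1 * C1, R1 * C2, R2 * C1, R2 * C2) / N      # expected table
G2 <- 2 * sum(O * log(O / E))                       # full log-likelihood
simple <- 2 * (O11 * log(O11 / E[1]) - (O11 - E[1]))  # simple-ll
```

For this table the two scores agree to within roughly 10%, at a fraction of the computational cost (only the O11 cell is needed).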
t-score
The t-score association measure, which is popular for collocation identification in computational lexicography:
(O11 - E11) / sqrt(O11)
The t-score measure is known to filter out low-frequency data effectively. If used as a non-sparse measure, a "discounted" version with sqrt(O11 + 1) in the denominator is computed.
chi-squared
The X^2 statistic of Pearson's chi-squared test for independence of rows and columns in a contingency table, with Yates's continuity correction applied:
± N * ( |O11 * O22 - O12 * O21| - N/2 )^2 / (R1 * R2 * C1 * C2)
This implementation computes signed association scores, which are negative iff O11 < E11.
The formula above gives a more compact form of Yates's correction than the familiar sum over the four cells of the contingency table.
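The equivalence of the compact form and the four-cell sum can be checked numerically in base R (a sketch with made-up cell counts):

```r
## toy 2x2 contingency table (invented counts)
O11 <- 30; O12 <- 70; O21 <- 170; O22 <- 9730
N  <- O11 + O12 + O21 + O22
R1 <- O11 + O12; R2 <- O21 + O22   # row marginals
C1 <- O11 + O21; C2 <- O12 + O22   # column marginals
O  <- matrix(c(O11, O12, O21, O22), 2, byrow=TRUE)
E  <- outer(c(R1, R2), c(C1, C2)) / N  # expected frequencies
## familiar four-cell form with Yates's continuity correction
chisq.cells <- sum((abs(O - E) - 0.5)^2 / E)
## compact form used in the equation above
chisq.compact <- N * (abs(O11 * O22 - O12 * O21) - N/2)^2 / (R1 * R2 * C1 * C2)
```

The two forms agree exactly because |Oij - Eij| is the same for all four cells of a 2x2 table.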
z-score
The z-score association measure, based on a normal approximation to the binomial distribution of co-occurrence by chance:
(O11 - E11) / sqrt(E11)
The z-score measure has a strong bias towards pairs with low expected co-occurrence frequency (because of E11 in the denominator). It should only be applied if low-frequency targets and features have been removed from the DSM.
Dice
The Dice coefficient of association, which corresponds to the harmonic mean of the conditional probabilities P(feature | target) and P(target | feature):
2 O11 / (R1 + C1)
Note that Dice is inherently sparse: it preserves zeroes and does not produce negative scores.
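The harmonic-mean interpretation is easy to verify in base R (a sketch with invented frequencies):

```r
## invented frequencies for one target-feature pair
O11 <- 10; R1 <- 40; C1 <- 60
p1 <- O11 / R1                    # P(feature | target)
p2 <- O11 / C1                    # P(target | feature)
hmean <- 2 * p1 * p2 / (p1 + p2)  # harmonic mean of p1 and p2
dice  <- 2 * O11 / (R1 + C1)      # Dice coefficient
```

Both expressions evaluate to the same value, as a little algebra confirms.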
The following additional scoring functions can be selected:
tf.idf
The tf-idf weighting scheme popular in Information Retrieval:
O11 * log(1 / df)
where df is the relative document frequency of the corresponding feature term, which should be provided as a variable df in the model's column information. Otherwise, it is approximated by the feature's nonzero count np (variable nnzero) divided by the number K of rows in the co-occurrence matrix:
df = (np + 1) / (K + 1)
The discounting avoids division-by-zero errors when np = 0.
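A base-R sketch of this approximation (toy data and variable names of my own choosing, not the package internals):

```r
## toy co-occurrence matrix: K = 4 targets (rows) x 3 features (columns)
M <- matrix(c(3, 0, 1,
              0, 0, 2,
              5, 1, 0,
              2, 0, 0), nrow=4, byrow=TRUE)
K  <- nrow(M)
np <- colSums(M > 0)      # nonzero count of each feature column
df <- (np + 1) / (K + 1)  # discounted relative document frequency
S  <- sweep(M, 2, log(1 / df), "*")  # tf.idf scores O11 * log(1 / df)
```

Thanks to the discounting, df stays strictly between 0 and 1 even for an all-zero column, so log(1 / df) is always finite.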
reweight
Apply a scale transformation, column scaling and/or row normalization to previously computed feature scores (from model$S). This is the only score that can be used with a DSM that does not contain raw co-occurrence frequency data.
If sparse=TRUE, negative association scores are cut off at 0 in order to (i) ensure that the scored matrix is non-negative and (ii) preserve sparseness. The implementation assumes that association scores are always ≤ 0 for O11 = 0 in this case and only computes scores for nonzero entries in a sparse matrix. All built-in association measures satisfy this criterion.
Other researchers sometimes refer to such sparse scores as "positive" measures, most notably positive pointwise Mutual Information (PPMI). Since sparse=TRUE is the default setting, score="MI" actually computes the PPMI measure.
Non-sparse association scores can only be computed if negative.ok=TRUE and will force a dense matrix representation. For this reason, the default is FALSE for a sparse co-occurrence matrix and TRUE for a dense one. A special setting negative.ok="nonzero" is provided for those who wish to abuse dsm.score for collocation analysis. In combination with sparse=FALSE, it will allow negative score values, but compute them only for the nonzero entries of a sparse co-occurrence matrix. For a dense co-occurrence matrix, this setting is fully equivalent to negative.ok=TRUE.
Association scores can be rescaled with an isotonic transformation function that preserves sign and ranking of the scores. This is often done in order to de-skew the distribution of scores or as an approximate binarization (presence vs. absence of features). The following built-in transformations are available:
none
(default) A linear transformation leaves association scores unchanged:
f(x) = x
log
The logarithmic transformation has a strong de-skewing effect. In order to preserve sparseness and the sign of association scores, a signed and discounted version has been implemented:
f(x) = sgn(x) * log(|x| + 1)
root
The signed square root transformation has a mild de-skewing effect:
f(x) = sgn(x) * sqrt(|x|)
sigmoid
The sigmoid transformation produces a smooth binarization where negative values saturate at -1, positive values saturate at +1 and zeroes remain unchanged:
f(x) = tanh(x)
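For reference, the four transformations can be written as base-R one-liners matching the equations above (a sketch; the package applies them internally in optimized code):

```r
transform.fun <- list(
  none    = function (x) x,                          # identity
  log     = function (x) sign(x) * log(abs(x) + 1),  # signed, discounted log
  root    = function (x) sign(x) * sqrt(abs(x)),     # signed square root
  sigmoid = function (x) tanh(x)                     # smooth binarization
)
x <- c(-4, -1, 0, 0.25, 9)
sapply(transform.fun, function (f) f(x))  # all preserve sign and zeroes
```

Because each function is isotonic and maps 0 to 0, sparseness and the ranking of scores are unaffected.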
Instead of the name of a built-in AM, a function implementing a user-defined measure can be passed in the score argument. This function will be applied to the co-occurrence matrix in batches of approximately batchsize elements in order to limit the memory overhead incurred. A user-defined AM can be combined with any of the transformations above, and sparse=TRUE will cut off all negative scores.
The user function can use any of the following arguments to access the contingency tables of observed and expected frequencies, following the notation of Evert (2008):
O, E
observed and expected co-occurrence frequency
R1, R2, C1, C2
the row and column marginals of the contingency table
N
sample size
f, f1, f2
the frequency signature of a target-feature pair, a different notation for f = O, f1 = R1 and f2 = C1
O11, O12, O21, O22
the contingency table of observed frequencies
E11, E12, E21, E22
the contingency table of expected frequencies
rows
a data frame containing information about the target items (from the rows element of model)
cols
a data frame containing information about the feature items (from the cols element of model)
...
must be specified to ignore unused arguments
Except for rows and cols, all these arguments will be numeric vectors of the same length or scalar values (N), and the function must return a numeric vector of the same length.
For example, the built-in Mutual Information measure could also be implemented with the user function

    my.MI <- function (O, E, ...) log2(O / E)

and tf.idf scoring could be implemented as follows, provided that the feature information table model$cols contains a column df with relative document frequencies:

    my.tfidf <- function (O11, cols, ...) O11 * log(1 / cols$df)
    dsm.score(model, score=my.tfidf)
Warning: User-defined AMs are much less efficient than the built-in measures and should only be used on large data sets if there is a good reason to do so. Increasing batchsize may speed up the computation to some degree, at the expense of bigger memory overhead.
Either an updated DSM model of class dsm (default) or the matrix of (scaled and normalised) association scores (if matrix.only=TRUE).
Note that updating DSM models may require a substantial amount of temporary memory (because of the way memory management is implemented in R). This can be problematic when running a 32-bit build of R or when dealing with very large DSM models, so it may be better to return only the scored matrix in such cases.
Stefan Evert (http://purl.org/stefan.evert)
More information about association measures and the notation for contingency tables can be found at http://www.collocations.de/ and in
Evert, Stefan (2008). Corpora and collocations. In A. Lüdeling and M. Kytö (eds.), Corpus Linguistics. An International Handbook, chapter 58, pages 1212–1248. Mouton de Gruyter, Berlin, New York.
model <- DSM_TermTerm
model$M  # raw co-occurrence matrix

model <- dsm.score(model, score="MI")
round(model$S, 3)  # PPMI scores

model <- dsm.score(model, score="reweight", transform="sigmoid")
round(model$S, 3)  # additional sigmoid transformation

## user-defined scoring functions can implement additional measures,
## e.g. the conditional probability Pr(feature | target) as a percentage
my.CP <- function (O11, R1, ...) 100 * O11 / R1  # "..." is mandatory
model <- dsm.score(model, score=my.CP)
round(model$S, 3)

## shifted PPMI (with k = 2) creates all-zero rows and columns
model <- dsm.score(model, score=function (O, E, ...) log2(O / E) - 2,
                   normalize=TRUE, update.nnzero=TRUE)
round(model$S, 3)  # normalization preserves all-zero rows

## use subset to remove such rows and columns
m2 <- subset(model, nnzero > 0, nnzero > 0)  # must have updated nnzero counts
round(m2$S, 3)

## Not run:
# visualization of the scale transformations implemented by dsm.score
x <- seq(-2, 4, .025)
plot(x, x, type="l", lwd=2, xaxs="i", yaxs="i", xlab="x", ylab="f(x)")
abline(h=0, lwd=0.5); abline(v=0, lwd=0.5)
lines(x, sign(x) * log(abs(x) + 1), lwd=2, col=2)
lines(x, sign(x) * sqrt(abs(x)), lwd=2, col=3)
lines(x, tanh(x), lwd=2, col=4)
legend("topleft", inset=.05, bg="white", lwd=3, col=1:4,
       legend=c("none", "log", "root", "sigmoid"))
## End(Not run)