Weighting Schemes (Matrices)

Description

Calculates a weighted document-term matrix according to the chosen local and/or global weighting scheme.

Usage

lw_tf(m)
lw_logtf(m)
lw_bintf(m)
gw_normalisation(m)
gw_idf(m)
gw_gfidf(m)
entropy(m)
gw_entropy(m)

Arguments

m

a document-term matrix.

Details

A local and a global weighting scheme can be combined and applied to a given textmatrix m via dtm = lw(m) \cdot gw(m), where

  • m is the given document-term matrix,

  • lw(m) is one of the local weight functions lw_tf(), lw_logtf(), lw_bintf(), and

  • gw(m) is one of the global weight functions gw_normalisation(), gw_idf(), gw_gfidf(), entropy(), gw_entropy().
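The combination can be sketched in base R as follows. This is an illustration only, not the package source: local_weight() and global_weight() are hypothetical stand-ins, the matrix is assumed to hold terms in rows and documents in columns, and the global weight is assumed to be one value per term.

```r
# Toy matrix: rows are terms, columns are documents (assumed orientation).
tm <- cbind(doc1 = c(2, 0, 1),
            doc2 = c(1, 1, 0),
            doc3 = c(0, 3, 1))
rownames(tm) <- c("apple", "pear", "plum")

# Hypothetical stand-ins for a local and a global weight function.
local_weight  <- function(m) log(m + 1)                           # one value per cell
global_weight <- function(m) 1 + log2(ncol(m) / rowSums(m != 0))  # one value per term

# The product is element-wise. A vector with one entry per row is recycled
# down each column, so every term is scaled by its own global weight.
dtm <- local_weight(tm) * global_weight(tm)
```

Under these assumptions the result has the same dimensions as tm, and a cell with term frequency 0 stays 0.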

This set of weighting schemes includes the local weightings (lw) raw, log, and binary, and the global weightings (gw) normalisation, two versions of the inverse document frequency (idf), and entropy, both in the original Shannon version and in a slightly modified, more common version:

lw_tf() returns a completely unmodified n \times m matrix (placebo function).

lw_logtf() returns the logarithmised n \times m matrix. log(m_{i,j}+1) is applied to every cell.

lw_bintf() returns binary values of the n \times m matrix. Every cell is assigned 1 iff the term frequency is not equal to 0, and 0 otherwise.
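The three local weightings described above amount to the following cell-wise operations. This is a base R sketch on a made-up matrix, written from the descriptions rather than taken from the package; the variable names are illustrative.

```r
# Toy term frequencies: three terms (rows) in two documents (columns).
tm <- cbind(doc1 = c(2, 0, 1),
            doc2 = c(1, 1, 0))

raw_tf <- tm             # raw: the matrix is passed through unchanged
log_tf <- log(tm + 1)    # log: log(m[i, j] + 1) in every cell
bin_tf <- (tm != 0) * 1  # binary: 1 where the frequency is nonzero, else 0
```

All three results keep the shape of the input, and zero cells stay zero.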

gw_normalisation() returns a normalised n \times m matrix. Every cell equals 1 divided by the square root of the document vector length.

gw_idf() returns the inverse document frequency in an n \times m matrix. Every cell is 1 plus the logarithm of the number of documents divided by the number of documents in which the term appears.

gw_gfidf() returns the global frequency multiplied with idf. Every cell equals the sum of the frequencies of one term divided by the number of documents in which the term appears.
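As a worked illustration of the two idf variants, the quantities can be computed by hand in base R. This sketch assumes terms in rows and documents in columns, and it assumes a base-2 logarithm, which the description does not specify; the variable names are illustrative.

```r
tm <- cbind(doc1 = c(2, 0, 1),
            doc2 = c(1, 1, 0),
            doc3 = c(0, 3, 0))
rownames(tm) <- c("apple", "pear", "plum")

n_docs <- ncol(tm)          # number of documents
df     <- rowSums(tm != 0)  # number of documents in which each term appears
gf     <- rowSums(tm)       # summed frequency of each term over all documents

idf   <- 1 + log2(n_docs / df)  # log base 2 is an assumption
gfidf <- gf / df
```

Here "plum" appears in one of three documents, so its idf is 1 + log2(3), while its gfidf is 1 / 1 = 1. A term that appears in no document has df = 0, and both ratios become non-finite.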

entropy() returns the entropy (as defined by Shannon).

gw_entropy() returns one plus entropy.
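To make the entropy weighting concrete, here is a base R sketch of the Shannon entropy of each term's distribution over documents. The helper term_entropy() is hypothetical, the logarithm base (2) is an assumption, and the package functions may scale the result differently (for example by the logarithm of the number of documents).

```r
tm <- cbind(doc1 = c(2, 0, 1),
            doc2 = c(2, 1, 0),
            doc3 = c(0, 3, 0))
rownames(tm) <- c("apple", "pear", "plum")

# Hypothetical helper: Shannon entropy (in bits) of each term's
# distribution over documents, treating 0 * log(0) as 0.
term_entropy <- function(m) {
  p <- m / rowSums(m)  # share of a term's occurrences falling in each document
  -rowSums(ifelse(p > 0, p * log2(p), 0))
}

h          <- term_entropy(tm)
one_plus_h <- 1 + h
```

In this toy matrix "apple" is split evenly over two documents, giving an entropy of 1 bit, while "plum" occurs in a single document and has entropy 0.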

Be careful when folding data into an existing LSA space: you may want to weight an additional textmatrix based on the same vocabulary with the global weights of the training data (not those of the new data).
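A minimal base R sketch of this advice follows. It assumes both matrices share the same row (term) order, and it uses a hypothetical idf-style helper, idf_weight(), with a base-2 logarithm in place of the package functions.

```r
train <- cbind(d1 = c(2, 0, 1), d2 = c(1, 1, 0), d3 = c(0, 3, 0))
new   <- cbind(q1 = c(1, 0, 2))
rownames(train) <- c("apple", "pear", "plum")
rownames(new)   <- c("apple", "pear", "plum")

# Hypothetical idf-style global weight, one value per term.
idf_weight <- function(m) 1 + log2(ncol(m) / rowSums(m != 0))

gw_train     <- idf_weight(train)       # computed once, from the training data
new_weighted <- log(new + 1) * gw_train # the new document reuses those weights
```

Computing the weights from new instead would base the document frequencies on a single document, and "pear", which does not occur in it, would receive a non-finite weight.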

Value

Returns the weighted textmatrix of the same size and format as the input matrix.

Author(s)

Fridolin Wild f.wild@open.ac.uk

References

Dumais, S. (1992) Enhancing Performance in Latent Semantic Indexing (LSI) Retrieval. Technical Report, Bellcore.

Nakov, P., Popova, A., and Mateev, P. (2001) Weight functions impact on LSA performance. In: Proceedings of the Recent Advances in Natural Language Processing, Bulgaria, pp. 187–193.

Shannon, C. (1948) A Mathematical Theory of Communication. In: The Bell System Technical Journal 27(July), pp. 379–423.

Examples

## Use the logarithmised term frequency as local weight and
## the inverse document frequency as global weight.

library(lsa)

## Each vector holds the term frequencies of one document.
vec1 <- c( 1, 1, 1, 0, 0, 0, 0, 0, 0, 0, 0, 0 )
vec2 <- c( 0, 0, 1, 1, 1, 1, 1, 0, 1, 0, 0, 0 )
vec3 <- c( 0, 1, 0, 1, 1, 0, 0, 1, 0, 0, 0, 0 )
tm <- cbind(vec1, vec2, vec3)  # named tm to avoid masking base::matrix
weighted <- lw_logtf(tm) * gw_idf(tm)