weight the feature frequencies in a dfm

Description

Returns a document-feature matrix (dfm) with the feature frequencies weighted according to one of several common methods.

Usage

weight(x, type, ...)

## S4 method for signature 'dfm,character'
weight(x, type = c("frequency", "relFreq",
  "relMaxFreq", "logFreq", "tfidf"), ...)

## S4 method for signature 'dfm,numeric'
weight(x, type, ...)

smoother(x, smoothing = 1)

Arguments

x

document-feature matrix created by dfm

type

a label of the weight type, or a named numeric vector of values to apply to the dfm. One of:

"frequency"

integer feature count (default when a dfm is created)

"relFreq"

each feature count as a proportion of the total feature count in the document (also known as relative frequency)

"relMaxFreq"

each feature count as a proportion of the highest feature count in the same document

"logFreq"

natural logarithm of the feature count

"tfidf"

Term-frequency * inverse document frequency. For a full explanation, see, for example, http://nlp.stanford.edu/IR-book/html/htmledition/term-frequency-and-weighting-1.html. This implementation will not return negative values. For finer-grained control, call tfidf directly.

a named numeric vector

a named numeric vector of weights to be applied to the dfm. The names of the vector correspond to feature labels of the dfm, and the weights are applied as multipliers to the existing feature counts of the corresponding named features. Any feature not named is assigned a weight of 1.0 (meaning it is left unchanged).

...

not currently used. For finer-grained control, consider calling tf or tfidf directly.

smoothing

constant added to the dfm cells for smoothing, default is 1
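
The weighting options described above can be illustrated on a small dense count matrix in base R. This is a minimal sketch, not the package's implementation: the `counts` matrix and all variable names are hypothetical, and it assumes every name in the weight vector matches a column of the matrix.

```r
# Toy document-feature counts (hypothetical data, a plain matrix rather than a dfm)
counts <- matrix(c(1, 1, 1, 0,
                   1, 2, 1, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("apple", "banana", "better", "much")))

# "relFreq": each count divided by its document's total count
rel_freq <- counts / rowSums(counts)

# "relMaxFreq": each count divided by its document's largest count
rel_max_freq <- counts / apply(counts, 1, max)

# Named numeric vector: named features get their multiplier, the rest get 1.0
# (assumes all names in w are column names of counts)
w <- c(apple = 5, banana = 3, much = 0.5)
multipliers <- setNames(rep(1, ncol(counts)), colnames(counts))
multipliers[names(w)] <- w
weighted <- sweep(counts, 2, multipliers, `*`)

# Smoothing with the default constant of 1: add it to every cell
smoothed <- counts + 1
```

In this sketch each row of `rel_freq` sums to 1, and the unnamed feature "better" keeps its original counts in `weighted`.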

Details

This converts the matrix from sparse to dense format, so it may exceed available memory if the input matrix is large.
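
To judge whether a dense copy will fit in memory, a rough estimate is the number of cells times 8 bytes, the size of a double-precision value in R. The helper below is a hypothetical illustration, not part of the package, and it ignores the small overhead of dimnames and attributes.

```r
# Approximate size in GiB of a dense numeric matrix (8 bytes per cell)
dense_size_gib <- function(ndoc, nfeature) {
  ndoc * nfeature * 8 / 2^30
}

# Hypothetical corpus: 10,000 documents and 50,000 features
dense_size_gib(10000, 50000)
```

For these hypothetical dimensions the estimate is about 3.7 GiB, regardless of how few cells are non-zero.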

Value

The dfm with weighted values.

Author(s)

Paul Nulty and Kenneth Benoit

References

Manning, Christopher D., Prabhakar Raghavan, and Hinrich Schütze. Introduction to Information Retrieval. Vol. 1. Cambridge: Cambridge University Press, 2008.

See Also

tfidf

Examples

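# create a dfm from the inaugural address corpus
dtm <- dfm(inaugCorpus)
# relative-to-maximum frequency computed by hand, for comparison with "relMaxFreq"
x <- apply(dtm, 1, function(tf) tf/max(tf))
topfeatures(dtm)
# relative frequency
normDtm <- weight(dtm, "relFreq")
topfeatures(normDtm)
# frequency relative to the most frequent feature in each document
maxTfDtm <- weight(dtm, type="relMaxFreq")
topfeatures(maxTfDtm)
# logged feature frequency
logTfDtm <- weight(dtm, type="logFreq")
topfeatures(logTfDtm)
# term frequency * inverse document frequency
tfidfDtm <- weight(dtm, type="tfidf")
topfeatures(tfidfDtm)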

# combine these methods for more complex weightings, e.g. as in Section 6.4
# of Introduction to Information Retrieval
head(logTfDtm <- weight(dtm, type="logFreq"))
head(tfidf(logTfDtm, normalize = FALSE))

# apply numeric weights
# (the text vector is named txt to avoid masking base::str)
txt <- c("apple is better than banana", "banana banana apple much better")
weights <- c(apple = 5, banana = 3, much = 0.5)
(mydfm <- dfm(txt, ignoredFeatures = stopwords("english"), verbose = FALSE))
weight(mydfm, weights)
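
The combined log-frequency and inverse-document-frequency weighting in the examples can also be sketched by hand in base R. This follows the textbook formulas from Introduction to Information Retrieval (sublinear tf scaling with a base-10 idf). It is an assumption that these match what weight and tfidf compute, since the log base and normalization used by the package may differ. The data and variable names are hypothetical.

```r
# Toy document-feature counts (hypothetical data, a plain matrix rather than a dfm)
counts <- matrix(c(1, 1, 1, 0,
                   1, 2, 1, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("apple", "banana", "better", "much")))

# Sublinear tf scaling: 1 + log(tf) where tf > 0, otherwise 0
log_tf <- ifelse(counts > 0, 1 + log(counts), 0)

# Inverse document frequency: log10(N / df), where df is the number of
# documents in which the feature occurs
n_docs <- nrow(counts)
doc_freq <- colSums(counts > 0)
idf <- log10(n_docs / doc_freq)

# Multiply each feature column by its idf
tf_idf <- sweep(log_tf, 2, idf, `*`)
tf_idf
```

Features that occur in every document ("apple", "banana", "better") have an idf of 0 here, so only "much" keeps a non-zero weight.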
