dfm_weight: Weight the feature frequencies in a dfm
In koheiw/quanteda.core: Quantitative Analysis of Textual Data


Weight the feature frequencies in a dfm

dfm_weight(
  x,
  scheme = c("count", "prop", "propmax", "logcount", "boolean", "augmented", "logave",
    "logsmooth"),
  weights = NULL,
  base = 10,
  k = 0.5,
  smoothing = 0.5,
  force = FALSE
)

dfm_smooth(x, smoothing = 1)

`x`	document-feature matrix created by dfm
`scheme`	a label of the weight type:
    `count` tf_{ij}, an integer feature count (default when a dfm is created)
    `prop` the proportion of the feature counts of total feature counts (aka relative frequency), calculated as tf_{ij} / \sum_j tf_{ij}
    `propmax` the proportion of the feature counts of the highest feature count in a document, tf_{ij} / \textrm{max}_j tf_{ij}
    `logcount` 1 + the logarithm of each count, for the given base, or 0 if the count was zero: 1 + \textrm{log}_{base}(tf_{ij}) if tf_{ij} > 0, or 0 otherwise
    `boolean` recode all non-zero counts as 1
    `augmented` equivalent to k + (1 - k) * `dfm_weight(x, "propmax")`
    `logave` (1 + the log of the counts) / (1 + log of the average count within document), or \frac{1 + \textrm{log}_{base}(tf_{ij})}{1 + \textrm{log}_{base}(\sum_j tf_{ij} / N_i)}
    `logsmooth` the log of (the counts + `smoothing`), or \textrm{log}(tf_{ij} + s)
`weights`	if `scheme` is unused, then `weights` can be a named numeric vector of weights to be applied to the dfm, where the names of the vector correspond to feature labels of the dfm, and the weights will be applied as multipliers to the existing feature counts for the corresponding named features. Any features not named will be assigned a weight of 1.0 (meaning they will be unchanged).
`base`	base for the logarithm when `scheme` is `"logcount"` or `"logave"`
`k`	the k for the augmentation when `scheme = "augmented"`
`smoothing`	constant added to the dfm cells for smoothing, default is 1 for `dfm_smooth()` and 0.5 for `dfm_weight()`
`force`	logical; if `TRUE`, apply weighting scheme even if the dfm has been weighted before. This can result in invalid weights, such as weighting by `"prop"` after applying `"logcount"`, or after having grouped a dfm using `dfm_group()`.
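
As an illustration of the `scheme` formulas listed above, the following base R sketch applies several of them to a small hand-made count matrix. It does not call the package. The matrix `tf`, the helper `on_nonzero()`, and the reading of N_i as the number of features with a non-zero count in document i are assumptions made for this example only, and `logsmooth` is left out.

```r
# Toy document-feature counts: 2 documents (rows) x 3 features (columns).
tf <- matrix(c(1, 2, 0,
               4, 0, 4),
             nrow = 2, byrow = TRUE,
             dimnames = list(c("d1", "d2"), c("a", "b", "c")))

base <- 10   # same default as in the usage above
k    <- 0.5  # same default as in the usage above

# Hypothetical helper (not part of the package): transform non-zero cells only.
on_nonzero <- function(m, f) {
  m[m > 0] <- f(m[m > 0])
  m
}

# prop: each count divided by the document (row) total.
prop <- tf / rowSums(tf)

# propmax: each count divided by the largest count in the document.
propmax <- tf / apply(tf, 1, max)

# logcount: 1 + log_base(count) for non-zero counts, 0 otherwise.
logcount <- on_nonzero(tf, function(v) 1 + log(v, base))

# boolean: recode all non-zero counts as 1.
boolean <- (tf > 0) * 1

# augmented: the formula applied literally to every cell. Whether the package
# leaves zero cells at zero instead is not verified here.
augmented <- k + (1 - k) * propmax

# logave: logcount divided by 1 + log_base(average count in the document).
# Assumption: the average is taken over the non-zero features of the document.
avg_count <- rowSums(tf) / rowSums(tf > 0)
logave <- logcount / (1 + log(avg_count, base))

round(prop, 3)
round(logave, 3)
```

Dividing a matrix by a vector of row statistics works here because R recycles the vector down the columns, so every cell in row i is divided by the i-th element.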

dfm_weight returns the dfm with weighted values. Note that because the default weighting scheme is "count", simply calling this function on an unweighted dfm will return the same object. Many users will want the normalized dfm consisting of the proportions of the feature counts within each document, which requires setting scheme = "prop".

dfm_smooth returns a dfm whose values have been smoothed by adding the smoothing amount. Note that this effectively converts a matrix from sparse to dense format, so may exceed memory requirements depending on the size of your input matrix.
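
To make the memory note concrete, here is a minimal base R sketch of why additive smoothing removes sparsity. It uses a plain matrix rather than a dfm, and the document and feature counts in the size estimate (`n_docs`, `n_feats`) are hypothetical numbers chosen for illustration.

```r
# Toy counts with two zero cells out of six.
tf <- matrix(c(1, 2, 0,
               4, 0, 4),
             nrow = 2, byrow = TRUE)

# Additive smoothing: the same constant is added to every cell.
smoothed <- tf + 0.5

mean(tf != 0)        # share of non-zero cells before smoothing
mean(smoothed != 0)  # share after smoothing: no zero cells remain

# Rough size of a fully dense numeric matrix, at 8 bytes per double.
n_docs  <- 10000    # hypothetical
n_feats <- 100000   # hypothetical
dense_bytes <- n_docs * n_feats * 8
dense_bytes / 1e9   # approximate size in gigabytes
```

A sparse format only pays for non-zero cells, so once every cell is non-zero the storage cost grows with the full number of documents times features.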

Manning, C.D., Raghavan, P., & Schütze, H. (2008). An Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

docfreq()

dfmat1 <- dfm(data_corpus_inaugural)

dfmat2 <- dfm_weight(dfmat1, scheme = "prop")
topfeatures(dfmat2)
dfmat3 <- dfm_weight(dfmat1)
topfeatures(dfmat3)
dfmat4 <- dfm_weight(dfmat1, scheme = "logcount")
topfeatures(dfmat4)
dfmat5 <- dfm_weight(dfmat1, scheme = "logave")
topfeatures(dfmat5)

# combine these methods for more complex weightings, e.g. as in Section 6.4
# of Introduction to Information Retrieval
head(dfm_tfidf(dfmat1, scheme_tf = "logcount"))

# apply numeric weights
txt <- c("apple is better than banana", "banana banana apple much better")
(dfmat6 <- dfm(txt, remove = stopwords("english")))
dfm_weight(dfmat6, weights = c(apple = 5, banana = 3, much = 0.5))

# smooth the dfm
dfmat <- dfm(data_corpus_inaugural)
dfm_smooth(dfmat, 0.5)
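
The effect of the named `weights` vector in the "apply numeric weights" example can be sketched in base R without the package. The count matrix below is hand-built to resemble that example and is an assumption, not actual `dfm()` output; the names `counts`, `mult`, and `weighted` are likewise illustrative.

```r
# Hand-built counts (assumed for illustration, not produced by dfm()).
counts <- matrix(c(1, 1, 1, 0,
                   1, 1, 2, 1),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("text1", "text2"),
                                 c("apple", "better", "banana", "much")))

# Named multipliers, as in the weights argument of the example.
w <- c(apple = 5, banana = 3, much = 0.5)

# Every feature starts with a multiplier of 1 (unchanged); named ones override.
mult <- setNames(rep(1, ncol(counts)), colnames(counts))
hits <- intersect(names(w), colnames(counts))
mult[hits] <- w[hits]

# Multiply each feature column by its multiplier.
weighted <- sweep(counts, 2, mult, `*`)
weighted
```

Features that are not named in the vector, such as `better` here, keep a multiplier of 1, which matches the description of the `weights` argument.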