# docfreq: Compute the (weighted) document frequency of a feature

In quanteda/quanteda: Quantitative Analysis of Textual Data

## Description

For a `dfm` object, returns a (weighted) document frequency for each feature. The default is a simple count of the number of documents in which a feature occurs more often than a given frequency threshold. (The default threshold is zero, meaning that any feature occurring at least once in a document will be counted.)
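The default count can be sketched in base R without quanteda, assuming a plain numeric matrix with documents in rows and features in columns. The helper name `count_docfreq` and the toy matrix are hypothetical and serve only as illustration; this is not the package's implementation.

```r
# Toy document-feature counts: rows are documents, columns are features.
counts <- matrix(c(1, 0, 3,
                   2, 0, 0),
                 byrow = TRUE, nrow = 2,
                 dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))

# Hypothetical helper: for each feature, the number of documents in which
# its count is strictly greater than `threshold`.
count_docfreq <- function(m, threshold = 0) {
  colSums(m > threshold)
}

df_default <- count_docfreq(counts)                 # threshold = 0
df_strict  <- count_docfreq(counts, threshold = 1)  # must occur more than once
```

With `threshold = 1`, feature `a` is counted only for `doc2`, because its single occurrence in `doc1` does not exceed the threshold.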

## Usage

    docfreq(x, scheme = c("count", "inverse", "inversemax", "inverseprob", "unary"),
            smoothing = 0, k = 0, base = 10, threshold = 0, use.names = TRUE)

## Arguments

- `x`: a dfm
- `scheme`: type of document frequency weighting, computed as follows, where $N$ is defined as the number of documents in the dfm and $s$ is the smoothing constant:
    - `count`: $df_j$, the number of documents for which $n_{ij} > \textrm{threshold}$
    - `inverse`: $\log_{base}\left(s + \frac{N}{k + df_j}\right)$
    - `inversemax`: $\log_{base}\left(s + \frac{\max(df_j)}{k + df_j}\right)$
    - `inverseprob`: $\log_{base}\left(\frac{N - df_j}{k + df_j}\right)$
    - `unary`: 1 for each feature
- `smoothing`: added to the quotient before taking the logarithm
- `k`: added to the denominator in the "inverse" weighting types, to avoid dividing by a zero document count for a term
- `base`: the base with respect to which logarithms in the inverse document frequency weightings are computed; default is 10 (see Manning, Raghavan, and Schütze 2008, p. 123).
- `threshold`: numeric value of the threshold above which a feature will be considered in the computation of document frequency. The default is 0, meaning that a feature's document frequency will be the number of documents in which it occurs more than zero times.
- `use.names`: logical; if `TRUE`, attach feature labels as names of the resulting numeric vector
- `...`: not used
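The weighting formulas above can be written out directly in base R. The sketch below is a hypothetical helper (`weight_docfreq` is not a quanteda function) that takes a vector of document frequencies and the number of documents, and applies each formula as documented. It makes no claim about how the package handles edge cases.

```r
# Hypothetical helper mirroring the documented formulas; not quanteda's code.
# `df` is a named vector of document frequencies, `N` the number of documents.
weight_docfreq <- function(df, N,
                           scheme = c("count", "inverse", "inversemax",
                                      "inverseprob", "unary"),
                           smoothing = 0, k = 0, base = 10) {
  scheme <- match.arg(scheme)
  switch(scheme,
    count       = df,
    inverse     = log(smoothing + N / (k + df), base = base),
    inversemax  = log(smoothing + max(df) / (k + df), base = base),
    inverseprob = log((N - df) / (k + df), base = base),
    unary       = stats::setNames(rep(1, length(df)), names(df))
  )
}

# Toy input: feature "a" occurs in all 4 documents, feature "b" in 1.
df <- c(a = 4, b = 1)
idf        <- weight_docfreq(df, N = 4, scheme = "inverse")
idf_smooth <- weight_docfreq(df, N = 4, scheme = "inverse",
                             k = 1, smoothing = 1)
```

Note that under the `inverseprob` formula as written, a feature occurring in every document gives a quotient of zero, so this sketch returns `-Inf` for it. The documentation above does not state how `docfreq()` treats that case.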

## Value

a numeric vector of document frequencies for each feature

## References

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge: Cambridge University Press. https://nlp.stanford.edu/IR-book/pdf/irbookonlinereading.pdf

## Examples

    dfmat1 <- dfm(data_corpus_inaugural[1:2])
    docfreq(dfmat1[, 1:20])

    # replication of worked example from
    # https://en.wikipedia.org/wiki/Tf-idf#Example_of_tf.E2.80.93idf
    dfmat2 <-
        matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3),
               byrow = TRUE, nrow = 2,
               dimnames = list(docs = c("document1", "document2"),
                               features = c("this", "is", "a", "sample",
                                            "another", "example"))) %>%
        as.dfm()
    dfmat2
    docfreq(dfmat2)
    docfreq(dfmat2, scheme = "inverse")
    docfreq(dfmat2, scheme = "inverse", k = 1, smoothing = 1)
    docfreq(dfmat2, scheme = "unary")
    docfreq(dfmat2, scheme = "inversemax")
    docfreq(dfmat2, scheme = "inverseprob")
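As a hand check on the two-document matrix in the example, the formulas from the Arguments section can be applied directly. Here $N = 2$; "this" and "is" occur in both documents ($df_j = 2$), and the other four features occur in one ($df_j = 1$). The values below are derived from the documented formulas rather than copied from package output, so they are what `docfreq()` should return only insofar as the implementation follows those formulas as stated.

```latex
% scheme = "inverse", default k = 0 and s = 0, base 10
\log_{10}\left(\frac{2}{2}\right) = 0
\qquad
\log_{10}\left(\frac{2}{1}\right) \approx 0.301

% scheme = "inverse", k = 1, smoothing s = 1
\log_{10}\left(1 + \frac{2}{1 + 2}\right) = \log_{10}\left(\frac{5}{3}\right) \approx 0.222
\qquad
\log_{10}\left(1 + \frac{2}{1 + 1}\right) = \log_{10}(2) \approx 0.301
```

In each pair, the left-hand value applies to "this" and "is", and the right-hand value to "a", "sample", "another", and "example". Because $\max(df_j) = N = 2$ for this matrix, the `inversemax` formula yields the same values as `inverse` with the default settings.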

quanteda/quanteda documentation built on June 15, 2019, 8:36 a.m.