tfidf: compute tf-idf weights from a dfm


View source: R/dfm_weight.R

Description

Compute tf-idf, inverse document frequency, and relative term frequency on document-feature matrices. See also weight.

Usage

tfidf(x, normalize = FALSE, scheme = "inverse", ...)

Arguments

x

object for which idf or tf-idf will be computed (a document-feature matrix)

normalize

if TRUE, use relative term frequency (each count divided by its document's total) instead of raw counts

scheme

document frequency scheme passed to docfreq (default "inverse")

...

additional arguments passed to docfreq when calling tfidf

Details

tfidf computes term frequency-inverse document frequency weighting. By default, term frequency is left as raw counts. Set normalize = TRUE to use relative term frequency within each document instead.
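The weighting can be reproduced by hand to see what tfidf returns. The following is a minimal base R sketch, not quanteda's implementation. It assumes that the default scheme = "inverse" corresponds to log10(N / df), where N is the number of documents and df is the number of documents containing the feature; this assumption is consistent with the example output on this page. The names counts, doc_freq, idf, tfidf_manual and rel_tf are illustrative only.

```r
# Counts from the Wikipedia worked example used in the Examples section
counts <- matrix(c(1, 1, 2, 1, 0, 0,
                   1, 1, 0, 0, 2, 3),
                 byrow = TRUE, nrow = 2,
                 dimnames = list(docs = c("document1", "document2"),
                                 features = c("this", "is", "a", "sample",
                                              "another", "example")))

n_docs   <- nrow(counts)           # N: number of documents
doc_freq <- colSums(counts > 0)    # df: documents containing each feature
idf      <- log10(n_docs / doc_freq)  # assumed form of the "inverse" scheme

# Unnormalized tf-idf: multiply each feature column by its idf
tfidf_manual <- sweep(counts, 2, idf, `*`)
round(tfidf_manual, 5)

# Relative term frequency, as assumed for normalize = TRUE:
# each row is divided by its document total before applying idf
rel_tf <- counts / rowSums(counts)
round(sweep(rel_tf, 2, idf, `*`), 5)
```

Under these assumptions the unnormalized result matches the tfidf(wikiDfm) values shown in the example output: 0.60206 for "a" and 0.30103 for "sample" in document1, and 0.60206 for "another" and 0.90309 for "example" in document2. Features that occur in every document ("this", "is") get weight 0 because log10(2 / 2) = 0.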

References

Manning, C. D., Raghavan, P., & Schütze, H. (2008). Introduction to Information Retrieval. Cambridge University Press.

Examples

head(data_dfm_LBGexample[, 5:10])
head(tfidf(data_dfm_LBGexample)[, 5:10])
docfreq(data_dfm_LBGexample)[5:15]
head(tf(data_dfm_LBGexample)[, 5:10])

# replication of worked example from
# https://en.wikipedia.org/wiki/Tf-idf#Example_of_tf.E2.80.93idf
(wikiDfm <- new("dfmSparse", 
                Matrix::Matrix(c(1,1,2,1,0,0, 1,1,0,0,2,3),
                   byrow = TRUE, nrow = 2,  
                   dimnames = list(docs = c("document1", "document2"), 
                     features = c("this", "is", "a", "sample", "another",
                                  "example")), sparse = TRUE)))
docfreq(wikiDfm)
tfidf(wikiDfm)

Example output

quanteda version 0.9.9.65
Disabling parallel computing

Attaching package: 'quanteda'

The following object is masked from 'package:utils':

    View

Document-feature matrix of: 6 documents, 6 features (61.1% sparse).
(showing first 6 documents and first 6 features)
    features
docs  E  F   G   H   I   J
  R1 45 78 115 146 158 146
  R2  0  2   3  10  22  45
  R3  0  0   0   0   0   0
  R4  0  0   0   0   0   0
  R5  0  0   0   0   0   0
  V1  0  0   0   2   3  10
Document-feature matrix of: 6 documents, 6 features (61.1% sparse).
(showing first 6 documents and first 6 features)
    features
docs        E          F         G        H        I        J
  R1 35.01681 37.2154579 54.868944 43.95038 47.56274 43.95038
  R2  0.00000  0.9542425  1.431364  3.01030  6.62266 13.54635
  R3  0.00000  0.0000000  0.000000  0.00000  0.00000  0.00000
  R4  0.00000  0.0000000  0.000000  0.00000  0.00000  0.00000
  R5  0.00000  0.0000000  0.000000  0.00000  0.00000  0.00000
  V1  0.00000  0.0000000  0.000000  0.60206  0.90309  3.01030
E F G H I J K L M N O 
1 2 2 3 3 3 4 4 4 4 4 
Document-feature matrix of: 6 documents, 6 features (61.1% sparse).
(showing first 6 documents and first 6 features)
    features
docs  E  F   G   H   I   J
  R1 45 78 115 146 158 146
  R2  0  2   3  10  22  45
  R3  0  0   0   0   0   0
  R4  0  0   0   0   0   0
  R5  0  0   0   0   0   0
  V1  0  0   0   2   3  10
Document-feature matrix of: 2 documents, 6 features (33.3% sparse).
2 x 6 sparse Matrix of class "dfmSparse"
           features
docs        this is a sample another example
  document1    1  1 2      1       0       0
  document2    1  1 0      0       2       3
   this      is       a  sample another example 
      2       2       1       1       1       1 
Document-feature matrix of: 2 documents, 6 features (33.3% sparse).
2 x 6 sparse Matrix of class "dfmSparse"
           features
docs        this is       a  sample another example
  document1    0  0 0.60206 0.30103 0       0      
  document2    0  0 0       0       0.60206 0.90309

quanteda documentation built on May 29, 2017, 11:37 p.m.
