matrix: Term-Document Matrix

Description

Constructs or coerces to a term-document matrix or a document-term matrix.

Usage

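TermDocumentMatrix(x, control = list())
DocumentTermMatrix(x, control = list())
as.TermDocumentMatrix(x, ...)
as.DocumentTermMatrix(x, ...)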

Arguments

x

a corpus for the constructors; for the coercing functions, a term-document matrix, a document-term matrix, a simple triplet matrix (package slam), or a term frequency vector.

control

a named list of control options. There are local options which are evaluated for each document and global options which are evaluated once for the constructed matrix. Available local options are documented in termFreq and are internally delegated to a termFreq call.

This is different for a SimpleCorpus. In this case all options are processed in a fixed order in one pass to improve performance. It always uses the Boost (https://www.boost.org) Tokenizer (via Rcpp) and takes no custom functions as option arguments.
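As an illustration of how local options reach termFreq for a corpus that is not a SimpleCorpus, the sketch below builds a one-document matrix from the crude data set shipped with tm and compares it with a direct termFreq call using the same control list. The variable names ctrl, tf and tdm_one are chosen for this example only, and the comparison assumes that the default global options leave the counts unchanged.

```r
library(tm)
data("crude")

# Local options: these are evaluated per document by termFreq().
ctrl <- list(removePunctuation = TRUE, stopwords = TRUE)

# Term frequencies of the first document, computed directly.
tf <- termFreq(crude[[1]], control = ctrl)

# The same document, processed through the matrix constructor.
tdm_one <- TermDocumentMatrix(crude[1], control = ctrl)

# Compare the single matrix column with the termFreq() result.
same_counts <- all(as.matrix(tdm_one)[names(tf), 1] == as.integer(tf))
same_counts
```

If the two results agree, same_counts is TRUE, which is what the delegation described above would suggest.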

Available global options are:

bounds

A list with a tag global whose value must be an integer vector of length 2. Terms that appear in fewer documents than the lower bound bounds$global[1] or in more documents than the upper bound bounds$global[2] are discarded. Defaults to list(global = c(1, Inf)) (i.e., every term will be used).

weighting

A weighting function capable of handling a TermDocumentMatrix. It defaults to weightTf for term frequency weighting. Available weighting functions shipped with the tm package are weightTf, weightTfIdf, weightBin, and weightSMART.
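The two global options can be combined in one control list. The following is a minimal sketch using the crude data set; the bounds c(2, 10) and the choice of weightBin are arbitrary values picked for illustration, and tdm_bounded and doc_freq are names introduced here.

```r
library(tm)
data("crude")

# Keep only terms occurring in at least 2 and at most 10 documents,
# and store presence/absence (weightBin) instead of raw counts.
tdm_bounded <- TermDocumentMatrix(
  crude,
  control = list(bounds = list(global = c(2, 10)),
                 weighting = weightBin))

# Number of documents each remaining term appears in.
doc_freq <- rowSums(as.matrix(tdm_bounded) > 0)
range(doc_freq)
```

The range of doc_freq should fall within the requested bounds, which gives a quick check that the filter behaved as intended.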

...

the additional argument weighting (typically a WeightFunction) is allowed when coercing a simple triplet matrix to a term-document or document-term matrix.
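A small sketch of the coercion path: a hand-made count matrix is converted to a simple triplet matrix with slam and then coerced, passing weighting explicitly. The terms, document names and counts are invented for this example, and the rows are assumed to be terms and the columns documents.

```r
library(tm)

# Toy counts: rows are terms, columns are documents (made-up data).
counts <- matrix(c(2, 0, 1,
                   0, 3, 1),
                 nrow = 3,
                 dimnames = list(c("oil", "price", "texas"),
                                 c("doc1", "doc2")))

# Convert to slam's sparse representation, then coerce.
stm <- slam::as.simple_triplet_matrix(counts)
tdm_toy <- as.TermDocumentMatrix(stm, weighting = weightTf)

inspect(tdm_toy)
```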

Value

An object of class TermDocumentMatrix or class DocumentTermMatrix (both inheriting from a simple triplet matrix in package slam) containing a sparse term-document matrix or document-term matrix. The attribute weighting contains the weighting applied to the matrix.
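The structure of the returned object can be inspected with base R tools. This sketch uses the crude data set and default control options; tdm_value and dtm_value are names introduced for the example.

```r
library(tm)
data("crude")

tdm_value <- TermDocumentMatrix(crude)
dtm_value <- DocumentTermMatrix(crude)

# Class vector, including the slam parent class.
class(tdm_value)
inherits(tdm_value, "simple_triplet_matrix")

# The weighting applied to the matrix is stored as an attribute.
attr(tdm_value, "weighting")

# Terms x documents versus documents x terms.
dim(tdm_value)
dim(dtm_value)
```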

See Also

termFreq for available local control options.

Examples

library("tm")
data("crude")

## Term-document matrix with punctuation and stopwords removed
tdm <- TermDocumentMatrix(crude,
                          control = list(removePunctuation = TRUE,
                                         stopwords = TRUE))

## Document-term matrix with unnormalized tf-idf weighting
dtm <- DocumentTermMatrix(crude,
                          control = list(weighting =
                                         function(x)
                                         weightTfIdf(x, normalize =
                                                     FALSE),
                                         stopwords = TRUE))

## Subset by position or by term and document names
inspect(tdm[202:205, 1:5])
inspect(tdm[c("price", "prices", "texas"), c("127", "144", "191", "194")])
inspect(dtm[1:5, 273:276])

## SimpleCorpus: options are processed in a fixed order in one pass
s <- SimpleCorpus(VectorSource(unlist(lapply(crude, as.character))))
m <- TermDocumentMatrix(s,
                        control = list(removeNumbers = TRUE,
                                       stopwords = TRUE,
                                       stemming = TRUE))
inspect(m[c("price", "texa"), c("127", "144", "191", "194")])

Example output

Loading required package: NLP
<<TermDocumentMatrix (terms: 4, documents: 5)>>
Non-/sparse entries: 6/14
Sparsity           : 70%
Maximal term length: 9
Weighting          : term frequency (tf)
Sample             :
           Docs
Terms       127 144 191 194 211
  companies   1   1   0   0   0
  company     1   0   0   1   0
  companys    0   0   1   0   0
  compared    0   0   0   0   1
<<TermDocumentMatrix (terms: 3, documents: 4)>>
Non-/sparse entries: 8/4
Sparsity           : 33%
Maximal term length: 6
Weighting          : term frequency (tf)
Sample             :
        Docs
Terms    127 144 191 194
  price    2   1   2   2
  prices   3   5   0   0
  texas    1   0   0   2
<<DocumentTermMatrix (documents: 5, terms: 4)>>
Non-/sparse entries: 6/14
Sparsity           : 70%
Maximal term length: 9
Weighting          : term frequency - inverse document frequency (tf-idf)
Sample             :
     Terms
Docs  companies  company company's compared
  127  2.736966 2.321928  0.000000 0.000000
  144  2.736966 0.000000  0.000000 0.000000
  191  0.000000 0.000000  4.321928 0.000000
  194  0.000000 2.321928  0.000000 0.000000
  211  0.000000 0.000000  0.000000 2.736966
<<TermDocumentMatrix (terms: 2, documents: 4)>>
Non-/sparse entries: 6/2
Sparsity           : 25%
Maximal term length: 5
Weighting          : term frequency (tf)
Sample             :
       Docs
Terms   127 144 191 194
  price   5   6   2   2
  texa    1   0   0   2

tm documentation built on April 7, 2021, 3:01 a.m.