termFreq: Term Frequency Vector


View source: R/matrix.R


Description

Generate a term frequency vector from a text document.


Usage

termFreq(doc, control = list())



Arguments

doc: An object inheriting from TextDocument or a character vector.

control: A list of control options which override default settings.

First, the following two options are processed.


tokenize: A function tokenizing a TextDocument into single tokens, a Span_Tokenizer, a Token_Tokenizer, or a string matching one of the predefined tokenization functions:

"Boost" for Boost_tokenizer, or

"MC" for MC_tokenizer, or

"scan" for scan_tokenizer, or

"words" for words.

Defaults to words.


tolower: Either a logical value indicating whether characters should be translated to lower case or a custom function converting characters to lower case. Defaults to tolower.
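As a rough illustration of the two options above, the sketch below selects a predefined tokenizer by name and, separately, passes a custom tokenizer function. The sample string `txt` and the helper `space_tokenizer` are made up for this example, and it assumes the tm package is installed.

```r
library(tm)

# Hypothetical sample text, used only for illustration.
txt <- "Oil prices fell. OPEC said oil prices were weak."

# Predefined tokenizer selected by name; tolower = FALSE preserves case.
tf_scan <- termFreq(txt, control = list(tokenize = "scan", tolower = FALSE))

# Custom tokenizer function; lower-casing is left at its default.
space_tokenizer <- function(x)
    unlist(strsplit(as.character(x), "[[:space:]]+"))
tf_custom <- termFreq(txt, control = list(tokenize = space_tokenizer))

names(tf_scan)    # "Oil" and "oil" stay separate terms
names(tf_custom)  # both are counted under "oil"
```

Punctuation is not removed by default, so tokens such as "fell." keep their trailing period in both results.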

Next comes a set of options that are sensitive to their order of occurrence in the control list. Options are processed in the order specified. User-specified options take precedence over the default ordering: all user-specified options are processed first, followed by the remaining options (with their default settings, in the order listed below).


language: A character string giving the language (preferably as an IETF language tag, see language in package NLP) to be used for stopwords and stemming if not provided by doc.

removePunctuation: A logical value indicating whether punctuation characters should be removed from doc, a custom function which performs punctuation removal, or a list of arguments for removePunctuation. Defaults to FALSE.

removeNumbers: A logical value indicating whether numbers should be removed from doc or a custom function for number removal. Defaults to FALSE.

stopwords: Either a logical value indicating stopword removal using the default language-specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.

stemming: Either a logical value indicating whether tokens should be stemmed or a custom stemming function. Defaults to FALSE.
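The ordering rule for the order-sensitive options can be sketched in base R as follows. This is only an illustration of the rule as described on this page, not the package's own code: `processing_order` is a hypothetical helper, and `default_order` assumes that the four processing steps (punctuation removal, number removal, stopword removal, stemming) are the ones being ordered, with language acting as a setting rather than a step.

```r
# Assumed default order of the order-sensitive processing steps.
default_order <- c("removePunctuation", "removeNumbers", "stopwords", "stemming")

# Hypothetical helper: user-specified options come first, in the order given,
# followed by the remaining options in their default order.
processing_order <- function(control) {
    user <- names(control)[names(control) %in% default_order]
    c(user, setdiff(default_order, user))
}

# Stemming is requested before stopword removal, so it runs first.
ord <- processing_order(list(stemming = TRUE, stopwords = TRUE))
ord
```

Under this rule, listing stemming before stopwords means custom stopwords are matched against already stemmed tokens, which may change which tokens are removed.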

Finally, the following options are processed in the given order.


dictionary: A character vector to be tabulated against. No other terms will be listed in the result. Defaults to NULL, which means that all terms in doc are listed.

bounds: A list with a tag local whose value must be an integer vector of length 2. Terms that appear less often in doc than the lower bound bounds$local[1] or more often than the upper bound bounds$local[2] are discarded. Defaults to list(local = c(1, Inf)) (i.e., every token will be used).

wordLengths: An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.
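The effect of the three final options can be mimicked in base R as a rough sketch. It follows the behavior described above rather than the package's implementation; the token vector and the option values are invented for illustration.

```r
# Hypothetical tokens, as they might look after tokenization and lower-casing.
tokens <- c("oil", "opec", "oil", "prices", "the", "of", "opec", "oil")

# dictionary: only terms in the dictionary are tabulated.
dictionary <- c("oil", "opec", "prices")
tab <- table(tokens[tokens %in% dictionary])

# bounds$local: keep terms whose frequency lies within [lower, upper].
bounds <- list(local = c(2, Inf))
tab <- tab[tab >= bounds$local[1] & tab <= bounds$local[2]]

# wordLengths: keep terms whose character length lies within [min, max].
wordLengths <- c(4, Inf)
len <- nchar(names(tab), type = "chars")
tab <- tab[len >= wordLengths[1] & len <= wordLengths[2]]

tab
```

Here "prices" is dropped by the frequency bound (it occurs once) and "oil" by the minimum word length, so only "opec" remains.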


Value

A table of class c("term_frequency", "integer") with term frequencies as values and tokens as names.

See Also



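Examples

library(tm)
data("crude")

# Term frequencies with the default control settings
termFreq(crude[[14]])

# Custom tokenizer that splits on whitespace
strsplit_space_tokenizer <- function(x)
    unlist(strsplit(as.character(x), "[[:space:]]+"))
# Keep intra-word dashes, drop two custom stopwords, stem the tokens,
# and discard terms shorter than 4 characters
ctrl <- list(tokenize = strsplit_space_tokenizer,
             removePunctuation = list(preserve_intra_word_dashes = TRUE),
             stopwords = c("reuter", "that"),
             stemming = TRUE,
             wordLengths = c(4, Inf))
termFreq(crude[[14]], control = ctrl)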

Example output

Loading required package: NLP

        "none        (bpd).     13-nation       948,000         above 
            1             1             1             1             1 
        after    al-khalifa      al-qabas      al-sabah           ali 
            1             1             1             1             1 
         also      analysts           and         asked       barrels 
            1             1             1             1             1 
          bpd         crude         daily        denied     emergency 
            1             2             2             1             1 
    estimated          fell           for           has international 
            1             1             2             1             1 
    interview           its        kuwait      kuwait's          last 
            1             2             1             1             1 
      limits.         local       meeting     meeting."       members 
            1             1             1             1             1 
      million      minister     newspaper           oil           one 
            1             1             1             4             1 
         opec          over         plans        prices       prices. 
            4             1             1             1             1 
      pumping         quota        quoted        recent        reuter 
            2             1             1             1             1 
         said        saying  self-imposed       sharply        sheikh 
            1             1             1             1             1 
         such          that           the         there       traders 
            1             3             4             1             1 
          was      weakness          week          were         world 
            3             1             1             1             1 

 13-nation     948000       abov      after al-khalifa    al-qaba   al-sabah 
         1          1          1          1          1          1          1 
      also    analyst     barrel      crude      daili       deni      emerg 
         1          1          1          2          2          1          1 
     estim       fell     intern  interview     kuwait       last      limit 
         1          1          1          1          2          1          1 
     local       meet     member    million     minist    newspap       none 
         1          2          1          1          1          1          1 
      opec       over       plan      price       pump       quot      quota 
         4          1          1          2          2          1          1 
    recent       said self-impos    sharpli     sheikh       such      there 
         1          1          1          1          1          1          1 
    trader       weak       week       were      world 
         1          1          1          1          1 

tm documentation built on Nov. 18, 2020, 5:07 p.m.