termFreq: Term Frequency Vector


View source: R/matrix.R

Description

Generate a term frequency vector from a text document.

Usage

termFreq(doc, control = list())

Arguments

doc

An object inheriting from TextDocument or a character vector.

control

A list of control options which override default settings.

First, the following two options are processed.

tokenize

A function tokenizing a TextDocument into single tokens, a Span_Tokenizer, Token_Tokenizer, or a string matching one of the predefined tokenization functions:

"Boost"

for Boost_tokenizer, or

"MC"

for MC_tokenizer, or

"scan"

for scan_tokenizer, or

"words"

for words.

Defaults to words.

tolower

Either a logical value indicating whether characters should be translated to lower case or a custom function converting characters to lower case. Defaults to tolower.
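
As a rough illustration of these two options, the sketch below compares the default settings with case folding switched off. The sample sentence and the variable names are made up for illustration; only termFreq, PlainTextDocument, and the documented control options come from tm.

```r
library(tm)

# Hypothetical sample document (not part of the package data).
doc <- PlainTextDocument("Oil prices fell. OIL output rose.")

# Defaults: the words tokenizer plus conversion to lower case,
# so "Oil" and "OIL" are counted as the same term.
tf_default <- termFreq(doc)

# Select a predefined tokenizer by name and keep the original case,
# so "Oil" and "OIL" remain distinct terms.
tf_case <- termFreq(doc, control = list(tokenize = "scan", tolower = FALSE))

tf_default[["oil"]]
names(tf_case)
```

Punctuation stays attached to the tokens here (for example "fell.") because removePunctuation defaults to FALSE.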

Next, a set of options that are sensitive to their order of occurrence in the control list is processed. Options are processed in the order in which they are specified. User-specified options take precedence over the default ordering: all user-specified options are processed first, followed by all remaining options (with their default settings, in the order listed below).

language

A character giving the language (preferably as IETF language tags, see language in package NLP) to be used for stopwords and stemming if not provided by doc.

removePunctuation

A logical value indicating whether punctuation characters should be removed from doc, a custom function which performs punctuation removal, or a list of arguments for removePunctuation. Defaults to FALSE.

removeNumbers

A logical value indicating whether numbers should be removed from doc or a custom function for number removal. Defaults to FALSE.

stopwords

Either a Boolean value indicating stopword removal using default language specific stopword lists shipped with this package, a character vector holding custom stopwords, or a custom function for stopword removal. Defaults to FALSE.

stemming

Either a Boolean value indicating whether tokens should be stemmed or a custom stemming function. Defaults to FALSE.
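
The effect of the ordering can be sketched with two control lists that contain the same options in a different order. The sample text and the toy stemmer strip_s are hypothetical and exist only to make the difference visible without relying on a real stemming package.

```r
library(tm)

# Hypothetical sample document.
doc <- PlainTextDocument("prices price oil")

# Hypothetical toy stemmer that only strips a trailing "s";
# it stands in for a real stemming function.
strip_s <- function(x) sub("s$", "", x)

# Stopword removal first, then stemming: "price" is dropped, after which
# "prices" is stemmed to "price" and survives.
tf_stop_first <- termFreq(doc, control = list(stopwords = "price",
                                              stemming = strip_s))

# Stemming first, then stopword removal: both tokens become "price"
# and are then removed as stopwords.
tf_stem_first <- termFreq(doc, control = list(stemming = strip_s,
                                              stopwords = "price"))

names(tf_stop_first)
names(tf_stem_first)
```

If the order matters for an analysis, it is probably safest to list all four order-sensitive options explicitly rather than rely on the default ordering.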

Finally, the following options are processed in the given order.

dictionary

A character vector to be tabulated against. No other terms will be listed in the result. Defaults to NULL which means that all terms in doc are listed.

bounds

A list with a tag local whose value must be an integer vector of length 2. Terms that appear less often in doc than the lower bound bounds$local[1] or more often than the upper bound bounds$local[2] are discarded. Defaults to list(local = c(1, Inf)) (i.e., every token will be used).

wordLengths

An integer vector of length 2. Words shorter than the minimum word length wordLengths[1] or longer than the maximum word length wordLengths[2] are discarded. Defaults to c(3, Inf), i.e., a minimum word length of 3 characters.
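
A minimal sketch of the three final options, each applied on its own to a made-up document (the text and variable names are illustrative assumptions):

```r
library(tm)

# Hypothetical sample document.
doc <- PlainTextDocument("oil oil oil gas gas ok crude")

# Keep only terms that occur at least twice in the document.
tf_bounds <- termFreq(doc, control = list(bounds = list(local = c(2, Inf))))

# Lower the minimum word length so that the two-letter token "ok" is kept.
tf_len <- termFreq(doc, control = list(wordLengths = c(1, Inf)))

# Tabulate against a fixed vocabulary only.
tf_dict <- termFreq(doc, control = list(dictionary = c("oil", "crude")))

names(tf_bounds)
names(tf_len)
names(tf_dict)
```

With the default wordLengths of c(3, Inf), the token "ok" would not appear in the result at all.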

Value

A table of class c("term_frequency", "integer") with term frequencies as values and tokens as names.
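
One possible way to work with the returned object is to treat it as a named integer vector. The sketch below uses a made-up document, and the unclass step is only a convenience assumed here so that base functions see a plain vector.

```r
library(tm)

# Hypothetical sample document.
doc <- PlainTextDocument("oil oil gas crude crude crude")

tf <- termFreq(doc)
class(tf)

# Drop the class attribute to work with the plain named counts.
counts <- unclass(tf)

# Terms ordered from most to least frequent.
sort(counts, decreasing = TRUE)

# Terms that occur at least twice.
frequent <- names(counts)[counts >= 2]
frequent
```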

See Also

getTokenizers

Examples

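library("tm")

# Load the example corpus of Reuters crude oil articles shipped with tm.
data("crude")

# Term frequencies for document 14 using the default control settings.
termFreq(crude[[14]])

# Custom tokenizer: split the document text on runs of whitespace.
strsplit_space_tokenizer <- function(x)
    unlist(strsplit(as.character(x), "[[:space:]]+"))

# Custom control list; stemming = TRUE requires the SnowballC package.
ctrl <- list(tokenize = strsplit_space_tokenizer,
             removePunctuation = list(preserve_intra_word_dashes = TRUE),
             stopwords = c("reuter", "that"),
             stemming = TRUE,
             wordLengths = c(4, Inf))
termFreq(crude[[14]], control = ctrl)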

Example output

Loading required package: NLP

        "none        (bpd).     13-nation       948,000         above 
            1             1             1             1             1 
        after    al-khalifa      al-qabas      al-sabah           ali 
            1             1             1             1             1 
         also      analysts           and         asked       barrels 
            1             1             1             1             1 
          bpd         crude         daily        denied     emergency 
            1             2             2             1             1 
    estimated          fell           for           has international 
            1             1             2             1             1 
    interview           its        kuwait      kuwait's          last 
            1             2             1             1             1 
      limits.         local       meeting     meeting."       members 
            1             1             1             1             1 
      million      minister     newspaper           oil           one 
            1             1             1             4             1 
         opec          over         plans        prices       prices. 
            4             1             1             1             1 
      pumping         quota        quoted        recent        reuter 
            2             1             1             1             1 
         said        saying  self-imposed       sharply        sheikh 
            1             1             1             1             1 
         such          that           the         there       traders 
            1             3             4             1             1 
          was      weakness          week          were         world 
            3             1             1             1             1 

 13-nation     948000       abov      after al-khalifa    al-qaba   al-sabah 
         1          1          1          1          1          1          1 
      also    analyst     barrel      crude      daili       deni      emerg 
         1          1          1          2          2          1          1 
     estim       fell     intern  interview     kuwait       last      limit 
         1          1          1          1          2          1          1 
     local       meet     member    million     minist    newspap       none 
         1          2          1          1          1          1          1 
      opec       over       plan      price       pump       quot      quota 
         4          1          1          2          2          1          1 
    recent       said self-impos    sharpli     sheikh       such      there 
         1          1          1          1          1          1          1 
    trader       weak       week       were      world 
         1          1          1          1          1 

tm documentation built on April 7, 2021, 3:01 a.m.