count the number of tokens or types

Description

Return the count of tokens (total features) or types (unique features) in a text, corpus, or dfm. Here "tokens" means every word occurrence, not just the unique words, and the text is not cleaned before counting.
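The distinction between tokens and types can be sketched in base R. This is only an illustration under stated assumptions: the helper names `count_tokens` and `count_types` are hypothetical, and the regular-expression splitter below is a stand-in, not the package's tokenizer, so its counts need not match those of ntoken and ntype.

```r
# Hypothetical helpers for illustration; not part of the package.
# Assumed tokenizer: each run of letters/digits is one token, and each
# punctuation character is its own token.
split_tokens <- function(text) {
  regmatches(text, gregexpr("[[:alnum:]]+|[[:punct:]]", text))[[1]]
}

# Tokens: every occurrence counts.
count_tokens <- function(text) length(split_tokens(text))

# Types: only distinct strings count.
count_types <- function(text) length(unique(split_tokens(text)))

txt <- "A word. Repeated repeated."
count_tokens(txt)           # all occurrences, punctuation included
count_types(txt)            # "Repeated" and "repeated" are distinct types
count_types(tolower(txt))   # lowercasing merges them into one type
```

Lowercasing leaves the token count unchanged but can reduce the type count, because strings that differed only in case become identical.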

Usage

ntoken(x, ...)

ntype(x, ...)

## S3 method for class 'corpus'
ntoken(x, ...)

## S3 method for class 'corpus'
ntype(x, ...)

## S3 method for class 'character'
ntoken(x, ...)

## S3 method for class 'tokenizedTexts'
ntoken(x, ...)

## S3 method for class 'character'
ntype(x, ...)

## S3 method for class 'dfm'
ntoken(x, ...)

## S3 method for class 'dfm'
ntype(x, ...)

## S3 method for class 'tokenizedTexts'
ntype(x, ...)

Arguments

x

texts, corpus, tokenized texts, or dfm whose tokens or types will be counted

...

additional arguments passed to tokenize

Value

count of the total tokens or types for each text or document

Note

Due to differences between raw text tokens and the features that have been defined for a dfm, the counts may be different for dfm objects and for the texts from which the dfm was generated. Because the method tokenizes the text in order to count the tokens, your results will depend on the options passed through to tokenize.
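To see why the counts depend on tokenization options, consider this minimal base-R sketch. The function `simple_tokenize` and its `remove_punct` argument are hypothetical stand-ins for illustration and are not the package's tokenize; the point is only that the same text yields different counts under different settings.

```r
# Hypothetical tokenizer for illustration; not the package's tokenize().
# With remove_punct = FALSE, each punctuation character is kept as a token.
simple_tokenize <- function(text, remove_punct = FALSE) {
  pattern <- if (remove_punct) "[[:alnum:]]+" else "[[:alnum:]]+|[[:punct:]]"
  regmatches(text, gregexpr(pattern, text))[[1]]
}

txt <- "This is a sentence, this."

with_punct    <- simple_tokenize(txt)
without_punct <- simple_tokenize(txt, remove_punct = TRUE)

length(with_punct)      # counts the comma and the full stop as tokens
length(without_punct)   # counts words only

# Types after lowercasing and dropping punctuation
length(unique(tolower(without_punct)))
```

The same reasoning applies to a dfm: once features have been selected or transformed, counts taken from the dfm can differ from counts taken from the raw text.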

Examples

# simple example
txt <- c(text1 = "This is a sentence, this.", text2 = "A word. Repeated repeated.")
ntoken(txt)
ntype(txt)
ntoken(toLower(txt))  # same
ntype(toLower(txt))   # fewer types
ntoken(toLower(txt), removePunct = TRUE)
ntype(toLower(txt), removePunct = TRUE)

# with some real texts
ntoken(subset(inaugCorpus, Year < 1806), removePunct = TRUE)
ntype(subset(inaugCorpus, Year < 1806), removePunct = TRUE)
ntoken(dfm(subset(inaugCorpus, Year<1800)))
ntype(dfm(subset(inaugCorpus, Year<1800)))
