ntoken: count the number of tokens or types

View source: R/nfunctions.R

Description

Get the count of tokens (total features) or types (unique tokens).

Usage

ntoken(x, ...)

ntype(x, ...)

Arguments

x

a character vector, or a quanteda corpus, tokens, or dfm object

...

additional arguments passed to tokens

Details

The precise definition of "tokens" for objects that are not yet tokenized (e.g. character or corpus objects) can be controlled through optional arguments passed to tokens via the ... argument.
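As a rough illustration of why tokenizer options change the counts, the following base R sketch counts tokens and types with and without punctuation. The helper simple_tokens is hypothetical and is only a regex stand-in for tokens; it is not quanteda's tokenizer, and its results need not match quanteda's on other inputs.

```r
# Hypothetical stand-in for a tokenizer (NOT quanteda's tokens()):
# a token is a run of alphanumerics or a single punctuation mark.
simple_tokens <- function(x, remove_punct = FALSE) {
  pattern <- if (remove_punct) "[[:alnum:]]+" else "[[:alnum:]]+|[[:punct:]]"
  out <- regmatches(x, gregexpr(pattern, x))
  names(out) <- names(x)
  out
}

# Count unique tokens (types) in each document
count_types <- function(toks) {
  vapply(toks, function(t) length(unique(t)), integer(1))
}

txt <- tolower(c(text1 = "This is a sentence, this.",
                 text2 = "A word. Repeated repeated."))

toks_all     <- simple_tokens(txt)
toks_nopunct <- simple_tokens(txt, remove_punct = TRUE)

n_tokens_all     <- lengths(toks_all)         # punctuation counted as tokens
n_tokens_nopunct <- lengths(toks_nopunct)     # punctuation dropped
n_types_all      <- count_types(toks_all)
n_types_nopunct  <- count_types(toks_nopunct)
```

Dropping punctuation lowers both the token and the type counts, which is the same kind of effect that remove_punct = TRUE has in the examples further down.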

For dfm objects, ntype will only return the count of features that occur more than zero times in the dfm.
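The counting rule for a document-feature matrix can be sketched on a plain base R matrix. This is an assumption-laden illustration, not quanteda's implementation: the matrix m and its row and column names are made up, and a real dfm is a sparse object rather than a dense matrix.

```r
# Toy document-feature counts as a plain matrix (a stand-in for a dfm).
# Feature "d" never occurs, and each document has one other zero-count feature.
m <- matrix(c(2, 1, 0, 0,
              0, 1, 3, 0),
            nrow = 2, byrow = TRUE,
            dimnames = list(docs = c("d1", "d2"),
                            features = c("a", "b", "c", "d")))

tokens_per_doc <- rowSums(m)      # total feature occurrences per document
types_per_doc  <- rowSums(m > 0)  # features occurring more than zero times
n_features     <- ncol(m)         # all columns, including zero-count ones
```

Here each document has 2 types even though the matrix has 4 feature columns, because features with a zero count do not contribute to the type count.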

Value

a named vector giving the total count of tokens or types, with one element per document

Note

Due to differences between raw text tokens and features that have been defined for a dfm, the counts may be different for dfm objects and the texts from which the dfm was generated. Because the method tokenizes the text in order to count the tokens, your results will depend on the options passed through to tokens.
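A minimal base R sketch of how counts can diverge once features have been selected: the tokens are split on spaces, and the vector drop is a hypothetical list of features removed before counting. Neither reflects how quanteda tokenizes or builds a dfm.

```r
txt <- "the cat sat on the mat"

# Raw text tokens, split on single spaces (a deliberately naive tokenizer)
raw_tokens <- strsplit(txt, " ", fixed = TRUE)[[1]]

# Hypothetical feature selection: drop some tokens before counting
drop <- c("the", "on")
feature_counts <- table(raw_tokens[!raw_tokens %in% drop])

n_tokens_text     <- length(raw_tokens)          # tokens in the raw text
n_types_text      <- length(unique(raw_tokens))  # types in the raw text
n_tokens_features <- sum(feature_counts)         # tokens left after selection
n_types_features  <- length(feature_counts)      # types left after selection
```

The raw text has 6 tokens and 5 types, while the selected features account for only 3 tokens and 3 types, so counts taken from the features no longer match those taken from the text.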

Examples

# simple example
txt <- c(text1 = "This is a sentence, this.", text2 = "A word. Repeated repeated.")
ntoken(txt)
ntype(txt)
ntoken(char_tolower(txt))  # same token counts: lowercasing does not add or remove tokens
ntype(char_tolower(txt))   # fewer types: case variants collapse into one type
ntoken(char_tolower(txt), remove_punct = TRUE)
ntype(char_tolower(txt), remove_punct = TRUE)

# with some real texts
ntoken(corpus_subset(data_corpus_inaugural, Year < 1806), remove_punct = TRUE)
ntype(corpus_subset(data_corpus_inaugural, Year < 1806), remove_punct = TRUE)
ntoken(dfm(corpus_subset(data_corpus_inaugural, Year < 1800)))
ntype(dfm(corpus_subset(data_corpus_inaugural, Year < 1800)))

Example output

quanteda version 0.9.9.65

text1 text2 
    7     6 
text1 text2 
    7     5 
text1 text2 
    7     6 
text1 text2 
    6     4 
text1 text2 
    5     4 
text1 text2 
    4     3 
1789-Washington 1793-Washington      1797-Adams  1801-Jefferson  1805-Jefferson 
           1430             135            2318            1726            2166 
1789-Washington 1793-Washington      1797-Adams  1801-Jefferson  1805-Jefferson 
            617              91             819             711             799 
1789-Washington 1793-Washington      1797-Adams 
           1540             147            2584 
1789-Washington 1793-Washington      1797-Adams 
            602              95             801 

quanteda documentation built on Aug. 16, 2017, 1:03 a.m.