Tokenize a set of texts and tabulate the term occurrence statistics.
x | a text vector to tokenize.
filter | if non-
ngrams | an integer vector of n-gram lengths to include, or
min_count | a numeric scalar giving the minimum term count to include in the output, or
max_count | a numeric scalar giving the maximum term count to include in the output, or
min_support | a numeric scalar giving the minimum term support to include in the output, or
max_support | a numeric scalar giving the maximum term support to include in the output, or
types | a logical value indicating whether to include columns for the types that make up the terms.
subset | logical expression indicating elements or rows to keep: missing values are taken as false.
... | additional properties to set on the text filter.
term_stats tokenizes a set of texts and computes the occurrence
counts and supports for each term. The ‘count’ is the number of
occurrences of the term across all texts; the ‘support’ is the
number of texts containing the term. Each appearance of a term
increments its count by one. Likewise, an appearance of a term in text
i increments its support once, not for each occurrence
in the text.
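The difference between the two statistics can be sketched in base R. This is only an illustration: the two example texts and the whitespace tokenizer are assumptions made here, and they do not reproduce the tokenizer that term_stats itself applies.

```r
# Minimal base-R sketch of 'count' vs. 'support'.
# Assumption: tokens are split on spaces, unlike the real text filter.
texts <- c("a rose is a rose", "a daisy is not a rose")

# One character vector of tokens per text.
tokens <- strsplit(tolower(texts), " +")

# count: every occurrence across all texts adds one.
count <- table(unlist(tokens))

# support: each text contributes at most one per term, so duplicates
# are removed within each text before tabulating.
support <- table(unlist(lapply(tokens, unique)))

stats <- data.frame(term = names(count),
                    count = as.integer(count),
                    support = as.integer(support[names(count)]),
                    stringsAsFactors = FALSE)
print(stats)
```

Here "rose" occurs three times in total but in only two texts, so its count is 3 and its support is 2.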
To include multi-type terms, specify the desired term lengths using
the ngrams argument.
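As a rough illustration of what a multi-type term is, the following base-R sketch builds every contiguous n-gram of the requested lengths from an already tokenized text. The helper ngram_terms is hypothetical and is not part of the package; the token vector is assumed to match how the example sentence is split.

```r
# Hypothetical helper: all contiguous n-grams of the given lengths,
# joined with single spaces.
ngram_terms <- function(tokens, ngrams = 1L) {
  out <- character(0)
  for (n in ngrams) {
    if (n > length(tokens)) next          # text too short for this length
    starts <- seq_len(length(tokens) - n + 1L)
    grams <- vapply(starts,
                    function(i) paste(tokens[i:(i + n - 1L)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  out
}

# Assumed tokenization of "A rose is a rose is a rose."
tokens <- c("a", "rose", "is", "a", "rose", "is", "a", "rose", ".")
counts <- table(ngram_terms(tokens, ngrams = 1:3))
print(counts)
```

With ngrams = 1:3 this yields 12 distinct terms, including "a rose" three times and "a rose is" twice.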
A data frame with columns named term, count, and
support, with one row for each appearing term. Rows are sorted
in descending order according to support and then count,
with ties broken lexicographically by term, using the
character ordering determined by the current locale
(see Comparison for details).
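The row ordering described above can be reproduced on any data frame with these three columns. The sketch below uses made-up values purely for illustration; order() compares character vectors using the current locale's collation.

```r
# Illustrative values only; not output from term_stats.
stats <- data.frame(
  term    = c("is", "rose", "a", "daisy", "not"),
  count   = c(2L, 3L, 4L, 1L, 1L),
  support = c(2L, 2L, 2L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Descending support, then descending count, then term as the tie-breaker.
ord <- order(-stats$support, -stats$count, stats$term)
sorted <- stats[ord, ]
rownames(sorted) <- NULL
print(sorted)
```

The terms "daisy" and "not" tie on both support and count, so they are ordered by term alone.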
If types = TRUE, then the result also includes columns named
type1, type2, etc. for the types that make up the
term.
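The layout of the type columns can be sketched in base R as follows. Splitting each term on a single space is an assumption made for illustration; it is not necessarily how the package derives the types.

```r
# Illustrative terms of one, two, and three types.
terms <- c("a", "a rose", "a rose is")

# Assumption: the types of a term are its space-separated pieces.
parts <- strsplit(terms, " ", fixed = TRUE)
width <- max(lengths(parts))

# Pad shorter terms with NA so every row has 'width' entries.
padded <- t(vapply(parts,
                   function(p) c(p, rep(NA_character_, width - length(p))),
                   character(width)))
colnames(padded) <- paste0("type", seq_len(width))

types <- data.frame(term = terms, padded, stringsAsFactors = FALSE)
print(types)
```

Terms shorter than the longest requested n-gram get NA in the unused type columns.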
term_stats("A rose is a rose is a rose.")
# drop symbols and English stop words
term_stats("A rose is a rose is a rose.",
           text_filter(drop_symbol = TRUE, drop = stopwords_en))
# unigrams, bigrams, and trigrams
term_stats("A rose is a rose is a rose.", ngrams = 1:3)
# also include the type information
term_stats("A rose is a rose is a rose.", ngrams = 1:3, types = TRUE)
  term count support
1 a        3       1
2 rose     3       1
3 is       2       1
4 .        1       1
  term count support
1 rose     3       1
2 .        1       1
   term      count support
1  a             3       1
2  a rose        3       1
3  rose          3       1
4  a rose is     2       1
5  is            2       1
6  is a          2       1
7  is a rose     2       1
8  rose is       2       1
9  rose is a     2       1
10 .             1       1
11 a rose .      1       1
12 rose .        1       1
   term      type1 type2 type3 count support
1  a         a     <NA>  <NA>      3       1
2  a rose    a     rose  <NA>      3       1
3  rose      rose  <NA>  <NA>      3       1
4  a rose is a     rose  is        2       1
5  is        is    <NA>  <NA>      2       1
6  is a      is    a     <NA>      2       1
7  is a rose is    a     rose      2       1
8  rose is   rose  is    <NA>      2       1
9  rose is a rose  is    a         2       1
10 .         .     <NA>  <NA>      1       1
11 a rose .  a     rose  .         1       1
12 rose .    rose  .     <NA>      1       1