Tokenize a set of texts and tabulate the term occurrence statistics.
x | a text vector to tokenize.
filter | if non-NULL, a text filter to use instead of the default text filter for x.
ngrams | an integer vector of n-gram lengths to include, or NULL for length-1 n-grams only.
min_count | a numeric scalar giving the minimum term count to include in the output, or NULL for no minimum count.
max_count | a numeric scalar giving the maximum term count to include in the output, or NULL for no maximum count.
min_support | a numeric scalar giving the minimum term support to include in the output, or NULL for no minimum support.
max_support | a numeric scalar giving the maximum term support to include in the output, or NULL for no maximum support.
types | a logical value indicating whether to include columns for the types that make up the terms.
subset | a logical expression indicating elements or rows to keep; missing values are taken as false.
... | additional properties to set on the text filter.
term_stats tokenizes a set of texts and computes the occurrence counts and supports for each term. The ‘count’ is the number of occurrences of the term across all texts; the ‘support’ is the number of texts containing the term. Each appearance of a term increments its count by one. Likewise, an appearance of a term in text i increments its support once, not for each occurrence in the text.
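The difference between count and support can be sketched in base R. This is only an illustration under simplifying assumptions: `count_and_support` is a hypothetical helper, and it splits on whitespace rather than using the package's tokenizer or text filter.

```r
# Hypothetical helper, not part of the package: tabulate count and support
# with a naive lowercase, whitespace-based tokenizer.
count_and_support <- function(texts) {
  # one character vector of tokens per text, empty strings dropped
  tokens <- strsplit(tolower(texts), "[[:space:]]+")
  tokens <- lapply(tokens, function(tok) tok[nzchar(tok)])
  # count: every occurrence across all texts adds one
  count <- table(unlist(tokens))
  # support: each text contributes at most once per term
  support <- table(unlist(lapply(tokens, unique)))
  terms <- names(count)
  data.frame(term = terms,
             count = as.integer(count[terms]),
             support = as.integer(support[terms]),
             stringsAsFactors = FALSE)
}

toy <- count_and_support(c("a rose is a rose", "a daisy"))
toy[toy$term == "a", ]     # count 3, support 2
toy[toy$term == "rose", ]  # count 2, support 1
```

Here "a" occurs three times but in only two texts, so its count is 3 and its support is 2.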
To include multi-type terms, specify the desired term lengths using the ngrams argument.
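As a rough picture of what a multi-type term is, the sketch below builds n-grams from a token vector in base R. `token_ngrams` is a hypothetical helper, and the tokens are written out by hand (lowercased, with the final period as its own token) rather than produced by the package.

```r
# Hypothetical helper: all n-grams of the requested lengths from one
# token vector, each joined with a single space.
token_ngrams <- function(tokens, ngrams = 1L) {
  out <- character()
  for (n in ngrams) {
    if (n > length(tokens)) next  # no n-grams longer than the text
    starts <- seq_len(length(tokens) - n + 1L)
    grams <- vapply(starts,
                    function(i) paste(tokens[i:(i + n - 1L)], collapse = " "),
                    character(1))
    out <- c(out, grams)
  }
  out
}

# Hand-written tokens for "A rose is a rose is a rose."
tokens <- c("a", "rose", "is", "a", "rose", "is", "a", "rose", ".")
bigram_counts <- table(token_ngrams(tokens, ngrams = 2L))
as.integer(bigram_counts["a rose"])  # 3
```

Passing `ngrams = 1:3` to the helper returns unigrams, bigrams, and trigrams together, which mirrors how the ngrams argument is described above.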
A data frame with columns named term, count, and support, with one row for each appearing term. Rows are sorted in descending order according to support and then count, with ties broken lexicographically by term, using the character ordering determined by the current locale (see Comparison for details).
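The row ordering described above can be reproduced on an ordinary data frame with base R's `order()`. This is a sketch of the sort rule only, not the package's implementation; the `stats` data frame is a hand-built stand-in using the values from the first example.

```r
# Hand-built stand-in for a term_stats result, deliberately unsorted.
stats <- data.frame(
  term = c("is", "rose", ".", "a"),
  count = c(2L, 3L, 1L, 3L),
  support = c(1L, 1L, 1L, 1L),
  stringsAsFactors = FALSE
)

# Descending support, then descending count, then term in locale order.
sorted <- stats[order(-stats$support, -stats$count, stats$term), ]
rownames(sorted) <- NULL
sorted$term  # "a" "rose" "is" "."
```

The terms "a" and "rose" tie on both support and count, so the lexicographic comparison on term decides their order.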
If types = TRUE, then the result also includes columns named type1, type2, etc. for the types that make up the term.
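To show the shape of the extra columns, the sketch below splits terms into type columns in base R. It assumes that types contain no spaces and that a term is its types joined by single spaces; that holds for the examples on this page but is an assumption, not a documented guarantee.

```r
# Assumed representation: a term is its types joined by single spaces.
terms <- c("a", "a rose", "a rose is")
parts <- strsplit(terms, " ", fixed = TRUE)
width <- max(lengths(parts))

# Pad shorter terms with NA so every row has `width` type columns.
types <- do.call(rbind, lapply(parts, function(p) {
  c(p, rep(NA_character_, width - length(p)))
}))
colnames(types) <- paste0("type", seq_len(width))

result <- data.frame(term = terms, types, stringsAsFactors = FALSE)
result
```

Shorter terms get NA in the unused type columns, as in the types = TRUE output in the examples.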
library(corpus)  # provides term_stats, text_filter, and stopwords_en

term_stats("A rose is a rose is a rose.")

# remove punctuation and English stop words
term_stats("A rose is a rose is a rose.",
           text_filter(drop_symbol = TRUE, drop = stopwords_en))

# unigrams, bigrams, and trigrams
term_stats("A rose is a rose is a rose.", ngrams = 1:3)

# also include the type information
term_stats("A rose is a rose is a rose.", ngrams = 1:3, types = TRUE)
  term count support
1 a        3       1
2 rose     3       1
3 is       2       1
4 .        1       1

  term count support
1 rose     3       1
2 .        1       1

   term      count support
1  a             3       1
2  a rose        3       1
3  rose          3       1
4  a rose is     2       1
5  is            2       1
6  is a          2       1
7  is a rose     2       1
8  rose is       2       1
9  rose is a     2       1
10 .             1       1
11 a rose .      1       1
12 rose .        1       1

   term      type1 type2 type3 count support
1  a         a     <NA>  <NA>      3       1
2  a rose    a     rose  <NA>      3       1
3  rose      rose  <NA>  <NA>      3       1
4  a rose is a     rose  is        2       1
5  is        is    <NA>  <NA>      2       1
6  is a      is    a     <NA>      2       1
7  is a rose is    a     rose      2       1
8  rose is   rose  is    <NA>      2       1
9  rose is a rose  is    a         2       1
10 .         .     <NA>  <NA>      1       1
11 a rose .  a     rose  .         1       1
12 rose .    rose  .     <NA>      1       1