Tokenize a set of texts and compute a term frequency matrix.
x: a text vector to tokenize.
filter: if non-NULL, [...]
ngrams: an integer vector of n-gram lengths to include, or NULL [...]
select: a character vector of terms to count, or NULL [...]
group: if non-NULL, [...]
transpose: a logical value indicating whether to transpose the result, putting terms as rows instead of columns.
...: additional properties to set on the text filter.
term_matrix tokenizes a set of texts and computes the occurrence counts for each term, returning the result as a sparse matrix (texts-by-terms). term_counts returns the same information, but in a data frame.
If ngrams is non-NULL, then multi-type n-grams are included in the output for all lengths appearing in the ngrams argument. If ngrams is NULL but select is non-NULL, then all n-grams appearing in the select set are included. If both ngrams and select are NULL, then only unigrams (single type terms) are included.
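As a rough illustration of what counting n-grams of several lengths means, the following base R sketch enumerates unigrams and bigrams from an already tokenized text and tabulates them. The helper ngrams_of and the token vector are hypothetical, and the sketch skips the tokenization and filtering that term_matrix performs; it is not the package's implementation.

```r
# Hypothetical, already-tokenized text (term_matrix tokenizes for you).
tokens <- c("a", "rose", "is", "a", "rose")

# Hypothetical helper: all contiguous n-grams of length n, joined by spaces.
ngrams_of <- function(tokens, n) {
  if (length(tokens) < n) return(character(0))
  starts <- seq_len(length(tokens) - n + 1)
  vapply(starts,
         function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Analogous to ngrams = 1:2: collect n-grams of every requested length.
terms <- unlist(lapply(1:2, ngrams_of, tokens = tokens))

# Occurrence counts for each term.
counts <- table(terms)
print(counts)
```

Restricting the tabulated terms to a fixed set afterwards would play the role of the select argument in this sketch.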
If group is NULL, then the output has one set of term counts for each input text. Otherwise, we convert group to a factor and compute one set of term counts for each level. Texts with NA values for group get skipped.
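The grouping behavior can be pictured as summing the per-text count rows within each factor level after dropping texts whose group is NA. The sketch below does this in base R on a small hand-made dense matrix; the matrix and its values are invented for illustration and stand in for per-text term counts, so this is an assumption-laden picture of the semantics rather than the package's code.

```r
# Invented per-text counts: one row per text, one column per term.
counts <- matrix(c(3, 0, 1,
                   1, 1, 0,
                   1, 0, 0,
                   2, 2, 2),
                 nrow = 4, byrow = TRUE,
                 dimnames = list(NULL, c("rose", "red", "sweet")))

# One group label per text; the last text has no group.
group <- c("Good", "Bad", "Good", NA)

# Texts with NA group are skipped.
keep <- !is.na(group)
g <- factor(group[keep])

# One row of summed counts per factor level.
grouped <- rowsum(counts[keep, , drop = FALSE], g)
print(grouped)
```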
term_matrix with transpose = FALSE returns a sparse matrix in "dgCMatrix" format with one column for each term and one row for each input text or (if group is non-NULL) for each grouping level. If filter$select is non-NULL, then the column names will be equal to filter$select. Otherwise, the columns are assigned in arbitrary order.
term_matrix with transpose = TRUE returns the transpose of the term matrix, in "dgCMatrix" format.
term_counts with group = NULL returns a data frame with one row for each entry of the term matrix, and columns "text", "term", and "count" giving the text ID, term, and count. The "term" column is a factor with levels equal to the selected terms. The "text" column is a factor with levels equal to names(as_corpus_text(x)); calling as.integer on the "text" column converts from the factor values to the integer row index in the term matrix.
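To make the relationship between the two return shapes concrete, the sketch below rebuilds a sparse matrix from a data frame shaped like the term_counts result, using as.integer on the factor columns as row and column indices. The data frame is hand-made (its values are invented and it only mimics the documented column layout), and using the term factor's integer codes as column indices is an assumption of this sketch. sparseMatrix comes from the Matrix package.

```r
library(Matrix)

# Hand-made stand-in for a term_counts result: one row per nonzero entry.
tc <- data.frame(
  text  = factor(c("a", "a", "b", "c"), levels = c("a", "b", "c")),
  term  = factor(c("one", "sentence", "another", "one"),
                 levels = c("one", "sentence", "another")),
  count = c(1, 1, 1, 2)
)

# as.integer() on a factor yields the level index, used here as the
# row (text) and column (term) position in the matrix.
m <- sparseMatrix(i = as.integer(tc$text),
                  j = as.integer(tc$term),
                  x = tc$count,
                  dims = c(nlevels(tc$text), nlevels(tc$term)),
                  dimnames = list(levels(tc$text), levels(tc$term)))
print(m)
```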
term_counts with group non-NULL behaves similarly, but the result instead has columns named "group", "term", and "count", with "group" giving the grouping level, as a factor.
text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!",
          "A rose by any other name would smell as sweet.")
term_matrix(text)
# select certain terms
term_matrix(text, select = c("rose", "red", "violet", "sweet"))
# specify a grouping factor
term_matrix(text, group = c("Good", "Bad", "Good"))
# include higher-order n-grams
term_matrix(text, ngrams = 1:3)
# select certain multi-type terms
term_matrix(text, select = c("a rose", "a violet", "sweet", "smell"))
# transpose the result
term_matrix(text, ngrams = 1:2, transpose = TRUE)[1:10, ] # first 10 rows
# data frame
head(term_counts(text), n = 10) # first 10 rows
# with grouping
term_counts(text, group = c("Good", "Bad", "Good"))
# taking names from the input
term_counts(c(a = "One sentence.", b = "Another", c = "!!"))