Tokenize a set of texts and compute a term frequency matrix.
x 
a text vector to tokenize. 
filter 
if non-NULL, a text filter to use when tokenizing x.
ngrams 
an integer vector of n-gram lengths to include, or NULL; see Details.
select 
a character vector of terms to count, or NULL; see Details.
group 
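if non-NULL, a grouping variable for the texts; it is converted to a factor and term counts are computed for each level (see Details).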
transpose 
a logical value indicating whether to transpose the result, putting terms as rows instead of columns. 
... 
additional properties to set on the text filter. 
term_matrix tokenizes a set of texts and computes the occurrence counts for each term, returning the result as a sparse matrix (texts-by-terms). term_counts returns the same information, but in a data frame.
If ngrams is non-NULL, then multi-type n-grams are included in the output for all lengths appearing in the ngrams argument. If ngrams is NULL but select is non-NULL, then all n-grams appearing in the select set are included. If both ngrams and select are NULL, then only unigrams (single-type terms) are included.
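To make the n-gram rule concrete, here is a minimal base R sketch of enumerating terms of several lengths from one token sequence. The helper ngrams_of is hypothetical and is not part of the package; joining types with a single space is an assumption that mirrors how multi-type terms such as "a rose" are written in the select examples.

```r
# Hypothetical helper (not the package's implementation): build all n-grams
# of the requested lengths from a vector of already-tokenized types.
ngrams_of <- function(tokens, lengths) {
  out <- character(0)
  for (n in lengths) {
    if (n > length(tokens)) next          # no n-gram of this length fits
    starts <- seq_len(length(tokens) - n + 1)
    grams <- vapply(
      starts,
      function(i) paste(tokens[i:(i + n - 1)], collapse = " "),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

tokens <- c("a", "rose", "is", "a", "rose")
terms <- ngrams_of(tokens, 1:2)   # unigrams and bigrams, like ngrams = 1:2
term_freq <- table(terms)         # occurrence count for each term
```

Here the bigram "a rose" and the unigram "rose" each occur twice, which is the kind of per-term count that one row of the term matrix would record.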
If group is NULL, then the output has one set of term counts for each input text. Otherwise, group is converted to a factor and one set of term counts is computed for each level. Texts with NA values for group are skipped.
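The grouping behavior can be sketched in base R. This is only an illustration under stated assumptions: the strsplit call is a toy stand-in for the real tokenizer (which is governed by the text filter), and the variable names are made up for the example.

```r
# Toy input: four texts, one of which has a missing group label.
texts <- c("a rose is a rose", "a violet is blue",
           "ignored text", "a rose by any name")
group <- c("Good", "Bad", NA, "Good")

g <- as.factor(group)      # levels are sorted: "Bad", "Good"
keep <- !is.na(g)          # texts with an NA group are dropped

# Stand-in tokenizer (assumption): lowercase, split on non-letters.
tokens <- strsplit(tolower(texts[keep]), "[^a-z]+")

# Repeat each text's group label once per token, then cross-tabulate.
grp <- rep(g[keep], lengths(tokens))
counts <- table(group = grp, term = unlist(tokens))
```

The result has one row per factor level rather than one row per text: the two "Good" texts are pooled, and no term from the NA-group text appears.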
term_matrix with transpose = FALSE returns a sparse matrix in "dgCMatrix" format with one column for each term and one row for each input text or (if group is non-NULL) for each grouping level. If filter$select is non-NULL, then the column names will be equal to filter$select. Otherwise, the columns are assigned in arbitrary order.
term_matrix with transpose = TRUE returns the transpose of the term matrix, in "dgCMatrix" format.
term_counts with group = NULL returns a data frame with one row for each entry of the term matrix, and columns "text", "term", and "count" giving the text ID, term, and count. The "term" column is a factor with levels equal to the selected terms. The "text" column is a factor with levels equal to names(as_corpus_text(x)); calling as.integer on the "text" column converts from the factor values to the integer row index in the term matrix.
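The factor-to-index relationship can be illustrated by rebuilding a sparse matrix from a data frame of this shape. The data frame below is written by hand for the sketch (it is not actual term_counts output), and the reconstruction uses sparseMatrix from the Matrix package.

```r
library(Matrix)

# Hand-written stand-in with the documented columns; text "c" has no entries.
counts <- data.frame(
  text  = factor(c("a", "a", "b"), levels = c("a", "b", "c")),
  term  = factor(c("one", "sentence", "another"),
                 levels = c("another", "one", "sentence")),
  count = c(1, 1, 1)
)

# as.integer() on a factor yields the level index, which serves as the
# row (text) or column (term) position in the matrix.
m <- sparseMatrix(
  i = as.integer(counts$text),
  j = as.integer(counts$term),
  x = counts$count,
  dims = c(nlevels(counts$text), nlevels(counts$term)),
  dimnames = list(levels(counts$text), levels(counts$term))
)
```

Because the dimensions come from the factor levels rather than from the rows present, a text with no counted terms still gets an all-zero row.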
term_counts with group non-NULL behaves similarly, but the result instead has columns named "group", "term", and "count", with "group" giving the grouping level, as a factor.
library(corpus)

text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!",
          "A rose by any other name would smell as sweet.")
term_matrix(text)
# select certain terms
term_matrix(text, select = c("rose", "red", "violet", "sweet"))
# specify a grouping factor
term_matrix(text, group = c("Good", "Bad", "Good"))
# include higher-order n-grams
term_matrix(text, ngrams = 1:3)
# select certain multi-type terms
term_matrix(text, select = c("a rose", "a violet", "sweet", "smell"))
# transpose the result
term_matrix(text, ngrams = 1:2, transpose = TRUE)[1:10, ] # first 10 rows
# data frame
head(term_counts(text), n = 10) # first 10 rows
# with grouping
term_counts(text, group = c("Good", "Bad", "Good"))
# taking names from the input
term_counts(c(a = "One sentence.", b = "Another", c = "!!"))
