Tokenize a set of texts and compute a term frequency matrix.
x: a text vector to tokenize.
ngrams: an integer vector of n-gram lengths to include, or NULL.
select: a character vector of terms to count, or NULL to count all terms.
transpose: a logical value indicating whether to transpose the result, putting terms as rows instead of columns.
...: additional properties to set on the text filter.
term_matrix tokenizes a set of texts and computes the occurrence counts for each term, returning the result as a sparse matrix. term_counts returns the same information, but in a data frame.
If ngrams is non-NULL, then multi-type n-grams are included in the output for all lengths appearing in the ngrams argument. If ngrams is NULL but select is non-NULL, then all n-grams appearing in the select set are included. If both ngrams and select are NULL, then only unigrams (single type terms) are included.
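The inclusion rule above can be read as a small decision procedure. The sketch below uses a hypothetical base R helper, included_lengths(), which is not part of the package. It assumes that the types inside a multi-type term are separated by single spaces, as in "a rose".

```r
# Hypothetical helper (not a package function): which n-gram lengths
# end up in the output for given `ngrams` and `select` arguments.
included_lengths <- function(ngrams = NULL, select = NULL) {
  if (!is.null(ngrams)) {
    # Explicit lengths take precedence.
    return(sort(unique(as.integer(ngrams))))
  }
  if (!is.null(select)) {
    # Assumption: types within a term are separated by single spaces.
    return(sort(unique(lengths(strsplit(select, " ", fixed = TRUE)))))
  }
  # Neither argument given: unigrams only.
  1L
}

included_lengths(ngrams = 1:3)
included_lengths(select = c("a rose", "a violet", "sweet", "smell"))
included_lengths()
```

The helper only reports lengths; the actual tokenization and counting are left to term_matrix.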
If group is NULL, then the output has one set of term counts for each input text. Otherwise, we convert group to a factor and compute one set of term counts for each level. Texts with NA values for group get skipped.
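The grouping behavior can be imitated in base R. The sketch below starts from a hand-built matrix of per-text counts (hypothetical values, not package output) and assumes that the counts for a level are the sums of the counts of its texts.

```r
# Hypothetical per-text counts for four texts and two terms.
per_text <- matrix(c(3, 0,
                     1, 1,
                     0, 2,
                     1, 0),
                   ncol = 2, byrow = TRUE,
                   dimnames = list(NULL, c("rose", "red")))

group <- c("Good", NA, "Good", "Bad")
g <- factor(group)   # levels are "Bad" and "Good"; NA is not a level
keep <- !is.na(g)    # texts with an NA group are skipped

# One row of summed counts per factor level.
per_group <- rowsum(per_text[keep, , drop = FALSE], g[keep])
per_group
```

The second text contributes to neither row, because its group is NA.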
term_matrix with transpose = FALSE returns a sparse matrix in "dgCMatrix" format with one column for each term and one row for each input text or (if group is non-NULL) for each grouping level. If filter$select is non-NULL, then the column names will be equal to filter$select. Otherwise, the columns are assigned in arbitrary order.
term_matrix with transpose = TRUE returns the transpose of the term matrix, in "dgCMatrix" format.
term_counts with group = NULL returns a data frame with one row for each entry of the term matrix, and columns "text", "term", and "count" giving the text ID, term, and count. The "term" column is a factor with levels equal to the selected terms. The "text" column is a factor with levels equal to the names of the input texts; calling as.integer on the "text" column converts from the factor values to the integer row index in the term matrix.
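To illustrate the relationship between the data frame and the matrix, the following sketch rebuilds a sparse matrix from a hand-built stand-in for term_counts output (hypothetical values). It uses sparseMatrix from the Matrix package, and it assumes that the "term" factor codes can serve as column indices in the same way that the "text" codes serve as row indices.

```r
library(Matrix)  # recommended package shipped with R

# Hand-built stand-in for term_counts() output (hypothetical values).
counts <- data.frame(
  text  = factor(c("a", "a", "b", "c"), levels = c("a", "b", "c")),
  term  = factor(c("rose", "is", "rose", "sweet"),
                 levels = c("rose", "is", "sweet")),
  count = c(3, 2, 1, 1)
)

# as.integer() on a factor yields its level codes, used here as indices.
m <- sparseMatrix(
  i = as.integer(counts$text),
  j = as.integer(counts$term),
  x = counts$count,
  dims = c(nlevels(counts$text), nlevels(counts$term)),
  dimnames = list(levels(counts$text), levels(counts$term))
)
m
```

Entries that do not appear in the data frame stay as implicit zeros in the sparse matrix.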
term_counts with group non-NULL behaves similarly, but the result instead has columns named "group", "term", and "count", with "group" giving the grouping level, as a factor.
library(corpus)  # assumed source package for term_matrix() and term_counts()

text <- c("A rose is a rose is a rose.",
          "A Rose is red, a violet is blue!",
          "A rose by any other name would smell as sweet.")
term_matrix(text)

# select certain terms
term_matrix(text, select = c("rose", "red", "violet", "sweet"))

# specify a grouping factor
term_matrix(text, group = c("Good", "Bad", "Good"))

# include higher-order n-grams
term_matrix(text, ngrams = 1:3)

# select certain multi-type terms
term_matrix(text, select = c("a rose", "a violet", "sweet", "smell"))

# transpose the result
term_matrix(text, ngrams = 1:2, transpose = TRUE)[1:10, ] # first 10 rows

# data frame
head(term_counts(text), n = 10) # first 10 rows

# with grouping
term_counts(text, group = c("Good", "Bad", "Good"))

# taking names from the input
term_counts(c(a = "One sentence.", b = "Another", c = "!!"))