ngram-tokenizers | R Documentation |
These functions tokenize their inputs into different kinds of n-grams. The input can be a character vector of any length, or a list of character vectors where each character vector in the list has a length of 1. See details for an explanation of what each function does.
tokenize_ngrams(
x,
lowercase = TRUE,
n = 3L,
n_min = n,
stopwords = character(),
ngram_delim = " ",
simplify = FALSE
)
tokenize_skip_ngrams(
x,
lowercase = TRUE,
n_min = 1,
n = 3,
k = 1,
stopwords = character(),
simplify = FALSE
)
x |
A character vector or a list of character vectors to be tokenized
into n-grams. If |
lowercase |
Should the tokens be made lower case? |
n |
The number of words in the n-gram. This must be an integer greater than or equal to 1. |
n_min |
The minimum number of words in the n-gram. This must be an
integer greater than or equal to 1, and less than or equal to n. |
stopwords |
A character vector of stop words to be excluded from the n-grams. |
ngram_delim |
The separator between words in an n-gram. |
simplify |
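If TRUE and only a single element was passed as input, the output is a character vector of tokens rather than a list. |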
k |
For the skip n-gram tokenizer, the maximum skip distance between
words. The function will compute all skip n-grams between 0 and k. |
tokenize_ngrams: Basic shingled n-grams. A shingled n-gram is a contiguous subsequence of n words. This function computes shingled n-grams for every size between n_min (which must be at least 1) and n.
tokenize_skip_ngrams: Skip n-grams. A skip n-gram is a subsequence of n words with a gap of at most k words between them. The skip n-grams are calculated for all skip values from 0 to k.
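The two definitions above can be sketched in base R. This is only an illustration of the prose, not the package's implementation: `shingle_ngrams()` and `skip_ngrams()` are hypothetical helpers, they expect a vector of words that has already been split and cleaned, and the order of their output may differ from what `tokenize_ngrams()` and `tokenize_skip_ngrams()` return.

```r
# Hypothetical helper (not part of the package): contiguous n-grams for
# every size from n_min to n, grouped by size.
shingle_ngrams <- function(words, n = 3L, n_min = n, delim = " ") {
  stopifnot(n_min >= 1, n_min <= n)
  out <- character()
  for (size in seq(n_min, n)) {
    if (length(words) < size) next  # not enough words for this size
    starts <- seq_len(length(words) - size + 1L)
    grams <- vapply(
      starts,
      function(i) paste(words[i:(i + size - 1L)], collapse = delim),
      character(1)
    )
    out <- c(out, grams)
  }
  out
}

# Hypothetical helper (not part of the package): every subsequence of n
# words in which consecutive words are separated by 0 to k skipped words.
skip_ngrams <- function(words, n = 3L, k = 1L, delim = " ") {
  len <- length(words)
  out <- character()
  # Recursively extend a vector of word positions until it holds n indices.
  extend <- function(idx) {
    if (length(idx) == n) {
      out <<- c(out, paste(words[idx], collapse = delim))
      return(invisible(NULL))
    }
    last <- idx[length(idx)]
    for (gap in 0:k) {
      nxt <- last + gap + 1L
      if (nxt <= len) extend(c(idx, nxt))
    }
  }
  for (start in seq_len(len)) extend(start)
  out
}

shingle_ngrams(c("how", "many", "roads", "must"), n = 3, n_min = 2)
skip_ngrams(c("a", "b", "c", "d"), n = 2, k = 1)
```

With k = 1 and n = 2, the second call pairs each word with its immediate neighbour (skip 0) and with the word after that (skip 1), which is the sense in which the skip values run from 0 to k.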
These functions will strip all punctuation and normalize all whitespace to a single space character.
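As a rough approximation of that clean-up step, the base R sketch below removes punctuation and collapses runs of whitespace. `normalize_text()` is a hypothetical helper written for this illustration; the package's own word tokenizer may treat some characters (for example apostrophes inside words) differently.

```r
# Hypothetical helper (not part of the package): approximate the
# punctuation stripping and whitespace normalization described above.
normalize_text <- function(x) {
  x <- gsub("[[:punct:]]", "", x)   # drop punctuation characters
  x <- gsub("[[:space:]]+", " ", x) # collapse spaces, tabs and newlines
  trimws(x)                         # remove leading/trailing whitespace
}

normalize_text("How many roads,   must a man\nwalk down?")
```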
A list of character vectors containing the tokens, with one element
in the list for each element that was passed as input. If simplify = TRUE
and only a single element was passed as input, then the output is a
character vector of tokens.
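The return-value rule can be expressed in a few lines of base R. `maybe_simplify()` is a hypothetical name used only for this sketch; it assumes `tokens` is the list that would otherwise be returned.

```r
# Hypothetical helper (not part of the package): unwrap the list only when
# simplify is TRUE and the input had exactly one element.
maybe_simplify <- function(tokens, simplify = FALSE) {
  if (simplify && length(tokens) == 1L) tokens[[1L]] else tokens
}

one_doc  <- list(c("how many", "many roads"))
two_docs <- list(c("how many"), c("the answer"))

maybe_simplify(one_doc, simplify = TRUE)   # character vector
maybe_simplify(two_docs, simplify = TRUE)  # still a list of length 2
maybe_simplify(one_doc, simplify = FALSE)  # still a list of length 1
```

Keeping `simplify = FALSE` as the default means the result has the same shape whether one document or many are passed in.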
song <- paste0("How many roads must a man walk down\n",
"Before you call him a man?\n",
"How many seas must a white dove sail\n",
"Before she sleeps in the sand?\n",
"\n",
"How many times must the cannonballs fly\n",
"Before they're forever banned?\n",
"The answer, my friend, is blowin' in the wind.\n",
"The answer is blowin' in the wind.\n")
tokenize_ngrams(song, n = 4)
tokenize_ngrams(song, n = 4, n_min = 1)
tokenize_skip_ngrams(song, n = 4, k = 2)