View source: R/tokens_ngrams.R
tokens_ngrams | R Documentation |
Create a set of n-grams (tokens in sequence) from already tokenized text objects, with an optional skip argument to form skip-grams. Both the n-gram length and the skip lengths take vectors of arguments to form multiple lengths or skips in one pass. Implemented in C++ for efficiency.
tokens_ngrams(x, n = 2L, skip = 0L, concatenator = concat(x))
char_ngrams(x, n = 2L, skip = 0L, concatenator = "_")
tokens_skipgrams(x, n, skip, concatenator = concat(x))
x |
a tokens object, or a character vector, or a list of characters |
n |
integer vector specifying the number of elements to be concatenated
in each n-gram. Each element of this vector will define a |
skip |
integer vector specifying the adjacency skip size for tokens
forming the n-grams, default is 0 for only immediately neighbouring words.
For |
concatenator |
character for combining words, default is |
Normally, these functions will be called through tokens(x, ngrams = , ...), but they are provided in case a user wants to perform lower-level n-gram construction on tokenized texts.
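To make the lower-level construction concrete, the following is a minimal base R sketch of how contiguous n-grams of several lengths can be formed from one tokenized text. The helper name ngrams_sketch is hypothetical and not part of quanteda, and the output ordering (all n-grams of the first length, then the next) is a choice made for this sketch rather than a statement about the C++ implementation.

```r
# Hypothetical helper (not part of quanteda): contiguous n-grams for
# one character vector of tokens, for each length given in `n`.
ngrams_sketch <- function(tokens, n = 2L, concatenator = "_") {
  unlist(lapply(n, function(len) {
    # No n-gram of this length fits into a shorter text
    if (length(tokens) < len) return(character(0))
    starts <- seq_len(length(tokens) - len + 1L)
    vapply(starts, function(i) {
      # Join `len` adjacent tokens starting at position i
      paste(tokens[i:(i + len - 1L)], collapse = concatenator)
    }, character(1))
  }))
}

grams <- ngrams_sketch(c("a", "b", "c", "d", "e"), n = 2:3)
print(grams)
```

For a five-token text this yields four bigrams followed by three trigrams, joined with the "_" concatenator.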
tokens_skipgrams() is a wrapper to tokens_ngrams() that requires arguments to be supplied for both n and skip. For k-skip skip-grams, set skip to 0:k, in order to conform to the definition of skip-grams found in Guthrie et al. (2006): a k skip-gram is an n-gram which is a superset of all n-grams and each (k-i) skip-gram until (k-i) == 0 (which includes 0 skip-grams).
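The superset property in this definition can be illustrated for the bigram case with a small base R sketch. The helper skip_bigrams_sketch is hypothetical and not part of quanteda; it handles only n = 2, where each skip value k simply pairs a token with the token k + 1 positions later, and it makes no claim about how the package treats longer n-grams.

```r
# Hypothetical helper (not part of quanteda): bigrams formed with each
# skip value in `skip`, for one character vector of tokens.
skip_bigrams_sketch <- function(tokens, skip = 0L, concatenator = "_") {
  out <- character(0)
  for (i in seq_along(tokens)) {
    for (k in skip) {
      j <- i + k + 1L  # partner position after skipping k tokens
      if (j <= length(tokens)) {
        out <- c(out, paste(tokens[i], tokens[j], sep = concatenator))
      }
    }
  }
  out
}

toks <- c("insurgents", "killed", "in", "ongoing", "fighting")
zero_skip <- skip_bigrams_sketch(toks, skip = 0L, concatenator = " ")
two_skip <- skip_bigrams_sketch(toks, skip = 0:2, concatenator = " ")

# Every ordinary (0-skip) bigram is also a 2-skip bigram
print(all(zero_skip %in% two_skip))
print(length(two_skip))
```

With skip = 0:2 the five-token text gives four adjacent bigrams, three 1-skip bigrams, and two 2-skip bigrams, so the 2-skip set contains the ordinary bigrams as a subset.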
a tokens object consisting of a list of character vectors of n-grams, one list element per text, or a character vector if called on a simple character vector
char_ngrams is a convenience wrapper for a (non-list) vector of characters, so named to be consistent with quanteda's naming scheme.
Guthrie, David, Ben Allison, Wei Liu, Louise Guthrie, and Yorick Wilks. 2006.
"A Closer Look at Skip-Gram Modelling." https://aclanthology.org/L06-1210/
# ngrams
tokens_ngrams(tokens(c("a b c d e", "c d e f g")), n = 2:3)
toks <- tokens(c(text1 = "the quick brown fox jumped over the lazy dog"))
tokens_ngrams(toks, n = 1:3)
tokens_ngrams(toks, n = c(2,4), concatenator = " ")
tokens_ngrams(toks, n = c(2,4), skip = 1, concatenator = " ")
# skipgrams
toks <- tokens("insurgents killed in ongoing fighting")
tokens_skipgrams(toks, n = 2, skip = 0:1, concatenator = " ")
tokens_skipgrams(toks, n = 2, skip = 0:2, concatenator = " ")
tokens_skipgrams(toks, n = 3, skip = 0:2, concatenator = " ")