Create ngrams and skipgrams

Description

Create a set of ngrams (tokens in sequence) from character vectors or tokenized text objects, with an optional skip argument to form skipgrams. Both the ngram length and the skip lengths take vectors of arguments to form multiple lengths or skips in one pass. ngrams() is implemented in C++ for efficiency.

Usage

ngrams(x, ...)

## S3 method for class 'character'
ngrams(x, n = 2L, skip = 0L, concatenator = "_", ...)

## S3 method for class 'tokenizedTexts'
ngrams(x, n = 2L, skip = 0L, concatenator = "_",
  ...)

skipgrams(x, ...)

## S3 method for class 'character'
skipgrams(x, n, skip, concatenator = "_", ...)

## S3 method for class 'tokenizedTexts'
skipgrams(x, n, skip, concatenator = "_", ...)

Arguments

x

a tokenizedTexts object or a character vector of tokens

...

not used

n

integer vector specifying the number of elements to be concatenated in each ngram

skip

integer vector specifying the adjacency skip size for tokens forming the ngrams; the default is 0, which uses only immediately neighbouring words. For skipgrams, skip can be a vector of integers, as the "classic" approach to forming skip-grams is to set skip = k, where k is the distance for which k or fewer skips are used to construct the n-gram. Thus a "4-skip-n-gram" defined as skip = 0:4 produces results that include 4 skips, 3 skips, 2 skips, 1 skip, and 0 skips (where 0 skips are typical n-grams formed from adjacent words). See Guthrie et al. (2006).

concatenator

character used to join the words in each ngram; the default is the _ (underscore) character
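
As a rough illustration of how a vector-valued n and the concatenator interact, the following base R sketch builds plain (skip = 0) ngrams for several lengths in one call. The helper name simple_ngrams is hypothetical and is not part of the package; it only mirrors the argument names documented above and makes no claim about how the C++ implementation works or orders its output.

```r
# Hypothetical helper (not part of the package): plain n-grams in base R.
# Assumes `tokens` is a character vector and `n` a vector of positive integers.
simple_ngrams <- function(tokens, n = 2L, concatenator = "_") {
  unlist(lapply(n, function(size) {
    # No n-gram of this length fits in the token vector
    if (length(tokens) < size) return(character(0))
    # Every position at which a window of `size` tokens can start
    starts <- seq_len(length(tokens) - size + 1L)
    vapply(starts, function(i) {
      paste(tokens[i:(i + size - 1L)], collapse = concatenator)
    }, character(1))
  }))
}

# Unigrams followed by bigrams
simple_ngrams(c("a", "b", "c"), n = 1:2)
```

In this sketch the results are grouped by ngram length, in the order the lengths appear in n.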

Details

Normally, ngrams will be called through tokenize, but these functions are also exported in case a user wants to perform lower-level ngram construction on tokenized texts.

skipgrams is a wrapper to ngrams that requires arguments to be supplied for both n and skip. For k-skip skipgrams, set skip to 0:k to conform to the definition of skip-grams found in Guthrie et al. (2006): a k-skip-gram is an ngram which is a superset of all ngrams and each (k-i)-skip-gram until (k-i) == 0 (which includes 0-skip-grams).
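
The effect of setting skip to 0:k can be sketched in base R for the bigram case (n = 2) only. The helper skip_bigrams below is hypothetical and not part of the package; it assumes a character vector of tokens and simply pairs each token with the token skip + 1 positions later, for every value in skip. The ordering of results here (grouped by skip value) is a choice made for this sketch and may differ from what ngrams() or skipgrams() return.

```r
# Hypothetical helper (not part of the package): k-skip bigrams in base R.
skip_bigrams <- function(tokens, skip = 0L, concatenator = "_") {
  out <- character(0)
  for (s in skip) {
    gap <- s + 1L                     # distance between the two token positions
    if (length(tokens) <= gap) next   # too few tokens for this skip value
    left  <- seq_len(length(tokens) - gap)
    out <- c(out, paste(tokens[left], tokens[left + gap], sep = concatenator))
  }
  out
}

tokens <- c("insurgents", "killed", "in", "ongoing", "fighting")
# 1-skip bigrams: adjacent pairs (skip = 0) plus pairs one token apart (skip = 1)
skip_bigrams(tokens, skip = 0:1, concatenator = " ")
```

For these five tokens, skip = 0:1 gives four adjacent pairs plus three pairs that skip one token, which is the superset behaviour described above.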

Value

a tokenizedTexts object consisting of a list of character vectors of ngrams, one list element per text, or a character vector if called on a simple character vector

Author(s)

Kohei Watanabe and Ken Benoit

References

Guthrie, D., B. Allison, W. Liu, and L. Guthrie. 2006. "A Closer Look at Skip-Gram Modelling."

Examples

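# ngrams
ngrams(LETTERS, n = 2)                   # bigrams of adjacent letters
ngrams(LETTERS, n = 2, skip = 1)         # bigrams that skip one letter
ngrams(LETTERS, n = 2, skip = 0:1)       # both of the above in one pass
ngrams(LETTERS, n = 1:2)                 # unigrams and bigrams
ngrams(LETTERS, n = c(2,3), skip = 0:1)  # bigrams and trigrams, with 0 or 1 skips

# tokenize a sentence into a simple character vector of tokens
tokens <- tokenize("the quick brown fox jumped over the lazy dog.",
                   removePunct = TRUE, simplify = TRUE)
ngrams(tokens, n = 1:3)
# join tokens with a space instead of the default underscore
ngrams(tokens, n = c(2,4), concatenator = " ")
ngrams(tokens, n = c(2,4), skip = 1, concatenator = " ")

# skipgrams: both n and skip must be supplied
tokens <- tokenize(toLower("Insurgents killed in ongoing fighting."),
                   removePunct = TRUE, simplify = TRUE)
skipgrams(tokens, n = 2, skip = 0:1, concatenator = " ")  # 1-skip bigrams
skipgrams(tokens, n = 2, skip = 0:2, concatenator = " ")  # 2-skip bigrams
skipgrams(tokens, n = 3, skip = 0:2, concatenator = " ")  # 2-skip trigrams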
