term_indices: Term Indices: Convert text to integer indices


Description

Convert the terms of a text corpus into integer indices by looking them up in a vocabulary.

Usage

tix_seq(corpus, vocab, keep_unknown = nbuckets > 0,
  nbuckets = attr(vocab, "nbuckets"), reverse = FALSE)

tix_df(corpus, vocab, keep_unknown = nbuckets > 0,
  nbuckets = attr(vocab, "nbuckets"), reverse = FALSE,
  as_factor = FALSE)

tix_mat(corpus, vocab, maxlen = 100, pad_right = TRUE,
  trunc_right = TRUE, keep_unknown = nbuckets > 0,
  nbuckets = attr(vocab, "nbuckets"), reverse = FALSE)

Arguments

corpus

text corpus; see vocab().

vocab

data frame produced by vocab() or update_vocab()

keep_unknown

logical. If TRUE, unknown terms are preserved in the output sequences; otherwise they are dropped. When nbuckets == 0, unknowns are indexed with 0. Defaults to TRUE when nbuckets > 0.

nbuckets

integer. Number of buckets to hash unknowns into. Defaults to the "nbuckets" attribute of vocab.

reverse

logical. Should each sequence be reversed in the final output? Reversal happens after pad_right and trunc_right have been applied to the original text sequence. Default FALSE.

as_factor

logical. If TRUE, the returned index column is a factor instead of an integer vector. Throws an error when keep_unknown is TRUE and nbuckets == 0. Default FALSE.

maxlen

integer. Maximum length of each sequence.

pad_right

logical. Should sequences shorter than maxlen be padded with 0 on the right? Default TRUE.

trunc_right

logical. Should sequences longer than maxlen be truncated on the right? Default TRUE.
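
The interplay of keep_unknown, maxlen, pad_right and trunc_right can be pictured with a small base R sketch. It does not call mlvocab and is not its implementation: lookup() and pad_trunc() are hypothetical helpers written only for illustration, the vocabulary is a plain character vector, and bucket hashing (nbuckets > 0) is left out.

```r
# Hypothetical helper (not part of mlvocab): map tokens to integer indices.
# `terms` is a character vector of known terms; a term's position is its index.
lookup <- function(tokens, terms, keep_unknown = FALSE) {
  ix <- match(tokens, terms)        # NA for tokens that are not in `terms`
  if (keep_unknown) {
    ix[is.na(ix)] <- 0L             # unknowns are kept and indexed with 0
  } else {
    ix <- ix[!is.na(ix)]            # unknowns are dropped
  }
  ix
}

# Hypothetical helper: force a sequence to exactly `maxlen` elements.
pad_trunc <- function(ix, maxlen, pad_right = TRUE, trunc_right = TRUE) {
  n <- length(ix)
  if (n > maxlen) {
    # trunc_right = TRUE drops the tail; FALSE drops the head
    ix <- if (trunc_right) head(ix, maxlen) else tail(ix, maxlen)
  } else if (n < maxlen) {
    zeros <- integer(maxlen - n)    # 0-padding
    ix <- if (pad_right) c(ix, zeros) else c(zeros, ix)
  }
  ix
}

terms  <- c("the", "quick", "fox")
tokens <- c("The", "quick", "fox")  # "The" is unknown (matching is case-sensitive)

dropped      <- lookup(tokens, terms)                        # 2 3
kept         <- lookup(tokens, terms, keep_unknown = TRUE)   # 0 2 3
padded_right <- pad_trunc(kept, 5)                           # 0 2 3 0 0
padded_left  <- pad_trunc(kept, 5, pad_right = FALSE)        # 0 0 0 2 3
cut_right    <- pad_trunc(kept, 2)                           # 0 2
cut_left     <- pad_trunc(kept, 2, trunc_right = FALSE)      # 2 3
```

With nbuckets == 0, a preserved unknown and a padding position both appear as 0 in the sketch, which is worth keeping in mind when reading tix_mat() output.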

Value

tix_seq() returns a list of integer vectors; tix_df() returns a flat index data frame with two columns; tix_mat() returns an integer matrix with one row per sequence.

Examples

corpus <- list(a = c("The", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"),
               b = c("the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog",
                     "the", "quick", "brown", "fox", "jumps", "over", "the", "lazy", "dog"))
v <- vocab(corpus["b"]) # "The" is unknown
v

tix_seq(corpus, v)
tix_seq(corpus, v, keep_unknown = TRUE)
tix_seq(corpus, v, nbuckets = 1)
tix_seq(corpus, v, nbuckets = 3)

tix_mat(corpus, v, maxlen = 12)
tix_mat(corpus, v, maxlen = 12, keep_unknown = TRUE)
tix_mat(corpus, v, maxlen = 12, nbuckets = 1)
tix_mat(corpus, v, maxlen = 12, nbuckets = 1, reverse = TRUE)
tix_mat(corpus, v, maxlen = 12, pad_right = FALSE, nbuckets = 1)
tix_mat(corpus, v, maxlen = 12, trunc_right = FALSE, nbuckets = 1)

mlvocab documentation built on Sept. 21, 2018, 6:35 p.m.