seq_builder: Represent Documents as Token-Integer Sequences
In text2map: R Tools for Text Matrices, Embeddings, and Networks

seq_builder

R Documentation

Description

First, each token in the vocabulary is mapped to an integer in a lookup dictionary. Next, each document is converted to a sequence of integers, where each integer is the dictionary index of the corresponding token.
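The two steps above can be sketched in base R. This is only an illustration of the idea, not the package's implementation: the toy corpus, the whitespace tokenization, and the first-appearance ordering of the dictionary are all assumptions of the sketch.

```r
# Toy corpus (hypothetical data, not shipped with the package)
docs <- c(d1 = "the cat sat", d2 = "the dog sat down")

# Assumption: tokens are separated by single spaces
tokens <- strsplit(docs, " ", fixed = TRUE)

# Step 1: lookup dictionary mapping each unique token to an integer
# (assumption: indices follow order of first appearance)
vocab <- unique(unlist(tokens, use.names = FALSE))
dic <- setNames(seq_along(vocab), vocab)

# Step 2: replace each token with its integer index from the dictionary
seqs <- lapply(tokens, function(x) unname(dic[x]))

print(dic)
print(seqs)
```

Each element of `seqs` has the same length as the corresponding document, so documents of different lengths give sequences of different lengths.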

Usage

seq_builder(
  data,
  text,
  doc_id = NULL,
  vocab = NULL,
  maxlen = NULL,
  matrix = TRUE
)

Arguments

`data`	Data.frame with a column of texts and a column of document ids
`text`	Name of the column with documents' text
`doc_id`	Name of the column with documents' unique ids
`vocab`	Default is `NULL`. If a list of terms is provided, the dictionary is restricted to this vocabulary instead of every unique token in the corpus.
`maxlen`	Integer indicating the maximum document length. If `NULL` (default), the length of the longest document is used.
`matrix`	Logical. `TRUE` (default) returns a matrix; `FALSE` returns a list.

Details

By default, the function returns a matrix of integer sequences. The number of columns equals the length of the longest document, or maxlen if it is provided, and shorter documents are padded with zeros. The dictionary is stored as an attribute of the matrix and can be accessed with attr(seq, "dic"). If matrix = FALSE, the function returns a list of integer sequences instead. The vocabulary is either every unique token in the corpus or the list of words provided to the vocab argument. This kind of text representation is used in TensorFlow and Keras.
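The padded-matrix layout described above can be illustrated with a small base R sketch. The helper `pad_seqs()` is hypothetical and is not part of text2map. Placing the zeros after the tokens and truncating documents longer than maxlen are assumptions of this sketch; the text above does not specify either behavior.

```r
# Hypothetical dictionary and integer sequences for a toy corpus
dic <- c(the = 1L, cat = 2L, sat = 3L, dog = 4L, down = 5L)
seqs <- list(d1 = c(1L, 2L, 3L), d2 = c(1L, 4L, 3L, 5L))

# Hypothetical helper: one row per document, zero-padded to `maxlen`
pad_seqs <- function(seqs, maxlen = NULL) {
  # With no maxlen, use the length of the longest document
  if (is.null(maxlen)) maxlen <- max(lengths(seqs))
  out <- matrix(0L, nrow = length(seqs), ncol = maxlen,
                dimnames = list(names(seqs), NULL))
  for (i in seq_along(seqs)) {
    # Assumption: longer documents are cut off at maxlen
    n <- min(length(seqs[[i]]), maxlen)
    if (n > 0) out[i, seq_len(n)] <- seqs[[i]][seq_len(n)]
  }
  out
}

mat <- pad_seqs(seqs)

# Attach the dictionary as an attribute, mirroring attr(seq, "dic")
attr(mat, "dic") <- dic

print(mat)
print(attr(mat, "dic"))
```

A fixed-width integer matrix of this kind is the shape that embedding layers in TensorFlow and Keras typically expect, which is why shorter documents need padding.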