vocab_builder: A fast unigram vocabulary builder
In text2map: R Tools for Text Matrices, Embeddings, and Networks

vocab_builder

R Documentation

Description

A streamlined function that takes raw texts from a column of a data.frame and produces a list of all unique tokens. It tokenizes on a fixed single whitespace character and then extracts the unique tokens. The result can be used as input to dtm_builder() to standardize the vocabulary (i.e., the columns) across multiple DTMs. Before building the vocabulary, texts should have whitespace trimmed and, if desired, punctuation removed and terms lowercased.
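
The behavior described above can be sketched in base R. This is a minimal illustration under stated assumptions, not the package's implementation: the data frame `docs`, its column name `text`, and the preprocessing choices are hypothetical and chosen only for the example.

```r
# Hypothetical example data; the object and column names are assumptions.
docs <- data.frame(
  text = c("  The cat sat. ", "the dog sat!"),
  stringsAsFactors = FALSE
)

# Preprocess as the description suggests:
# trim whitespace, remove punctuation, lowercase.
clean <- tolower(gsub("[[:punct:]]", "", trimws(docs$text)))

# Split each text on a fixed single space, then keep the unique tokens.
vocab <- unique(unlist(strsplit(clean, " ", fixed = TRUE)))
vocab
```

In this sketch, splitting on a single fixed space means that consecutive spaces inside a text would yield an empty-string token, which is one reason to normalize whitespace before building the vocabulary.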