vocab_builder: A fast unigram vocabulary builder

View source: R/utils-dtm.R

vocab_builder    R Documentation

A fast unigram vocabulary builder

Description

A streamlined function that takes raw texts from a column of a data.frame and produces a list of all the unique tokens. It tokenizes by splitting on a fixed, single whitespace character and then extracts the unique tokens. The result can be used as input to dtm_builder() to standardize the vocabulary (i.e., the columns) across multiple DTMs. Before building the vocabulary, preprocess the texts as desired: trim whitespace, remove punctuation, and lowercase terms.
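To illustrate why preprocessing matters when tokenizing on a fixed single space, here is a minimal base-R sketch of the behavior described above. It is not the package's implementation: `sketch_vocab()` is a hypothetical helper, it takes the column name as a string (an assumption made for simplicity), and its handling of empty tokens reflects base R's `strsplit()` only.

```r
# Hypothetical helper (not part of text2map): approximates the described
# behavior with base R only.
sketch_vocab <- function(data, text) {
  # Split each document on a single, literal space (no regex)
  tokens <- strsplit(data[[text]], " ", fixed = TRUE)
  # Flatten and keep unique tokens in order of first appearance
  unique(unlist(tokens, use.names = FALSE))
}

corpus <- data.frame(
  text = c("The cat sat.", "the  dog sat"),
  stringsAsFactors = FALSE
)

# Without preprocessing, case and punctuation create distinct tokens
# ("The" vs. "the", "sat." vs. "sat"), and in this sketch the doubled
# space produces an empty token.
raw_vocab <- sketch_vocab(corpus, "text")

# Preprocess: lowercase, strip punctuation, collapse and trim whitespace
clean <- tolower(corpus$text)
clean <- gsub("[[:punct:]]", "", clean)
clean <- trimws(gsub(" +", " ", clean))
corpus$clean_text <- clean

clean_vocab <- sketch_vocab(corpus, "clean_text")
print(clean_vocab)
```

After cleaning, the two documents share one consistent set of tokens, which is the property that makes a vocabulary useful for aligning the columns of several DTMs.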

Usage

vocab_builder(data, text)

Arguments

data

Data.frame with one column of texts

text

Name of the column with documents' text

Value

Returns a list of the unique terms in a corpus.

Author(s)

Dustin Stoltz


text2map documentation built on May 29, 2024, 2:54 a.m.