prep_word2vec: Prepare documents for word2Vec

View source: R/word2vec.R

prep_word2vec    R Documentation

Prepare documents for word2Vec

Description

This function exports a directory or document to a single file suitable for Word2Vec to run on. That means a single, seekable text file with tokens separated by spaces. (For example, punctuation is removed rather than attached to the ends of words.) This function is extraordinarily inefficient: in most real-world cases, you'll be much better off preparing the documents using Python, Perl, awk, or any other scripting language that can reasonably read files in line by line.
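As a rough illustration of the line-by-line alternative mentioned above, here is a minimal Python sketch. It is not part of this package: the names clean_line and prep_corpus are hypothetical, and the tokenization rule (runs of letters and digits, with internal apostrophes kept) is an assumption that may not match what prep_word2vec itself does.

```python
import re
from pathlib import Path

# Assumption: a token is a run of letters/digits, optionally with internal
# apostrophes. Everything else (punctuation, symbols) acts as a separator.
TOKEN_RE = re.compile(r"[^\W_]+(?:'[^\W_]+)*")


def clean_line(line, lowercase=False):
    """Return the tokens of one line joined by single spaces."""
    if lowercase:
        line = line.lower()
    return " ".join(TOKEN_RE.findall(line))


def prep_corpus(origin, destination, lowercase=False):
    """Stream a text file, or every *.txt file in a directory, into one file."""
    origin = Path(origin)
    sources = sorted(origin.glob("*.txt")) if origin.is_dir() else [origin]
    with open(destination, "w", encoding="utf-8") as out:
        for source in sources:
            with open(source, encoding="utf-8", errors="replace") as src:
                for line in src:  # only one line is held in memory at a time
                    cleaned = clean_line(line, lowercase)
                    if cleaned:  # skip lines that contain no tokens
                        out.write(cleaned + "\n")
    return destination
```

A call such as prep_corpus("cookbooks", "cookbooks.txt", lowercase=True), with placeholder paths, would write one cleaned line per non-empty input line. Because the input is streamed, memory use stays flat regardless of corpus size.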

Usage

prep_word2vec(origin, destination, lowercase = F, bundle_ngrams = 1, ...)

Arguments

origin

A text file or a directory of text files to be used in training the model.

destination

The location for output text.

lowercase

Logical. Should uppercase characters be converted to lowercase?

bundle_ngrams

Integer. Statistically significant phrases of up to this many words will be joined with underscores: e.g., "United States" will usually be changed to "United_States" if it appears frequently in the corpus. This calls word2phrase once if bundle_ngrams is 2, twice if bundle_ngrams is 3, and so forth; see that function for more details.
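The repeated-pass behaviour described above can be sketched in Python on a toy token list. This is only an illustration under stated assumptions: bundle_once and bundle_passes are hypothetical names, the count-based score is modelled on the style of scoring used by word2phrase, and the min_count and threshold values are toy settings chosen for the example, not the package's defaults. The real function works on files and may differ in detail.

```python
from collections import Counter


def bundle_once(tokens, min_count, threshold):
    """One pass: join adjacent pairs whose score exceeds threshold with '_'."""
    unigrams = Counter(tokens)
    bigrams = Counter(zip(tokens, tokens[1:]))
    total = len(tokens)
    out, i = [], 0
    while i < len(tokens):
        if i + 1 < len(tokens):
            a, b = tokens[i], tokens[i + 1]
            # Assumed score: pairs that co-occur more often than their
            # individual frequencies would suggest get a high value.
            score = (bigrams[(a, b)] - min_count) * total / (unigrams[a] * unigrams[b])
            if score > threshold:
                out.append(a + "_" + b)
                i += 2  # both words are consumed by the joined phrase
                continue
        out.append(tokens[i])
        i += 1
    return out


def bundle_passes(tokens, bundle_ngrams, min_count, threshold):
    """Run bundle_ngrams - 1 passes, so phrases can grow by one word per pass."""
    for _ in range(bundle_ngrams - 1):
        tokens = bundle_once(tokens, min_count, threshold)
    return tokens


corpus = "the united states and the people of the world saw the united states".split()
print(" ".join(bundle_passes(corpus, 2, min_count=1, threshold=2.0)))
# prints: the united_states and the people of the world saw the united_states
```

With bundle_ngrams set to 1 the loop runs zero times and the tokens come back unchanged. With 3, a second pass runs over the already-joined output, which is how a two-word phrase could pick up a third word.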

...

Further arguments passed to word2phrase when bundle_ngrams is greater than 1.

Value

The file name (silently).


bmschmidt/wordVectors documentation built on June 2, 2022, 3:53 p.m.