train_biword2vec: Train a model by bi-directional word2vec (biword2vec).

Description

Trains a bi-directional word2vec model on a training corpus, typically a single .txt file. A bi-directional word2vec model learns separate vector representations for the left and right contexts of a word. This lets the user determine whether a word's association with another word is enriched in its left or its right context of use.
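
To make the idea of separate left and right contexts concrete, the sketch below counts the neighbours on each side of a target word in a toy sentence. It is only an illustration in base R, not the training algorithm used by train_biword2vec; the helper context_words and the toy sentence are hypothetical.

```r
# Toy corpus; tokens are split on spaces, as train_file expects.
tokens <- strsplit("new york is a big city and new york is busy",
                   " ", fixed = TRUE)[[1]]

# Hypothetical helper: collect the words found within `window` positions
# on one side of every occurrence of `target`.
context_words <- function(tokens, target, window = 1,
                          side = c("left", "right")) {
  side <- match.arg(side)
  hits <- which(tokens == target)
  offsets <- if (side == "left") -seq_len(window) else seq_len(window)
  idx <- unlist(lapply(hits, function(i) i + offsets))
  idx <- idx[idx >= 1 & idx <= length(tokens)]  # drop positions off the ends
  tokens[idx]
}

left_ctx  <- context_words(tokens, "york", side = "left")
right_ctx <- context_words(tokens, "york", side = "right")

print(table(left_ctx))   # words seen immediately before "york"
print(table(right_ctx))  # words seen immediately after "york"
```

In this toy example "new" occurs only to the left of "york" and "is" only to the right, which is the kind of asymmetry that separate left and right vectors are meant to capture.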

Usage

train_biword2vec(train_file, output_file_left = "vectors_left.bin",
  output_file_right = "vectors_right.bin",
  output_file_out = "vectors_out.bin", vectors = 100, threads = 3,
  window = 12, classes = 0, cbow = 0, min_count = 1, iter = 5,
  force = F, negative_samples = 0)

Arguments

train_file

Path of a single .txt file for training. Tokens are split on spaces.

output_file_left

Path of the output file for the left context words.

output_file_right

Path of the output file for the right context words.

vectors

The number of dimensions of each word vector. Defaults to 100. More dimensions usually mean more precision, but also more random error, higher memory usage, and slower operations. Sensible choices are probably in the range 100-500.

threads

Number of threads to run the training process on. Defaults to 3; using up to the number of (virtual) cores on your machine may speed things up.

window

The size of the window (in words) to use in training.

classes

Number of classes for k-means clustering. Not documented/tested.

cbow

If 1, use a continuous-bag-of-words model instead of skip-grams. Defaults to 0, i.e. skip-grams (recommended for newcomers).

min_count

Minimum times a word must appear to be included in the samples. High values help reduce model size.

iter

Number of passes to make over the corpus in training.

force

Whether to overwrite existing model files.

negative_samples

Number of negative samples to take in skip-gram training. 0 (the default) means full sampling, while small nonzero values give faster training. For large corpora 2-5 may work; for smaller corpora, 5-15 is reasonable.
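
As a rough sketch of how the arguments above might be chosen together, the snippet below derives a thread count from the number of cores and collects the settings in a list. The specific values (200 dimensions, min_count of 5, 5 negative samples) and the file name are illustrative assumptions, not recommendations from the package authors; the actual call is left commented out because it needs the package installed and a real training file.

```r
library(parallel)  # ships with R; provides detectCores()

# detectCores() can return NA on some platforms, so fall back to 1.
n_cores <- detectCores()
if (is.na(n_cores)) n_cores <- 1L

# Leaving one core free is a common convention, not a requirement.
n_threads <- max(1L, n_cores - 1L)

# Hypothetical settings, using only argument names from the Usage section.
train_args <- list(
  train_file        = "nation.txt",       # assumed to exist on disk
  output_file_left  = "out_left.bin",
  output_file_right = "out_right.bin",
  vectors           = 200,                # within the suggested 100-500 range
  threads           = n_threads,
  window            = 12,
  min_count         = 5,                  # drop rare words to reduce model size
  iter              = 5,
  negative_samples  = 5
)

str(train_args)

# With the package loaded and the training file present, the call would be:
# model <- do.call(train_biword2vec, train_args)
```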

Value

A VectorSpaceModel object.

References

https://code.google.com/p/word2vec/

Examples

## Not run: 
model = train_biword2vec("nation.txt", output_file_left = "out_left.bin",
                         output_file_right = "out_right.bin", threads = 3,
                         window = 5)

## End(Not run)
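
Because train_file must be a single .txt file whose tokens are split on spaces, some light cleaning beforehand is often useful. The following is a minimal base R sketch under that assumption; the sample text, the cleaning rules, and the temporary file name are hypothetical and should be adapted to the corpus at hand.

```r
# Hypothetical raw text; in practice this would be read from your own corpus.
raw_text <- c("The Nation, founded in 1865!",
              "Words are split   on spaces.")

# Lowercase, replace anything that is not a letter, digit, or space with a
# space, then collapse repeated spaces and trim the ends.
clean <- tolower(raw_text)
clean <- gsub("[^a-z0-9 ]+", " ", clean)
clean <- trimws(gsub(" +", " ", clean))

# Write everything to a single text file that can be passed as train_file.
train_file <- file.path(tempdir(), "corpus_clean.txt")
writeLines(clean, train_file)

# Check how the file will be tokenized when split on single spaces.
tokens <- unlist(strsplit(readLines(train_file), " ", fixed = TRUE))
print(tokens)
```

The resulting path in train_file can then be used as the first argument in the example above in place of "nation.txt".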

kkdey/WEAVER documentation built on May 8, 2019, 9:24 a.m.