textmodel_word2vec: Word2vec model

View source: R/word2vec.R


Word2vec model

Description

Train a Word2vec model (Mikolov et al., 2013) in one of two architectures on a quanteda::tokens object.

Usage

textmodel_word2vec(
  x,
  dim = 50,
  type = c("cbow", "skip-gram"),
  min_count = 5L,
  window = ifelse(type == "cbow", 5L, 10L),
  iter = 10L,
  alpha = 0.05,
  use_ns = TRUE,
  ns_size = 5L,
  sample = 0.001,
  normalize = TRUE,
  verbose = FALSE,
  ...
)

Arguments

x

a quanteda::tokens object.

dim

the size of the word vectors.

type

the architecture of the model; either "cbow" (continuous bag-of-words) or "skip-gram". See the sketch following the argument descriptions.

min_count

the minimum frequency of words to be retained. Words that occur fewer than min_count times in x are removed before training.

window

the size of the word window. Words within this window are considered to be the context of a target word.

iter

the number of iterations in model training.

alpha

the initial learning rate.

use_ns

if TRUE, negative sampling is used. Otherwise, hierarchical softmax is used.

ns_size

the size of negative samples. Only used when use_ns = TRUE.

sample

the rate of sub-sampling of frequent words based on their frequency. Sub-sampling is disabled when sample = 1.0.

normalize

if TRUE, normalize the vectors in values and weights.

verbose

if TRUE, print the progress of training.

...

additional arguments.
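
As a minimal sketch of how these arguments combine (the argument values here are illustrative, not recommendations), a skip-gram model trained with hierarchical softmax instead of negative sampling could be requested as follows, assuming toks is a quanteda::tokens object prepared as in the Examples:

# skip-gram architecture; use_ns = FALSE selects hierarchical softmax
w2v_sg <- textmodel_word2vec(toks, dim = 50, type = "skip-gram",
                             window = 10L, use_ns = FALSE)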

Details

Users can change the number of threads used for parallel computing via options(wordvector_threads).
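
For example, the following call (the value 4 is an arbitrary illustration) sets four threads before training:

options(wordvector_threads = 4L)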

Value

Returns a textmodel_wordvector object with the following elements:

values

a matrix of word vector values, with words in rows and dimensions in columns.

weights

a matrix for word vector weights.

dim

the size of the word vectors.

type

the architecture of the model.

frequency

the frequency of words in x.

window

the size of the word window.

iter

the number of iterations in model training.

alpha

the initial learning rate.

use_ns

whether negative sampling was used.

ns_size

the size of negative samples.

concatenator

the concatenator in x.

call

the command used to execute the function.

version

the version of the wordvector package.
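
As an illustration of the returned structure (assuming a fitted model w2v as in the Examples below), the elements can be inspected directly:

dim(w2v$values)     # number of words x dim
w2v$type            # "cbow" or "skip-gram"
head(w2v$frequency) # word frequencies in x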

References

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. https://arxiv.org/abs/1310.4546.

Examples


library(quanteda)
library(wordvector)

# pre-processing
corp <- data_corpus_news2014 
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>% 
   tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>% 
   tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                 padding = TRUE) %>% 
   tokens_tolower()

# train word2vec
w2v <- textmodel_word2vec(toks, dim = 50, type = "cbow", min_count = 5, sample = 0.001)

# find similar words
head(similarity(w2v, c("berlin", "germany", "france"), mode = "words"))
head(similarity(w2v, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "values"))
head(similarity(w2v, analogy(~ berlin - germany + france), mode = "words"))
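
# Not part of the original examples: a sketch of what similarity() computes,
# assuming the documented structure of w2v$values. With normalize = TRUE,
# rows of w2v$values have unit length, so the dot product between rows
# equals their cosine similarity.
v <- w2v$values
head(sort(drop(v %*% v["berlin", ]), decreasing = TRUE))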

