textmodel_word2vec: Word2vec model

View source: R/word2vec.R


Word2vec model

Description

Train a Word2vec model (Mikolov et al., 2013) in one of two architectures on a quanteda::tokens object.

Usage

textmodel_word2vec(
  x,
  dim = 50,
  type = c("cbow", "skip-gram"),
  min_count = 5L,
  window = ifelse(type == "cbow", 5L, 10L),
  iter = 10L,
  alpha = 0.05,
  use_ns = TRUE,
  ns_size = 5L,
  sample = 0.001,
  normalize = TRUE,
  verbose = FALSE,
  ...
)

Arguments

x

a quanteda::tokens object.

dim

the size of the word vectors.

type

the architecture of the model; either "cbow" (continuous bag-of-words) or "skip-gram". See the sketch following the argument descriptions.

min_count

the minimum frequency of words to be retained. Words that occur fewer than min_count times in x are removed before training.

window

the size of the word window. Words within this window are considered to be the context of a target word.

iter

the number of iterations in model training.

alpha

the initial learning rate.

use_ns

if TRUE, negative sampling is used. Otherwise, hierarchical softmax is used.

ns_size

the size of negative samples. Only used when use_ns = TRUE.

sample

the rate of sub-sampling of frequent words based on their frequency. Sub-sampling is disabled when sample = 1.0.

normalize

if TRUE, normalize the vectors in values and weights.

verbose

if TRUE, print the progress of training.

...

additional arguments.
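
As a minimal sketch of how these arguments combine (the argument values here are illustrative, not recommendations), a skip-gram model trained with hierarchical softmax instead of negative sampling could be requested as follows, assuming toks is a quanteda::tokens object prepared as in the Examples:

# skip-gram architecture; use_ns = FALSE selects hierarchical softmax
w2v_sg <- textmodel_word2vec(toks, dim = 50, type = "skip-gram",
                             window = 10L, use_ns = FALSE)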

Details

Users can change the number of threads used for parallel computing via options(wordvector_threads).
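
For example, the following call (the value 4 is an arbitrary illustration) sets four threads before training:

options(wordvector_threads = 4L)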

Value

Returns a textmodel_wordvector object with the following elements:

values

a matrix of word vector values, with words in rows and dimensions in columns.

weights

a matrix for word vector weights.

dim

the size of the word vectors.

type

the architecture of the model.

frequency

the frequency of words in x.

window

the size of the word window.

iter

the number of iterations in model training.

alpha

the initial learning rate.

use_ns

whether negative sampling was used.

ns_size

the size of negative samples.

concatenator

the concatenator in x.

call

the command used to execute the function.

version

the version of the wordvector package.
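
As an illustration of the returned structure (assuming a fitted model w2v as in the Examples below), the elements can be inspected directly:

dim(w2v$values)     # number of words x dim
w2v$type            # "cbow" or "skip-gram"
head(w2v$frequency) # word frequencies in x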

References

Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. https://arxiv.org/abs/1310.4546.

Examples


library(quanteda)
library(wordvector)

# pre-processing
corp <- data_corpus_news2014 
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>% 
   tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>% 
   tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                 padding = TRUE) %>% 
   tokens_tolower()

# train word2vec
w2v <- textmodel_word2vec(toks, dim = 50, type = "cbow", min_count = 5, sample = 0.001)

# find similar words
head(similarity(w2v, c("berlin", "germany", "france"), mode = "words"))
head(similarity(w2v, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "values"))
head(similarity(w2v, analogy(~ berlin - germany + france), mode = "words"))
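
# Not part of the original examples: a sketch of what similarity() computes,
# assuming the documented structure of w2v$values. With normalize = TRUE,
# rows of w2v$values have unit length, so the dot product between rows
# equals their cosine similarity.
v <- w2v$values
head(sort(drop(v %*% v["berlin", ]), decreasing = TRUE))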

