textmodel_word2vec (R Documentation)
Train a Word2vec model (Mikolov et al., 2013) in one of two architectures on a quanteda::tokens object.
textmodel_word2vec(
x,
dim = 50,
type = c("cbow", "skip-gram"),
min_count = 5L,
window = ifelse(type == "cbow", 5L, 10L),
iter = 10L,
alpha = 0.05,
use_ns = TRUE,
ns_size = 5L,
sample = 0.001,
normalize = TRUE,
verbose = FALSE,
...
)
x: a quanteda::tokens object.
dim: the size of the word vectors.
type: the architecture of the model; either "cbow" (continuous bag of words) or "skip-gram".
min_count: the minimum frequency of the words. Words less frequent than this in x are removed before training.
window: the size of the word window. Words within this window are considered to be the context of a target word.
iter: the number of iterations in model training.
alpha: the initial learning rate.
use_ns: if TRUE, negative sampling is used in training.
ns_size: the size of negative samples. Only used when use_ns = TRUE.
sample: the rate of sampling of words based on their frequency. Sampling is disabled when sample = 1.0.
normalize: if TRUE, normalize the word vectors to unit length.
verbose: if TRUE, print the progress of training.
...: additional arguments.
Users can change the number of threads used for parallel computing via options(wordvector_threads).
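For example, to train with four threads (assuming at least four cores are available on your machine), set the option before calling the function:

# assumption: four cores are available; adjust to your hardware
options(wordvector_threads = 4L)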
Returns a textmodel_wordvector object with the following elements:
values: a matrix for word vector values.
weights: a matrix for word vector weights.
dim: the size of the word vectors.
type: the architecture of the model.
frequency: the frequency of words in x.
window: the size of the word window.
iter: the number of iterations in model training.
alpha: the initial learning rate.
use_ns: the use of negative sampling.
ns_size: the size of negative samples.
concatenator: the concatenator in x.
call: the command used to execute the function.
version: the version of the wordvector package.
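These elements can be inspected with the usual list operators. A minimal sketch, assuming a fitted model w2v as produced in the examples below:

# assumes w2v is a fitted model, as in the examples below
w2v$dim              # size of the word vectors, e.g. 50
w2v$type             # "cbow" or "skip-gram"
dim(w2v$values)      # vocabulary size by vector size
head(w2v$frequency)  # frequency of the retained words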
Mikolov, T., Sutskever, I., Chen, K., Corrado, G., & Dean, J. (2013). Distributed Representations of Words and Phrases and their Compositionality. https://arxiv.org/abs/1310.4546.
library(quanteda)
library(wordvector)
# pre-processing
corp <- data_corpus_news2014
toks <- tokens(corp, remove_punct = TRUE, remove_symbols = TRUE) %>%
    tokens_remove(stopwords("en", "marimo"), padding = TRUE) %>%
    tokens_select("^[a-zA-Z-]+$", valuetype = "regex", case_insensitive = FALSE,
                  padding = TRUE) %>%
    tokens_tolower()
# train word2vec
w2v <- textmodel_word2vec(toks, dim = 50, type = "cbow", min_count = 5, sample = 0.001)
# find similar words
head(similarity(w2v, c("berlin", "germany", "france"), mode = "words"))
head(similarity(w2v, c("berlin" = 1, "germany" = -1, "france" = 1), mode = "values"))
head(similarity(w2v, analogy(~ berlin - germany + france), mode = "words"))
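The skip-gram architecture can be trained on the same tokens; note that window defaults to 10L in that case. A minimal sketch:

# train skip-gram on the same tokens (window defaults to 10L)
w2v_sg <- textmodel_word2vec(toks, dim = 50, type = "skip-gram", min_count = 5)
head(similarity(w2v_sg, c("berlin", "germany", "france"), mode = "words"))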