word2vec.list | R Documentation
Construct a word2vec model on text. The algorithm is explained at https://arxiv.org/pdf/1310.4546.pdf
## S3 method for class 'list'
word2vec(
x,
type = c("cbow", "skip-gram"),
dim = 50,
window = ifelse(type == "cbow", 5L, 10L),
iter = 5L,
lr = 0.05,
hs = FALSE,
negative = 5L,
sample = 0.001,
min_count = 5L,
stopwords = character(),
threads = 1L,
...
)
x: a character vector with text, the path to a file on disk containing training data, or a list of tokens. See the examples.
type: the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'cbow'.
dim: dimension of the word vectors. Defaults to 50.
window: skip length between words. Defaults to 5 for cbow and 10 for skip-gram.
iter: number of training iterations. Defaults to 5.
lr: initial learning rate, also known as alpha. Defaults to 0.05.
hs: logical indicating to use hierarchical softmax instead of negative sampling. Defaults to FALSE, indicating to do negative sampling.
negative: integer with the number of negative samples. Only used in case hs is set to FALSE.
sample: threshold for occurrence of words. Defaults to 0.001.
min_count: integer indicating the number of times a word should occur to be considered part of the training vocabulary. Defaults to 5.
stopwords: a character vector of stopwords to exclude from training.
threads: number of CPU threads to use. Defaults to 1.
...: further arguments passed on to the methods.
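To illustrate how these arguments fit together, a minimal sketch follows; the toks object (a list of character vectors with tokens, as built in the examples further down) and the Dutch stopwords shown are assumptions for illustration only.

## Sketch only: `toks` is assumed to be a list of character vectors,
## one tokenised sentence or document per element (see the examples).
model <- word2vec(
  x         = toks,
  type      = "cbow",                 # or "skip-gram"
  dim       = 50,                     # length of each word vector
  window    = 5,                      # context window around the focus word
  iter      = 5,                      # passes over the training data
  lr        = 0.05,                   # initial learning rate (alpha)
  hs        = FALSE,                  # FALSE: negative sampling, TRUE: hierarchical softmax
  negative  = 5,                      # number of negative samples, only used when hs = FALSE
  sample    = 0.001,                  # sub-sampling threshold for frequent words
  min_count = 5,                      # drop words occurring fewer than 5 times
  stopwords = c("de", "het", "een"),  # hypothetical stopwords to exclude
  threads   = 1
)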
Some advice on the optimal set of parameters to use for training, as given by Mikolov et al.; a configuration sketch follows this list.
argument type: skip-gram (slower, better for infrequent words) versus cbow (fast)
argument hs: the training algorithm: hierarchical softmax (better for infrequent words) versus negative sampling (better for frequent words, better with low-dimensional vectors)
argument dim: dimensionality of the word vectors: usually more is better, but not always
argument window: for skip-gram usually around 10, for cbow around 5
argument sample: sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in the range 0.001 to 0.00001)
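To make this advice concrete, the sketch below contrasts a configuration aimed at infrequent words with the fast default. It again assumes a tokenised toks list; the specific settings are illustrative, not prescriptive.

## Sketch: tuned for infrequent words, following the advice above
model_rare <- word2vec(x = toks, type = "skip-gram", hs = TRUE,
                       dim = 100, window = 10, sample = 0.00001)
## Sketch: fast default, better suited to frequent words
model_freq <- word2vec(x = toks, type = "cbow", hs = FALSE, negative = 5,
                       dim = 50, window = 5, sample = 0.001)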
An object of class w2v_trained, which is a list with elements:
model: an Rcpp pointer to the model
data: a list with elements file (the training data used), stopwords (the character vector of stopwords) and n
vocabulary: the number of words in the vocabulary
success: logical indicating if training succeeded
error_log: the error log in case training failed
control: a list of the training arguments used, namely min_count, dim, window, iter, lr, skipgram, hs, negative, sample, split_words, split_sents, expTableSize and expValueMax
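For illustration, the elements listed above can be accessed directly; the sketch below assumes model is an object returned by word2vec as in the examples further down.

## Sketch: inspect a trained model (assumes `model` from the examples below)
model$vocabulary      # number of words in the vocabulary
model$success         # logical, TRUE when training succeeded
model$control$dim     # one of the stored training arguments
str(model$data)       # training data used and the stopwords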
https://github.com/maxoodf/word2vec, https://arxiv.org/pdf/1310.4546.pdf
predict.word2vec, as.matrix.word2vec, word2vec, word2vec.character, word2vec.list
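##
## Example: train a word2vec model on tokenised Dutch reviews (udpipe package)
##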
library(udpipe)
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)
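## Tokenise the reviews by splitting on spaces and punctuation, then train on the list of tokens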
toks <- strsplit(x, split = "[[:space:][:punct:]]+")
model <- word2vec(x = toks, dim = 15, iter = 20)
emb <- as.matrix(model)
head(emb)
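## Look up the embedding of individual terms and find their nearest neighbours in the model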
emb <- predict(model, c("bus", "toilet", "unknownword"), type = "embedding")
emb
nn <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
nn
##
## Example of word2vec with a list of tokens
## which gives the same embeddings as with a similarly tokenised character vector of texts
##
txt <- txt_clean_word2vec(x, ascii = TRUE, alpha = TRUE, tolower = TRUE, trim = TRUE)
table(unlist(strsplit(txt, "")))
toks <- strsplit(txt, split = " ")
set.seed(1234)
modela <- word2vec(x = toks, dim = 15, iter = 20)
set.seed(1234)
modelb <- word2vec(x = txt, dim = 15, iter = 20, split = c(" \n\r", "\n\r"))
all.equal(as.matrix(modela), as.matrix(modelb))