word2vec.character: Train a word2vec model on text

word2vec.character  R Documentation


Description

Construct a word2vec model on text. The algorithm is explained at https://arxiv.org/pdf/1310.4546.pdf


Usage

## S3 method for class 'character'
word2vec(
  x,
  type = c("cbow", "skip-gram"),
  dim = 50,
  window = ifelse(type == "cbow", 5L, 10L),
  iter = 5L,
  lr = 0.05,
  hs = FALSE,
  negative = 5L,
  sample = 0.001,
  min_count = 5L,
  stopwords = character(),
  threads = 1L,
  split = c(" \n,.-!?:;/\"#$%&'()*+<=>@[]\\^_`{|}~\t\v\f\r", ".\n?!"),
  encoding = "UTF-8",
  useBytes = TRUE,
  ...
)



Arguments

  • x: a character vector with text, the path to a file on disk containing training data, or a list of tokens. See the examples.

  • type: the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'cbow'.

  • dim: dimension of the word vectors. Defaults to 50.

  • window: skip length between words. Defaults to 5 for 'cbow' and 10 for 'skip-gram'.

  • iter: number of training iterations. Defaults to 5.

  • lr: initial learning rate, also known as alpha. Defaults to 0.05.

  • hs: logical indicating whether to use hierarchical softmax instead of negative sampling. Defaults to FALSE, meaning negative sampling is used.

  • negative: integer with the number of negative samples. Only used if hs is FALSE. Defaults to 5.

  • sample: threshold for sub-sampling of frequent words. Defaults to 0.001.

  • min_count: integer indicating the number of times a word should occur to be considered part of the training vocabulary. Defaults to 5.

  • stopwords: a character vector of stopwords to exclude from training.

  • threads: number of CPU threads to use. Defaults to 1.

  • split: a character vector of length 2 where the first element indicates how to split words and the second element indicates how to split sentences in x.

  • encoding: the encoding of x and stopwords. Defaults to 'UTF-8'. Training always starts from files on disk, which makes it possible to build a model on large corpora. The encoding argument is passed on to file when x is written to disk because it was provided as a character vector.

  • useBytes: logical passed on to writeLines when writing the text and stopwords to disk before building the model. Defaults to TRUE.

  • ...: further arguments passed on to the methods word2vec.character and word2vec.list as well as the C++ function w2v_train. For expert use only.


Details

Some advice on the optimal set of parameters to use for training, as defined by Mikolov et al.

  • argument type: skip-gram (slower, better for infrequent words) vs cbow (fast)

  • argument hs: the training algorithm: hierarchical softmax (better for infrequent words) vs negative sampling (better for frequent words, better with low dimensional vectors)

  • argument dim: dimensionality of the word vectors: usually more is better, but not always

  • argument window: for skip-gram usually around 10, for cbow around 5

  • argument sample: sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 0.001 to 0.00001)
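The advice above can be tried out directly. The following is a minimal sketch, assuming the word2vec and udpipe packages are installed; it reuses the Dutch review data from the examples of this help page and trains one model per configuration. The object names model_sg, model_cbow, emb_sg and emb_cbow are illustrative only, and the chosen values for dim and iter are small so that the sketch runs quickly rather than being recommended settings.

```r
library(word2vec)

# Same Dutch review data as used in the examples of this help page
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)

# Skip-gram with hierarchical softmax: the combination the advice above
# associates with better handling of infrequent words (slower to train)
model_sg <- word2vec(x = x, type = "skip-gram", hs = TRUE,
                     dim = 15, window = 10L, iter = 20, sample = 0.001)

# CBOW with negative sampling (the defaults for type and hs), for comparison
model_cbow <- word2vec(x = x, type = "cbow", hs = FALSE, negative = 5L,
                       dim = 15, window = 5L, iter = 20, sample = 0.001)

# Both embedding matrices have one row per vocabulary term and dim columns
emb_sg   <- as.matrix(model_sg)
emb_cbow <- as.matrix(model_cbow)
dim(emb_sg)
dim(emb_cbow)
```

Which configuration works better depends on the corpus, so comparing the nearest neighbours of a few known terms in both models is a reasonable first check.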


Value

An object of class w2v_trained, which is a list with elements

  • model: an Rcpp pointer to the model

  • data: a list with elements file: the training data used, stopwords: the character vector of stopwords, n

  • vocabulary: the number of words in the vocabulary

  • success: logical indicating if training succeeded

  • error_log: the error log in case training failed

  • control: a list of the training arguments used, namely min_count, dim, window, iter, lr, skipgram, hs, negative, sample, split_words, split_sents, expTableSize and expValueMax
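As a minimal sketch of how the returned list might be inspected, assuming the elements are named exactly as listed above and that the word2vec and udpipe packages are installed, the snippet below checks whether training succeeded before using the model. The training call mirrors the one in the examples of this help page.

```r
library(word2vec)

# Same Dutch review data as used in the examples of this help page
data(brussels_reviews, package = "udpipe")
x <- tolower(subset(brussels_reviews, language == "nl")$feedback)

model <- word2vec(x = x, dim = 15, iter = 20)

# Stop early with the error log if training did not succeed
if (!isTRUE(model$success)) {
  stop("word2vec training failed: ", model$error_log)
}

model$vocabulary     # number of words in the vocabulary
model$control$dim    # dimension recorded among the training arguments
str(model$control)   # full list of training arguments used
```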


References

https://github.com/maxoodf/word2vec, https://arxiv.org/pdf/1310.4546.pdf

See Also

predict.word2vec, as.matrix.word2vec, word2vec, word2vec.character, word2vec.list


Examples

library(word2vec)

## Take data and standardise it a bit
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)

## Build the model, then get word embeddings and nearest neighbours
model <- word2vec(x = x, dim = 15, iter = 20)
emb   <- as.matrix(model)
emb   <- predict(model, c("bus", "toilet", "unknownword"), type = "embedding")
nn    <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)

## Get vocabulary
vocab   <- summary(model, type = "vocabulary")

## Do some calculations with the vectors and find terms similar to the result
emb     <- as.matrix(model)
vector  <- emb["buurt", ] - emb["rustige", ] + emb["restaurants", ]
predict(model, vector, type = "nearest", top_n = 10)

vector  <- emb["gastvrouw", ] - emb["gastvrij", ]
predict(model, vector, type = "nearest", top_n = 5)

vectors <- emb[c("gastheer", "gastvrouw"), ]
vectors <- rbind(vectors, avg = colMeans(vectors))
predict(model, vectors, type = "nearest", top_n = 10)

## Save the model to hard disk
path <- "mymodel.bin"

write.word2vec(model, file = path)
model <- read.word2vec(path)

## Example of word2vec with a list of tokens 
toks  <- strsplit(x, split = "[[:space:][:punct:]]+")
model <- word2vec(x = toks, dim = 15, iter = 20)
emb   <- as.matrix(model)
emb   <- predict(model, c("bus", "toilet", "unknownword"), type = "embedding")
nn    <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)

## Example of getting word embeddings
##   which differ depending on the part-of-speech tag
## See the help of the udpipe R package
##   to get part-of-speech tags for text
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language == "fr")
x <- subset(x, grepl(xpos, pattern = paste(LETTERS, collapse = "|")))
x$text <- sprintf("%s/%s", x$lemma, x$xpos)
x <- subset(x, !is.na(lemma))
x <- split(x$text, list(x$doc_id, x$sentence_id))

model <- word2vec(x = x, dim = 15, iter = 20)
emb   <- as.matrix(model)
nn    <- predict(model, c("cuisine/NN", "rencontrer/VB"), type = "nearest")
nn    <- predict(model, c("accueillir/VBN", "accueillir/VBG"), type = "nearest")

word2vec documentation built on Oct. 8, 2023, 1:07 a.m.