word2vec.character: Train a word2vec model on text
In word2vec: Distributed Representations of Words

word2vec.character

R Documentation

Description

Train a word2vec model on text. The algorithm is described by Mikolov et al. at https://arxiv.org/pdf/1310.4546.pdf

Usage

## S3 method for class 'character'
word2vec(
  x,
  type = c("cbow", "skip-gram"),
  dim = 50,
  window = ifelse(type == "cbow", 5L, 10L),
  iter = 5L,
  lr = 0.05,
  hs = FALSE,
  negative = 5L,
  sample = 0.001,
  min_count = 5L,
  stopwords = character(),
  threads = 1L,
  split = c(" \n,.-!?:;/\"#$%&'()*+<=>@[]\\^_`{|}~\t\v\f\r", ".\n?!"),
  encoding = "UTF-8",
  useBytes = TRUE,
  ...
)
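
The default for `window` in the signature above depends on `type`. The following is a minimal sketch of how such a dependent default resolves in R, assuming `type` is reduced to a single value with `match.arg` before `window` is first used (this page does not state that). The function `resolve_defaults` is a hypothetical stand-in, not part of the package.

```r
# Hypothetical stand-in for the argument handling, not the package's code
resolve_defaults <- function(type = c("cbow", "skip-gram"),
                             window = ifelse(type == "cbow", 5L, 10L)) {
  # Reduce `type` to one value first. R evaluates default arguments lazily,
  # so `window` is computed from the resolved `type` when it is first used.
  type <- match.arg(type)
  list(type = type, window = window)
}

resolve_defaults()                    # type "cbow", window 5
resolve_defaults(type = "skip-gram")  # type "skip-gram", window 10
```

Passing `window` explicitly bypasses the dependent default altogether.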

Arguments

`x`	a character vector with text, the path to a file on disk containing training data, or a list of tokens. See the examples.
`type`	the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'cbow'.
`dim`	dimension of the word vectors. Defaults to 50.
`window`	maximum skip length between words. Defaults to 5 for 'cbow' and 10 for 'skip-gram'.
`iter`	number of training iterations. Defaults to 5.
`lr`	initial learning rate, also known as alpha. Defaults to 0.05.
`hs`	logical indicating whether to use hierarchical softmax instead of negative sampling. Defaults to FALSE, in which case negative sampling is used.
`negative`	integer with the number of negative samples. Only used if `hs` is FALSE. Defaults to 5.
`sample`	threshold for sub-sampling frequent words (see Details). Defaults to 0.001.
`min_count`	integer indicating the number of times a word should occur to be considered part of the training vocabulary. Defaults to 5.
`stopwords`	a character vector of stopwords to exclude from training
`threads`	number of CPU threads to use. Defaults to 1.
`split`	a character vector of length 2 where the first element indicates how to split words and the second element indicates how to split sentences in `x`
`encoding`	the encoding of `x` and `stopwords`. Defaults to 'UTF-8'. Training always starts from files, which makes it possible to build a model on large corpora. The encoding argument is passed on to `file` when writing `x` to disk if you provided it as a character vector.
`useBytes`	logical passed on to `writeLines` when writing the text and stopwords to disk before building the model. Defaults to `TRUE`.
`...`	further arguments passed on to the methods `word2vec.character`, `word2vec.list` as well as the C++ function `w2v_train` - for expert use only
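
To make the role of the two `split` elements concrete, here is a rough base R sketch: text is first cut into sentences on any character of the second element, then each sentence is cut into words on any character of the first element. The helper `split_text` is hypothetical and only approximates the behaviour described above; the actual tokenisation happens in the C++ code and may differ in details.

```r
# The default value of `split`, copied from the Usage section
default_split <- c(" \n,.-!?:;/\"#$%&'()*+<=>@[]\\^_`{|}~\t\v\f\r", ".\n?!")

# Hypothetical helper, not part of the word2vec package
split_text <- function(text, split_chars) {
  # Break each string on any single character found in `chars`
  break_on <- function(strings, chars) {
    delims <- strsplit(chars, "", fixed = TRUE)[[1]]
    out <- character()
    for (s in strings) {
      ch <- strsplit(s, "", fixed = TRUE)[[1]]
      keep <- !(ch %in% delims)
      # Consecutive non-delimiter characters share a group id
      group <- cumsum(!keep)
      pieces <- split(ch[keep], group[keep])
      out <- c(out, unname(vapply(pieces, paste, character(1), collapse = "")))
    }
    out
  }
  sentences <- break_on(text, split_chars[2])
  lapply(sentences, break_on, chars = split_chars[1])
}

# One character vector of words per sentence
split_text("Mooie kamer, rustige buurt. Aanrader!", default_split)
```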

Details

Some advice on choosing the training parameters, as given by Mikolov et al.:

argument type: skip-gram (slower, better for infrequent words) vs cbow (fast)
argument hs: the training algorithm: hierarchical softmax (better for infrequent words) vs negative sampling (better for frequent words, better with low dimensional vectors)
argument dim: dimensionality of the word vectors: usually more is better, but not always
argument window: for skip-gram usually around 10, for cbow around 5
argument sample: sub-sampling of frequent words: can improve both accuracy and speed for large data sets (useful values are in range 0.001 to 0.00001)
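
As an illustration of the `sample` threshold, the paper by Mikolov et al. discards each occurrence of a word with probability 1 - sqrt(t / f), where t is the threshold and f is the relative frequency of the word. The sketch below applies that formula to made-up frequencies. The function `discard_prob` is hypothetical, and the C++ implementation used by this package may apply a variant of the formula.

```r
# Hypothetical helper implementing the sub-sampling formula from the paper:
# probability of discarding an occurrence of a word with relative frequency `freq`
discard_prob <- function(freq, sample = 0.001) {
  # Words with freq <= sample are never discarded, hence the floor at 0
  pmax(0, 1 - sqrt(sample / freq))
}

# Made-up relative frequencies, for illustration only
freq <- c(de = 0.05, kamer = 0.001, gastvrouw = 0.0001)
round(discard_prob(freq), 3)

# A lower threshold discards frequent words more aggressively
round(discard_prob(freq, sample = 0.00001), 3)
```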

Value

an object of class w2v_trained which is a list with elements

model: an Rcpp pointer to the model
data: a list with elements file: the training data used, stopwords: the character vector of stopwords, n
vocabulary: the number of words in the vocabulary
success: logical indicating if training succeeded
error_log: the error log in case training failed
control: a list of the training arguments used, namely min_count, dim, window, iter, lr, skipgram, hs, negative, sample, split_words, split_sents, expTableSize and expValueMax

References

https://github.com/maxoodf/word2vec, https://arxiv.org/pdf/1310.4546.pdf

Examples


library(word2vec)
library(udpipe)
## Take data and standardise it a bit
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)

## Build the model, get word embeddings and nearest neighbours
model <- word2vec(x = x, dim = 15, iter = 20)
emb   <- as.matrix(model)
head(emb)
emb   <- predict(model, c("bus", "toilet", "unknownword"), type = "embedding")
emb
nn    <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
nn

## Get vocabulary
vocab   <- summary(model, type = "vocabulary")

## Do some calculations with the vectors and find terms similar to the result
emb     <- as.matrix(model)
vector  <- emb["buurt", ] - emb["rustige", ] + emb["restaurants", ]
predict(model, vector, type = "nearest", top_n = 10)

vector  <- emb["gastvrouw", ] - emb["gastvrij", ]
predict(model, vector, type = "nearest", top_n = 5)

vectors <- emb[c("gastheer", "gastvrouw"), ]
vectors <- rbind(vectors, avg = colMeans(vectors))
predict(model, vectors, type = "nearest", top_n = 10)

## Save the model to hard disk
path <- "mymodel.bin"

write.word2vec(model, file = path)
model <- read.word2vec(path)


## 
## Example of word2vec with a list of tokens 
## 
toks  <- strsplit(x, split = "[[:space:][:punct:]]+")
model <- word2vec(x = toks, dim = 15, iter = 20)
emb   <- as.matrix(model)
emb   <- predict(model, c("bus", "toilet", "unknownword"), type = "embedding")
emb
nn    <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
nn

## 
## Example getting word embeddings 
##   which differ depending on the part-of-speech tag
## See the help of the udpipe R package 
##   to get part-of-speech tags on text
## 
library(udpipe)
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language == "fr")
x <- subset(x, grepl(xpos, pattern = paste(LETTERS, collapse = "|")))
x$text <- sprintf("%s/%s", x$lemma, x$xpos)
x <- subset(x, !is.na(lemma))
x <- split(x$text, list(x$doc_id, x$sentence_id))

model <- word2vec(x = x, dim = 15, iter = 20)
emb   <- as.matrix(model)
nn    <- predict(model, c("cuisine/NN", "rencontrer/VB"), type = "nearest")
nn
nn    <- predict(model, c("accueillir/VBN", "accueillir/VBG"), type = "nearest")
nn
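
The vector arithmetic in the examples above can also be explored by hand on the embedding matrix. The following base R sketch ranks the rows of a matrix by cosine similarity to a query vector. It assumes cosine similarity as the measure, which this page does not confirm for `predict(type = "nearest")`. The function `cosine_nearest` and the toy matrix are hypothetical.

```r
# Hypothetical helper: rank the rows of `emb` by cosine similarity to `vector`
cosine_nearest <- function(emb, vector, top_n = 5) {
  # Dot product of every row with the query, divided by both vector lengths
  sims <- as.vector(emb %*% vector) /
    (sqrt(rowSums(emb^2)) * sqrt(sum(vector^2)))
  names(sims) <- rownames(emb)
  head(sort(sims, decreasing = TRUE), top_n)
}

# Toy 2-dimensional embeddings, for illustration only
toy <- rbind(a = c(1, 0), b = c(0, 1), c = c(1, 1))

# Terms closest to the sum of the vectors of "a" and "b"
cosine_nearest(toy, toy["a", ] + toy["b", ], top_n = 3)
```

With a trained model, `as.matrix(model)` could be passed as `emb` in place of the toy matrix.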

word2vec documentation built on Oct. 8, 2023, 1:07 a.m.