word2vec: Train a word2vec model on text

View source: R/word2vec.R

Description

Construct a word2vec model on text. The algorithm is explained at https://arxiv.org/pdf/1310.4546.pdf

Usage

word2vec(
  x,
  type = c("cbow", "skip-gram"),
  dim = 50,
  window = ifelse(type == "cbow", 5L, 10L),
  iter = 5L,
  lr = 0.05,
  hs = FALSE,
  negative = 5L,
  sample = 0.001,
  min_count = 5L,
  split = c(" \n,.-!?:;/\"#$%&'()*+<=>@[]\\^_`{|}~\t\v\f\r", ".\n?!"),
  stopwords = character(),
  threads = 1L,
  encoding = "UTF-8",
  ...
)
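
Because the default for window depends on type, the effective window is 5 for CBOW and 10 for skip-gram. The base R sketch below mimics how such linked defaults resolve. The helper resolve_window is hypothetical and not part of the package, and the sketch assumes that type is resolved to a single value with match.arg.

```r
# Hypothetical helper illustrating the linked defaults in the Usage block.
# Assumption: `type` is reduced to one choice, as match.arg() does.
resolve_window <- function(type = c("cbow", "skip-gram"),
                           window = ifelse(type == "cbow", 5L, 10L)) {
  # Resolve `type` first; the default for `window` is evaluated lazily,
  # so it sees the resolved single value rather than the full vector.
  type <- match.arg(type)
  list(type = type, window = window)
}

resolve_window()                          # cbow with its default window
resolve_window("skip-gram")               # skip-gram with its default window
resolve_window("skip-gram", window = 3L)  # an explicit window overrides the default
```

Passing window explicitly, as in the last call, bypasses the type-dependent default altogether.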

Arguments

x

a character vector with text or the path to the file on disk containing training data

type

the type of algorithm to use, either 'cbow' or 'skip-gram'. Defaults to 'cbow'

dim

dimension of the word vectors. Defaults to 50.

window

maximum skip length between words, i.e. the size of the context window. Defaults to 5 for 'cbow' and 10 for 'skip-gram'.

iter

number of training iterations. Defaults to 5.

lr

initial learning rate, also known as alpha. Defaults to 0.05.

hs

logical indicating whether to use hierarchical softmax instead of negative sampling. Defaults to FALSE, meaning negative sampling is used.

negative

integer with the number of negative samples. Only used when hs is FALSE. Defaults to 5.

sample

threshold for the occurrence of words, used for subsampling frequent words. Defaults to 0.001.

min_count

integer indicating the number of times a word should occur to be considered part of the training vocabulary. Defaults to 5.

split

a character vector of length 2 where the first element indicates how to split words and the second element indicates how to split sentences in x

stopwords

a character vector of stopwords to exclude from training

threads

number of CPU threads to use. Defaults to 1.

encoding

the encoding of x and stopwords. Defaults to 'UTF-8'. Training always starts from a file, which makes it possible to build a model on large corpora. If x is provided as a character vector, it is first written to disk, and the encoding argument is passed on to file for that step.

...

further arguments passed on to the C++ function w2v_train - for expert use only
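
To make the roles of split, stopwords and min_count more concrete, here is a toy illustration in base R. It only mirrors the documented intent of these arguments on a made-up sentence. It is not the package's C++ implementation, and the word separators are reduced to a small set for readability.

```r
# Toy illustration (base R only) of what split, stopwords and min_count control.
# Assumption: this follows the argument descriptions, not the C++ internals.
txt <- "the bus stops here. the bus is late! take the tram"

# Second element of `split`: characters that end a sentence
sentences <- unlist(strsplit(txt, "[.\n?!]"))

# First element of `split`: characters that separate words (reduced set here)
tokens <- unlist(strsplit(sentences, "[ ,.!?]+"))
tokens <- tokens[nzchar(tokens)]   # drop empty strings left by leading spaces

# Drop stopwords, then keep words that occur at least min_count times
stopwords <- c("the", "is")
min_count <- 2L
tokens    <- tokens[!tokens %in% stopwords]
freq      <- table(tokens)
vocab     <- names(freq)[freq >= min_count]
vocab
```

With a small corpus, lowering min_count below its default of 5 keeps more words in the vocabulary, at the cost of poorly estimated vectors for rare words.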

Details

Some advice on the optimal set of parameters to use for training as defined by Mikolov et al.
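
The sample argument relates to the subsampling of frequent words described in the paper referenced above, where each occurrence of a word w is discarded with probability P(w) = 1 - sqrt(t / f(w)), with f(w) the relative frequency of the word and t the threshold. The sketch below follows the paper's formula in base R. The function name discard_prob and the frequencies are made up for illustration, and the C++ backend may use a variant of this formula.

```r
# Discard probability from Mikolov et al.: P(w) = 1 - sqrt(t / f(w)),
# clipped at 0 for words whose frequency is below the threshold.
# `discard_prob` is a hypothetical helper, not a package function.
discard_prob <- function(freq, t = 0.001) {
  pmax(0, 1 - sqrt(t / freq))
}

# Relative frequencies of three hypothetical words
f <- c(very_common = 0.05, common = 0.004, rare = 0.0005)
round(discard_prob(f), 3)

# A smaller threshold discards frequent words more aggressively
round(discard_prob(f, t = 0.00001), 3)
```

Words with a relative frequency below the threshold are never discarded under this formula, so only the most frequent words are affected.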

Value

an object of class w2v_trained which is a list with elements

References

https://github.com/maxoodf/word2vec, https://arxiv.org/pdf/1310.4546.pdf

See Also

predict.word2vec, as.matrix.word2vec

Examples

library(word2vec)
library(udpipe)
## Take data and standardise it a bit
data(brussels_reviews, package = "udpipe")
x <- subset(brussels_reviews, language == "nl")
x <- tolower(x$feedback)

## Build the model, then get word embeddings and nearest neighbours
model <- word2vec(x = x, dim = 15, iter = 20)
emb   <- as.matrix(model)
head(emb)
emb <- predict(model, c("bus", "toilet", "unknownword"), type = "embedding")
emb
nn  <- predict(model, c("bus", "toilet"), type = "nearest", top_n = 5)
nn

## Get vocabulary
vocab <- summary(model, type = "vocabulary")

## Do some calculations with the vectors and find terms similar to the result
emb <- as.matrix(model)
vector <- emb["buurt", ] - emb["rustige", ] + emb["restaurants", ]
predict(model, vector, type = "nearest", top_n = 10)

vector <- emb["gastvrouw", ] - emb["gastvrij", ]
predict(model, vector, type = "nearest", top_n = 5)

vectors <- emb[c("gastheer", "gastvrouw"), ]
vectors <- rbind(vectors, avg = colMeans(vectors))
predict(model, vectors, type = "nearest", top_n = 10)

## Save the model to hard disk
path <- "mymodel.bin"

write.word2vec(model, file = path)
model <- read.word2vec(path)

## 
## Example getting word embeddings 
##   which differ depending on the part-of-speech tag
## See the help of the udpipe R package 
##   on how to get part-of-speech tags for text
## 
library(udpipe)
data(brussels_reviews_anno, package = "udpipe")
x <- subset(brussels_reviews_anno, language == "fr")
x <- subset(x, grepl(xpos, pattern = paste(LETTERS, collapse = "|")))
x$text <- sprintf("%s/%s", x$lemma, x$xpos)
x <- subset(x, !is.na(lemma))
x <- paste.data.frame(x, term = "text", group = "doc_id", collapse = " ")
x <- x$text

model <- word2vec(x = x, dim = 15, iter = 20, split = c(" ", ".\n?!"))
emb   <- as.matrix(model)
nn    <- predict(model, c("cuisine/NN", "rencontrer/VB"), type = "nearest")
nn
nn    <- predict(model, c("accueillir/VBN", "accueillir/VBG"), type = "nearest")
nn
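
The vector arithmetic used in the examples can also be tried without a trained model. The sketch below uses a made-up embedding matrix (toy_emb, hypothetical values) and ranks its rows by cosine similarity to a query vector. It assumes cosine similarity as the ranking measure, which may differ in detail from what predict(model, ..., type = "nearest") computes.

```r
# Made-up 2-dimensional embeddings; rows are words (hypothetical values)
toy_emb <- rbind(
  king  = c( 0.9, 0.8),
  queen = c( 0.8, 0.9),
  bus   = c(-0.7, 0.1),
  tram  = c(-0.8, 0.2)
)

# Cosine similarity between one vector `v` and every row of matrix `m`
cosine_to_all <- function(v, m) {
  as.vector(m %*% v) / (sqrt(rowSums(m^2)) * sqrt(sum(v^2)))
}

# Rank all words by similarity to the average of "bus" and "tram"
query <- colMeans(toy_emb[c("bus", "tram"), ])
sims  <- setNames(cosine_to_all(query, toy_emb), rownames(toy_emb))
sort(sims, decreasing = TRUE)
```

The same pattern applies to the matrix returned by as.matrix(model): build a query vector from rows of the matrix, then rank all rows against it.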

word2vec documentation built on July 2, 2021, 5:07 p.m.