Description Usage Arguments Value References See Also Examples
View source: R/paragraph2vec.R
Construct a paragraph2vec model on text.
The algorithm is explained at https://arxiv.org/pdf/1405.4053.pdf.
People also refer to this model as doc2vec.
The model is an extension of the word2vec algorithm,
in which an additional vector for every paragraph is trained alongside the word vectors.
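The PV-DM idea of combining a per-paragraph vector with context word vectors can be pictured in a toy base-R sketch (illustrative only; this is not the package's C++ implementation, and the averaging step is one simplification PV-DM allows):

```r
# Toy sketch of PV-DM: the paragraph vector is combined (here: averaged)
# with the context word vectors to form the input used to predict the
# next word. All values are random placeholders.
set.seed(42)
d <- 4
word_vecs <- matrix(rnorm(3 * d), nrow = 3,
                    dimnames = list(c("the", "cat", "sat"), NULL))
doc_vec <- rnorm(d)   # one extra vector per paragraph, trained jointly
context <- c("the", "cat")
input <- colMeans(rbind(word_vecs[context, , drop = FALSE], doc_vec))
length(input)  # same dimension (4) as the word vectors
```

PV-DBOW drops the context words and uses the paragraph vector alone to predict words sampled from the paragraph.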
x: a data.frame with columns doc_id and text, or the path to a file on disk containing the training data.

type: character string with the type of algorithm to use, either 'PV-DBOW' or 'PV-DM'. Defaults to 'PV-DBOW'.

dim: dimension of the word and paragraph vectors. Defaults to 50.

window: skip length between words. Defaults to 10 for PV-DM and 5 for PV-DBOW.

iter: number of training iterations. Defaults to 20.

lr: initial learning rate, also known as alpha. Defaults to 0.05.

hs: logical indicating whether to use hierarchical softmax instead of negative sampling. Defaults to FALSE, meaning negative sampling is used.

negative: integer with the number of negative samples. Only used when hs is set to FALSE.

sample: threshold for the occurrence of words. Defaults to 0.001.

min_count: integer indicating the number of times a word should occur to be considered part of the training vocabulary. Defaults to 5.

threads: number of CPU threads to use. Defaults to 1.

encoding: the encoding of the text data in x.

embeddings: optionally, a matrix with pretrained word embeddings which will be used to initialise the word embedding space (transfer learning). The rownames of this matrix should consist of words; only words overlapping with the vocabulary extracted from x will be used.

...: further arguments passed on to the underlying C++ training function.
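The effect of the min_count argument can be illustrated in plain base R (a sketch of the idea, not the package's internal vocabulary builder):

```r
# Illustration of how a min_count threshold prunes the vocabulary:
# words occurring fewer than min_count times are dropped before training.
tokens <- c("the", "the", "the", "cat", "cat", "sat")
counts <- table(tokens)
min_count <- 2
vocab <- names(counts)[counts >= min_count]
vocab  # "cat" and "the"; "sat" occurs only once and is dropped
```

Rare words pruned this way get no word vector and do not contribute to the paragraph vectors.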
an object of class paragraph2vec_trained, which is a list with elements:

model: an Rcpp pointer to the model

data: a list with elements file (the training data used), n (the number of words in the training data), n_vocabulary (the number of words in the vocabulary) and n_docs (the number of documents)

control: a list of the training arguments used, namely min_count, dim, window, iter, lr, skipgram, hs, negative and sample
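The shape of the returned list can be mocked up in base R (placeholder values for illustration only; a real model holds an external pointer and values derived from your training data):

```r
# Sketch of the structure of a paragraph2vec_trained object.
# All values below are hypothetical placeholders.
model <- list(
  model = NULL,  # in reality an Rcpp external pointer to the C++ model
  data = list(file = "training.txt", n = 1000L,
              n_vocabulary = 200L, n_docs = 10L),
  control = list(min_count = 5L, dim = 50L, window = 10L, iter = 20L,
                 lr = 0.05, skipgram = FALSE, hs = FALSE,
                 negative = 5L, sample = 0.001)
)
names(model$data)  # "file" "n" "n_vocabulary" "n_docs"
```

In practice you rarely access these elements directly; use as.matrix and predict on the model instead.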
https://arxiv.org/pdf/1405.4053.pdf, https://groups.google.com/g/word2vec-toolkit/c/Q49FIrNOQRo/m/J6KG8mUj45sJ
predict.paragraph2vec, as.matrix.paragraph2vec
library(tokenizers.bpe)
## Take data and standardise it a bit
data(belgium_parliament, package = "tokenizers.bpe")
str(belgium_parliament)
x <- subset(belgium_parliament, language %in% "french")
x$text <- tolower(x$text)
x$text <- gsub("[^[:alpha:]]", " ", x$text)
x$text <- gsub("[[:space:]]+", " ", x$text)
x$text <- trimws(x$text)
x$nwords <- txt_count_words(x$text)
x <- subset(x, nwords < 1000 & nchar(text) > 0)
## Build the model
model <- paragraph2vec(x = x, type = "PV-DM", dim = 15, iter = 5)
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20)
str(model)
embedding <- as.matrix(model, which = "words")
embedding <- as.matrix(model, which = "docs")
head(embedding)
## Get vocabulary
vocab <- summary(model, type = "vocabulary", which = "docs")
vocab <- summary(model, type = "vocabulary", which = "words")
## Transfer learning using existing word embeddings
library(word2vec)
w2v <- word2vec(x$text, dim = 50, type = "cbow", iter = 20, min_count = 5)
emb <- as.matrix(w2v)
model <- paragraph2vec(x = x, dim = 50, type = "PV-DM", iter = 20, min_count = 5,
embeddings = emb)
## Transfer learning - proof of concept without learning (iter=0, set to higher to learn)
emb <- matrix(rnorm(30), nrow = 2, dimnames = list(c("en", "met")))
model <- paragraph2vec(x = x, type = "PV-DM", dim = 15, iter = 0, embeddings = emb)
embedding <- as.matrix(model, which = "words", normalize = FALSE)
embedding[c("en", "met"), ]
emb