View source: R/paragraph2vec.R
Use the paragraph2vec model to:
- get the embedding of documents, sentences or words
- find the nearest documents/words which are similar to a set of documents, a set of words, or a set of sentences containing words
object | a paragraph2vec model as returned by
newdata | either a character vector of words, a character vector of doc_id's or a list of sentences where the list elements are words that are part of the model dictionary. What needs to be provided depends on the argument you provide in
type | either 'embedding' or 'nearest' to get the embeddings or to find the closest text items. Defaults to 'nearest'.
which | either one of 'docs', 'words', 'doc2doc', 'word2doc', 'word2word' or 'sent2doc' where
top_n | show only the top n nearest neighbours. Defaults to 10, with a maximum value of 100. Only used for
encoding | set the encoding of the text elements to the specified encoding. Defaults to 'UTF-8'.
normalize | logical indicating whether to normalize the embeddings. Defaults to
... | not used
Depending on the type, you get a different output:
for type nearest: a list of data.frames with columns term1, term2, similarity and rank indicating the elements which are closest to the provided newdata
for type embedding: a matrix of embeddings of the words/documents or sentences provided in newdata; rownames are taken either from the words/documents or from the list names of the sentences. The matrix always has the same number of rows as the length of newdata, possibly with NA values if the word/doc_id is not part of the dictionary
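Because the embedding matrix can contain NA values for items outside the dictionary, it is often useful to separate known from unknown items before further computation. The following is a minimal base R sketch on a made-up matrix; it assumes that an unknown item yields a row consisting entirely of NA values, which is an assumption for illustration and not a statement about the package internals.

```r
# Toy stand-in for a matrix as returned with type = "embedding";
# the numbers are made up for illustration only.
emb <- rbind(
  geld        = c(0.12, -0.40, 0.33),
  koning      = c(-0.25, 0.10, 0.71),
  unknownword = c(NA, NA, NA)
)

# Assumption: items missing from the dictionary give rows that are entirely NA.
missing <- rowSums(is.na(emb)) == ncol(emb)
names(which(missing))

# Keep only rows with a usable embedding; drop = FALSE keeps the matrix shape
# even if a single row remains.
emb_known <- emb[!missing, , drop = FALSE]
dim(emb_known)
```

Filtering this way keeps the row names, so the remaining rows can still be matched back to the entries of newdata.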
See the examples.
paragraph2vec, read.paragraph2vec
library(doc2vec)
library(tokenizers.bpe)
data(belgium_parliament, package = "tokenizers.bpe")
## Keep the Dutch texts which are non-empty and shorter than 1000 words
x <- belgium_parliament
x <- subset(x, language %in% "dutch")
x <- subset(x, nchar(text) > 0 & txt_count_words(text) < 1000)
## Add document identifiers and clean the text:
## lowercase, keep letters only, collapse whitespace
x$doc_id <- sprintf("doc_%s", 1:nrow(x))
x$text <- tolower(x$text)
x$text <- gsub("[^[:alpha:]]", " ", x$text)
x$text <- gsub("[[:space:]]+", " ", x$text)
x$text <- trimws(x$text)
## Build model
model <- paragraph2vec(x = x, type = "PV-DM", dim = 15, iter = 5)
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20)
sentences <- list(
example = c("geld", "diabetes"),
hi = c("geld", "diabetes", "koning"),
test = c("geld"),
nothing = character(),
repr = c("geld", "diabetes", "koning"))
## Get embeddings (type = 'embedding')
predict(model, newdata = c("geld", "koning", "unknownword", NA, "</s>", ""),
type = "embedding", which = "words")
predict(model, newdata = c("doc_1", "doc_10", "unknowndoc", NA, "</s>"),
type = "embedding", which = "docs")
predict(model, sentences, type = "embedding")
## Get most similar items (type = 'nearest')
predict(model, newdata = c("doc_1", "doc_10"), type = "nearest", which = "doc2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2word")
predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 7)
## Alternative way of extracting similarities
emb <- predict(model, sentences, type = "embedding")
emb_docs <- as.matrix(model, type = "docs")
paragraph2vec_similarity(emb, emb_docs, top_n = 3)
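To make the shape of the 'nearest' output more concrete, the sketch below ranks rows of one embedding matrix against another in base R and returns the columns term1, term2, similarity and rank described above. The helper cosine_top_n is hypothetical and not part of doc2vec, the embeddings are made up, and cosine similarity is assumed purely for illustration; the package may compute similarities differently.

```r
# Hypothetical helper, not part of doc2vec: cosine similarity between the rows
# of x and the rows of y, keeping the top_n closest rows of y per row of x.
cosine_top_n <- function(x, y, top_n = 3) {
  # Scale every row to unit length so that the dot product equals the cosine.
  x <- x / sqrt(rowSums(x^2))
  y <- y / sqrt(rowSums(y^2))
  sim <- x %*% t(y)
  out <- lapply(rownames(sim), function(term) {
    # Order the candidates for this term from most to least similar.
    s <- head(sort(sim[term, ], decreasing = TRUE), top_n)
    data.frame(term1 = term, term2 = names(s), similarity = unname(s),
               rank = seq_along(s), stringsAsFactors = FALSE)
  })
  do.call(rbind, out)
}

# Made-up 2-dimensional embeddings for illustration only.
emb_toy  <- rbind(example = c(1, 0), test = c(0, 1))
docs_toy <- rbind(doc_1 = c(1, 0), doc_2 = c(1, 1), doc_3 = c(0, 1))
res <- cosine_top_n(emb_toy, docs_toy, top_n = 2)
res
```

Normalizing the rows first means the ranking depends only on the direction of each vector, not on its length.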