predict.paragraph2vec: Predict functionalities for a paragraph2vec model
In doc2vec: Distributed Representations of Sentences, Documents and Topics


Use the paragraph2vec model to:

get the embedding of documents, sentences or words
find the documents or words nearest to a set of documents, a set of words, or a set of sentences (each sentence given as a vector of words)

## S3 method for class 'paragraph2vec'
predict(
  object,
  newdata,
  type = c("embedding", "nearest"),
  which = c("docs", "words", "doc2doc", "word2doc", "word2word", "sent2doc"),
  top_n = 10L,
  encoding = "UTF-8",
  normalize = TRUE,
  ...
)

`object`	a paragraph2vec model as returned by `paragraph2vec` or `read.paragraph2vec`
`newdata`	either a character vector of words, a character vector of doc_id's, or a list of sentences, where each list element is a character vector of words that are part of the model dictionary. Which of these is expected depends on the value you pass to `which`. See the examples.
`type`	either 'embedding' or 'nearest', to get the embeddings or to find the closest text items. Defaults to 'embedding', the first value listed in the usage above.
`which`	one of 'docs', 'words', 'doc2doc', 'word2doc', 'word2word' or 'sent2doc'. Use 'docs' or 'words' when `type` is 'embedding', to indicate that `newdata` contains doc_id's or words respectively. Use 'doc2doc', 'word2doc', 'word2word' or 'sent2doc' when `type` is 'nearest', to extract respectively the closest documents to a document (doc2doc), the closest documents to a word (word2doc), the closest words to a word (word2word) or the closest documents to sentences (sent2doc).
`top_n`	show only the top n nearest neighbours. Defaults to 10, with a maximum value of 100. Only used for `type` 'nearest'.
`encoding`	set the encoding of the text elements to the specified encoding. Defaults to 'UTF-8'.
`normalize`	logical indicating to normalize the embeddings. Defaults to `TRUE`. Only used for `type` 'embedding'.
`...`	not used

Depending on `type`, the output differs:

for type 'nearest': a list of data.frames with columns term1, term2, similarity and rank, giving the elements closest to each item provided in newdata
for type 'embedding': a matrix of embeddings of the words, documents or sentences provided in newdata. Rownames are taken from the words or doc_id's, or from the list names of the sentences. The matrix always has the same number of rows as the length of newdata, possibly with NA values if a word or doc_id is not part of the dictionary

See the examples.
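
The two output shapes described above can be handled with base R alone. The following is a minimal sketch on mock data, not output from a real model: the numbers are made up, and it assumes that the list returned for type 'nearest' is named after the query items, which is not stated above.

```r
## Mock 'embedding' output: one row per element of newdata,
## with an all-NA row for a term that is not in the dictionary (made-up values)
emb <- rbind(
  geld        = c(0.12, -0.40, 0.33),
  koning      = c(-0.25, 0.18, 0.07),
  unknownword = c(NA_real_, NA_real_, NA_real_))

## Keep only the rows for which an embedding was found
known <- emb[complete.cases(emb), , drop = FALSE]
rownames(known)

## Mock 'nearest' output: a list of data.frames, one per query item
## (list names are an assumption, values are made up)
nearest <- list(
  geld   = data.frame(term1 = "geld",   term2 = c("doc_3", "doc_8"),
                      similarity = c(0.91, 0.85), rank = 1:2),
  koning = data.frame(term1 = "koning", term2 = c("doc_5", "doc_2"),
                      similarity = c(0.88, 0.80), rank = 1:2))

## Stack the per-query results and keep the best match for each query
nearest_all <- do.call(rbind, nearest)
best <- nearest_all[nearest_all$rank == 1, ]
```

Dropping the NA rows before any downstream computation avoids NA values propagating into distances or averages.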

paragraph2vec, read.paragraph2vec

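library(doc2vec)         # provides paragraph2vec() and this predict method
library(tokenizers.bpe)  # provides the belgium_parliament example data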
data(belgium_parliament, package = "tokenizers.bpe")
x <- belgium_parliament
x <- subset(x, language %in% "dutch")
x <- subset(x, nchar(text) > 0 & txt_count_words(text) < 1000)
x$doc_id <- sprintf("doc_%s", 1:nrow(x))
x$text   <- tolower(x$text)
x$text   <- gsub("[^[:alpha:]]", " ", x$text)
x$text   <- gsub("[[:space:]]+", " ", x$text)
x$text   <- trimws(x$text)

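## Build model
## Small PV-DM model that trains quickly
model <- paragraph2vec(x = x, type = "PV-DM",   dim = 15,  iter = 5)
## Larger PV-DBOW model; it overwrites the one above and is used in the rest of the examples
model <- paragraph2vec(x = x, type = "PV-DBOW", dim = 100, iter = 20)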


## Sentences are given as a named list of character vectors of words
sentences <- list(
  example = c("geld", "diabetes"),
  hi = c("geld", "diabetes", "koning"),
  test = c("geld"),
  nothing = character(),
  repr = c("geld", "diabetes", "koning"))

## Get embeddings (type = 'embedding')
## Items that are not part of the model dictionary give rows with NA values
predict(model, newdata = c("geld", "koning", "unknownword", NA, "</s>", ""), 
               type = "embedding", which = "words")
predict(model, newdata = c("doc_1", "doc_10", "unknowndoc", NA, "</s>"), 
               type = "embedding", which = "docs")
predict(model, sentences, type = "embedding")

## Get most similar items (type = 'nearest')
predict(model, newdata = c("doc_1", "doc_10"), type = "nearest", which = "doc2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2doc")
predict(model, newdata = c("geld", "koning"), type = "nearest", which = "word2word")
predict(model, newdata = sentences, type = "nearest", which = "sent2doc", top_n = 7)

## Alternative way of extracting similarities: from the embedding matrices directly
emb <- predict(model, sentences, type = "embedding")
emb_docs <- as.matrix(model, type = "docs")
paragraph2vec_similarity(emb, emb_docs, top_n = 3)
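
To make the last step more concrete, here is a base R sketch of a top-n similarity lookup between two embedding matrices. It is an illustration only: `top_n_similar` is a hypothetical helper that is not part of doc2vec, the input matrices are made-up toy values, and it assumes the similarity is the inner product of length-normalized rows (cosine similarity), which may differ from what `paragraph2vec_similarity` computes internally.

```r
## Hypothetical helper (not part of doc2vec): for each row of x,
## find the top_n rows of y by cosine similarity
top_n_similar <- function(x, y, top_n = 3L) {
  ## Scale every row to unit length so the inner product equals the cosine
  unit <- function(m) m / sqrt(rowSums(m^2))
  sims <- unit(x) %*% t(unit(y))
  out <- lapply(rownames(sims), function(term) {
    s <- head(sort(sims[term, ], decreasing = TRUE), top_n)
    data.frame(term1 = term, term2 = names(s),
               similarity = unname(s), rank = seq_along(s))
  })
  do.call(rbind, out)
}

## Toy 2-dimensional embeddings (made-up values)
queries <- rbind(q1 = c(1, 0), q2 = c(0, 2))
docs    <- rbind(doc_a = c(2, 0), doc_b = c(1, 1), doc_c = c(0, 3))
res <- top_n_similar(queries, docs, top_n = 2)
res
```

The result has the same term1, term2, similarity and rank columns as described for type 'nearest', stacked into a single data.frame.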