doc2vec: Get document vectors based on a word2vec model

doc2vec R Documentation

Description

A document vector is the sum of the vectors of the words in the document, standardised by the scale of the vector space. This scale is the square root of the average inner product of the vector elements.
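
One way to read this description is sketched below in base R, using a made-up embedding matrix rather than a trained model. The matrix `emb`, the helper `doc_vector`, the choice to skip unknown tokens, and the reading of the scale as the root mean square of the summed vector are all illustrative assumptions, not the package's internals.

```r
# Toy 3-dimensional embeddings for three words (values are made up).
emb <- rbind(bus    = c(1, 2, 2),
             toilet = c(0, 1, -1),
             no     = c(2, 0, 1))

# Hypothetical helper: sum the vectors of the known tokens, then divide
# by the root mean square of the elements of that sum.
doc_vector <- function(tokens, emb) {
  known <- tokens[tokens %in% rownames(emb)]   # ignore out-of-vocabulary tokens
  if (length(known) == 0) {
    return(rep(NA_real_, ncol(emb)))           # nothing to sum
  }
  total <- colSums(emb[known, , drop = FALSE]) # element-wise sum of word vectors
  scale <- sqrt(mean(total^2))                 # assumed meaning of "scale"
  total / scale
}

v <- doc_vector(c("no", "toilet", "on", "the", "bus"), emb)
v
sqrt(mean(v^2))   # the rescaled vector has unit root mean square
```

Under this reading, every document vector ends up with the same overall magnitude, so long and short documents become comparable.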

Usage

doc2vec(object, newdata, split = " ", encoding = "UTF-8", ...)

Arguments

object

a word2vec model as returned by word2vec or read.word2vec

newdata

either a list of tokens where each list element is a character vector of tokens which form the document and the list name is considered the document identifier; or a data.frame with columns doc_id and text; or a character vector with texts where the character vector names will be considered the document identifier

split

if newdata is not a list of tokens, the text is split into tokens using strsplit with the provided split argument

encoding

set the encoding of the text elements to the specified encoding. Defaults to 'UTF-8'.

...

not used
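
The three accepted shapes of newdata can be built from the same texts, as the base R sketch below shows. The object names (`texts`, `as_vector`, `as_df`, `as_tokens`) are made up for illustration, and the tokenisation uses strsplit directly to mirror what the split argument describes.

```r
texts <- c(doc1 = "there is no toilet on the bus",
           doc2 = "i like busses")

# 1. Named character vector: the names act as document identifiers.
as_vector <- texts

# 2. data.frame with the columns doc_id and text.
as_df <- data.frame(doc_id = names(texts),
                    text   = unname(texts),
                    stringsAsFactors = FALSE)

# 3. List of tokens: one character vector per document, named by document.
#    strsplit keeps the names of its input, so the identifiers carry over.
as_tokens <- strsplit(texts, split = " ")

str(as_tokens)
```

Any of these objects could then be passed as newdata. With a list of tokens, the tokenisation is already done, so the split argument should not come into play.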

Value

a matrix with one row per document containing the document vectors; the rownames of this matrix are the document identifiers
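
A matrix of this shape can be handled with ordinary base R indexing. The sketch below uses a hand-written stand-in for a doc2vec result. The values are made up, and the all-NA row is an assumption about how a document without usable tokens might appear.

```r
# Stand-in for a doc2vec() result: 3 documents, 2 dimensions (made-up values).
emb <- rbind(doc1 = c(0.5, -1.2),
             doc2 = c(1.1,  0.3),
             doc3 = c(NA,   NA))

# Look up one document vector by its identifier.
emb["doc2", ]

# Keep only the documents whose vector has no missing values.
usable <- emb[stats::complete.cases(emb), , drop = FALSE]
rownames(usable)
```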

See Also

word2vec, predict.word2vec

Examples

path  <- system.file(package = "word2vec", "models", "example.bin")
model <- read.word2vec(path)
x <- data.frame(doc_id = c("doc1", "doc2", "testmissingdata"), 
                text = c("there is no toilet. on the bus", "no tokens from dictionary", NA),
                stringsAsFactors = FALSE)
emb <- doc2vec(model, x, type = "embedding")
emb

newdoc <- doc2vec(model, "i like busses with a toilet")
word2vec_similarity(emb, newdoc)

## same documents, passed as a named character vector
x <- setNames(object = c("there is no toilet. on the bus", "no tokens from dictionary", NA), 
              nm = c("a", "b", "c"))
emb <- doc2vec(model, x, type = "embedding")
emb

## same documents, passed as a list of tokens (split on spaces and full stops)
x <- setNames(object = c("there is no toilet. on the bus", "no tokens from dictionary", NA), 
              nm = c("a", "b", "c"))
x <- strsplit(x, "[ .]")
emb <- doc2vec(model, x, type = "embedding")
emb

## show behaviour for zero-length character vectors and NA
x <- list(a = character(), b = c("bus", "toilet"), c = NA)
emb <- doc2vec(model, x, type = "embedding")
emb

word2vec documentation built on Oct. 8, 2023, 1:07 a.m.