doc2vec: Get document vectors based on a word2vec model

View source: R/doc2vec.R

Description

A document vector is the sum of the vectors of the words in the document, standardised by the scale of the vector space. This scale is the square root of the average inner product of the vector elements.
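The computation described above can be sketched in base R. This is a minimal illustration and not the package's implementation: the toy `embedding` matrix and the helper `doc_vector` are hypothetical, and the scale is read here as the root mean square of the elements of the summed vector. That reading is consistent with the example output on this page, where the squared elements of the doc1 row average to about 1.

```r
# Toy embedding matrix (hypothetical): 3 words, 4 dimensions, words as row names.
embedding <- rbind(
  bus    = c( 0.5, -1.0,  0.2,  0.3),
  toilet = c( 1.0,  0.5, -0.4,  0.1),
  no     = c(-0.2,  0.3,  0.6, -0.7)
)

# Hypothetical helper, not part of the word2vec package.
doc_vector <- function(tokens, embedding) {
  # Keep only tokens that are in the vocabulary (repeated tokens count repeatedly).
  known <- tokens[tokens %in% rownames(embedding)]
  if (length(known) == 0) {
    # No known tokens: return a vector of NA, one per dimension.
    return(rep(NA_real_, ncol(embedding)))
  }
  # Sum the word vectors, then divide by the root mean square of the sum.
  total <- colSums(embedding[known, , drop = FALSE])
  total / sqrt(mean(total^2))
}

v <- doc_vector(c("no", "toilet", "bus", "unknown"), embedding)
mean(v^2)                          # equals 1 up to floating point error
doc_vector(character(), embedding) # all NA
```

Under this reading, every non-missing document vector has a mean squared element of 1, so vectors of long and short documents are on the same scale.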

Usage

doc2vec(object, newdata, split = " ", encoding = "UTF-8", ...)

Arguments

object

a word2vec model as returned by word2vec or read.word2vec

newdata

one of: a list of tokens, where each list element is a character vector of the tokens which form the document and the list names are the document identifiers; a data.frame with columns doc_id and text; or a character vector of texts, where the names of the character vector are the document identifiers

split

in case newdata is not a list of tokens, the text is split into tokens using the function strsplit with the provided split argument

encoding

set the encoding of the text elements to the specified encoding. Defaults to 'UTF-8'.

...

not used
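To illustrate why the choice of split matters, the sketch below calls base R's strsplit directly on one of the example sentences. It only shows what strsplit itself returns. How doc2vec passes split to strsplit internally (for example, whether it is treated as a regular expression) is not stated on this page, so treat the second call as an assumption about pre-splitting the text yourself.

```r
text <- c(doc1 = "there is no toilet. on the bus")

# Splitting on a single space keeps punctuation attached: the token is "toilet."
tokens_space <- strsplit(text, split = " ")
tokens_space$doc1

# Splitting on a space or a period yields "toilet", plus an empty token
# where the period and the following space were adjacent.
tokens_punct <- strsplit(text, split = "[ .]")
tokens_punct$doc1
```

A token such as "toilet." is unlikely to be in the model's vocabulary, which may be why the pre-split example on this page gives slightly different values than the examples that rely on the default split.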

Value

a matrix with one row per document containing the document vectors; the row names of this matrix are the document identifiers
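Judging from the example output on this page, documents with missing text or with no tokens in the model dictionary come back as rows of NA. A minimal sketch of separating those rows before further processing, using a small hand-made matrix (hypothetical values) in place of a real doc2vec result:

```r
# Stand-in for the matrix returned by doc2vec (values are made up).
emb <- rbind(
  doc1 = c(1.2, -0.4,  0.7),
  doc2 = c(NA,   NA,   NA),
  doc3 = c(0.3,  0.9, -1.1)
)

# TRUE for rows without any missing value.
has_vector <- stats::complete.cases(emb)

# Identifiers of documents for which no vector could be computed.
missing_docs <- rownames(emb)[!has_vector]

# Keep only the usable rows; drop = FALSE preserves the matrix shape.
emb_ok <- emb[has_vector, , drop = FALSE]
```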

See Also

word2vec, predict.word2vec

Examples

library(word2vec)

## load the example model shipped with the package
path  <- system.file(package = "word2vec", "models", "example.bin")
model <- read.word2vec(path)

## newdata as a data.frame with columns doc_id and text
x <- data.frame(doc_id = c("doc1", "doc2", "testmissingdata"), 
                text = c("there is no toilet. on the bus", "no tokens from dictionary", NA),
                stringsAsFactors = FALSE)
emb <- doc2vec(model, x)
emb

## similarity between the document vectors and the vector of a new document
newdoc <- doc2vec(model, "i like busses with a toilet")
word2vec_similarity(emb, newdoc)

## newdata as a named character vector; the names are the document identifiers
x <- setNames(object = c("there is no toilet. on the bus", "no tokens from dictionary", NA), 
              nm = c("a", "b", "c"))
emb <- doc2vec(model, x)
emb

## newdata as a list of tokens, here split beforehand on spaces and periods
x <- setNames(object = c("there is no toilet. on the bus", "no tokens from dictionary", NA), 
              nm = c("a", "b", "c"))
x <- strsplit(x, "[ .]")
emb <- doc2vec(model, x)
emb

## show behaviour in case of NA or character data of no length
x <- list(a = character(), b = c("bus", "toilet"), c = NA)
emb <- doc2vec(model, x)
emb

Example output

                    [,1]      [,2]      [,3]     [,4]       [,5]       [,6]
doc1            1.480101 0.7821248 0.2432603 1.113621 -0.8062312 -0.9491834
doc2                  NA        NA        NA       NA         NA         NA
testmissingdata       NA        NA        NA       NA         NA         NA
                      [,7]     [,8]      [,9]      [,10]     [,11]     [,12]
doc1            -0.9286526 1.578904 0.8285673 -0.5626678 -1.376739 -1.010849
doc2                    NA       NA        NA         NA        NA        NA
testmissingdata         NA       NA        NA         NA        NA        NA
                   [,13]     [,14]     [,15]
doc1            1.162419 0.3092914 -0.790406
doc2                  NA        NA        NA
testmissingdata       NA        NA        NA
                     [,1]
doc1            0.9337153
doc2                   NA
testmissingdata        NA
      [,1]      [,2]      [,3]     [,4]       [,5]       [,6]       [,7]
a 1.480101 0.7821248 0.2432603 1.113621 -0.8062312 -0.9491834 -0.9286526
b       NA        NA        NA       NA         NA         NA         NA
c       NA        NA        NA       NA         NA         NA         NA
      [,8]      [,9]      [,10]     [,11]     [,12]    [,13]     [,14]
a 1.578904 0.8285673 -0.5626678 -1.376739 -1.010849 1.162419 0.3092914
b       NA        NA         NA        NA        NA       NA        NA
c       NA        NA         NA        NA        NA       NA        NA
      [,15]
a -0.790406
b        NA
c        NA
     [,1]      [,2]     [,3]     [,4]       [,5]       [,6]       [,7]     [,8]
a 1.42267 0.6485758 0.194632 1.112062 -0.7055673 -0.9245175 -0.8477923 1.837379
b      NA        NA       NA       NA         NA         NA         NA       NA
c      NA        NA       NA       NA         NA         NA         NA       NA
       [,9]      [,10]     [,11]    [,12]     [,13]     [,14]      [,15]
a 0.6850823 -0.4411038 -1.446427 -1.11854 0.9178053 0.3416399 -0.9312607
b        NA         NA        NA       NA        NA        NA         NA
c        NA         NA        NA       NA        NA        NA         NA
      [,1]      [,2]      [,3]     [,4]       [,5]       [,6]        [,7]
a       NA        NA        NA       NA         NA         NA          NA
b 1.737935 0.2215403 0.5507991 1.161996 -0.8006175 -0.6428222 -0.04923881
c       NA        NA        NA       NA         NA         NA          NA
      [,8]        [,9]      [,10]     [,11]     [,12]     [,13]     [,14]
a       NA          NA         NA        NA        NA        NA        NA
b 1.278011 -0.07773599 -0.3317408 -1.484094 -1.056723 0.7799516 0.5174833
c       NA          NA         NA        NA        NA        NA        NA
      [,15]
a        NA
b -1.809843
c        NA

word2vec documentation built on July 2, 2021, 5:07 p.m.