read_word2vec: Read a word2vec embedding file
In sentencepiece: Text Tokenization using Byte Pair Encoding and Unigram Modelling

read_word2vec

R Documentation

Description

Read a word2vec embedding file as a dense matrix. This uses read.wordvectors from the word2vec package.

Usage

read_word2vec(
  x,
  type = c("txt", "bin"),
  n = .Machine$integer.max,
  encoding = "UTF-8",
  normalize = TRUE
)

Arguments

`x`	path to the file
`type`	either 'bin' or 'txt', indicating whether `x` is a binary file or a text file
`n`	integer, the maximum number of words to read in. Defaults to reading all words.
`encoding`	encoding to be assumed for the words. Defaults to 'UTF-8'
`normalize`	logical indicating whether to normalize the embeddings by dividing by the factor sqrt(sum(x . x) / length(x)). Defaults to TRUE.
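
The normalization factor can be illustrated with a small base R sketch. It assumes that `x` in the formula stands for the embedding vector of a single token, so the factor is computed per row; the matrix `emb` and its token names are made up for illustration, and this is not the package's own code.

```r
# Hypothetical toy embedding matrix: one row per token, 4 dimensions each.
emb <- matrix(c(1, 2, 2, 4,
                0, 3, 0, 4),
              nrow = 2, byrow = TRUE,
              dimnames = list(c("tok_a", "tok_b"), NULL))

# Assumption: the factor sqrt(sum(x . x) / length(x)) is taken per row,
# i.e. it is the root mean square of each token's embedding.
rms <- sqrt(rowSums(emb^2) / ncol(emb))

# Divide every row by its own factor. `rms` has one entry per row, so
# R's column-major recycling pairs each element with the factor of its row.
emb_norm <- emb / rms
```

Under that assumption, every row of `emb_norm` has a root mean square of 1, which puts all embeddings on a comparable scale.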

Value

a matrix with one row per token containing the embedding of the token
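
To make the shape of the result concrete, here is a minimal base R sketch that builds such a matrix from a file in the usual word2vec text layout (a header line with vocabulary size and dimension, then one line per token). The file contents and tokens are hypothetical, storing the tokens as row names is an assumption, and this is not how `read_word2vec` itself is implemented.

```r
# Write a hypothetical 3-token, 2-dimensional embedding in word2vec text layout:
# a header "<vocab size> <dimension>", then "<token> <values...>" per token.
path <- tempfile(fileext = ".txt")
writeLines(c("3 2",
             "de 0.1 0.2",
             "en 0.3 0.4",
             "van 0.5 0.6"), path)

lines  <- readLines(path, encoding = "UTF-8")
header <- as.integer(strsplit(lines[1], " ", fixed = TRUE)[[1]])
fields <- strsplit(lines[-1], " ", fixed = TRUE)

# One row per token; the tokens themselves become the row names (assumed layout).
emb <- t(vapply(fields, function(f) as.numeric(f[-1]), numeric(header[2])))
rownames(emb) <- vapply(fields, `[`, character(1), 1)
```

With the tokens as row names, a single embedding can be looked up by name, for example `emb["en", ]`.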

Examples

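library(sentencepiece)
# Folder holding the example models shipped with the sentencepiece package
folder    <- system.file(package = "sentencepiece", "models")

# Read the binary word2vec file and show the first rows
embedding <- file.path(folder, "nl.wiki.bpe.vs1000.d25.w2v.bin")
embedding <- read_word2vec(embedding, type = "bin")
head(embedding)

# Read the text-format word2vec file and show the first 10 rows
embedding <- file.path(folder, "nl.wiki.bpe.vs1000.d25.w2v.txt")
embedding <- read_word2vec(embedding, type = "txt")
head(embedding, n = 10)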

sentencepiece documentation built on Nov. 13, 2022, 5:05 p.m.