read_dsm_matrix: Load DSM Matrix from File (wordspace)

read.dsm.matrix    R Documentation

Load DSM Matrix from File (wordspace)

Description

This function loads a DSM matrix from a disk file in the specified format (see section 'Formats' for details).

Usage


read.dsm.matrix(file, format = c("word2vec"),
                encoding = "UTF-8", batchsize = 1e6, verbose=FALSE)

Arguments

file

either a character string naming a file or a connection to read from (in text mode)

format

input file format (see section 'Formats'). The input file format cannot be guessed automatically.

encoding

character encoding of the input file (ignored if file is a connection)

batchsize

for certain input formats, the matrix is read in batches of batchsize cells each in order to limit memory overhead

verbose

if TRUE, show progress bar when reading in batches

Details

In order to read text formats from a compressed file, pass a gzfile, bzfile or xzfile connection with appropriate encoding in the argument file. Make sure not to open the connection before passing it to read.dsm.matrix.
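The following is a minimal sketch of reading from a gzip-compressed file, under stated assumptions: the file contents are made-up toy data, the names tmp, con and gz are illustrative, and the call is guarded so that it only runs if the wordspace package is installed. The connection is created with gzfile but deliberately left unopened, and the encoding is set on the connection because the encoding argument is ignored for connections.

```r
# Write a tiny word2vec-style file and compress it with gzip (made-up data)
tmp <- tempfile(fileext = ".txt.gz")
con <- gzfile(tmp, open = "wt", encoding = "UTF-8")
writeLines(c("2 3", "cat 0.12 -0.5 1", "dog 0.3 0.25 -0.75"), con)
close(con)

if (requireNamespace("wordspace", quietly = TRUE)) {
  # Create the connection but do not open it; the encoding is set here
  # because the encoding argument is ignored for connections
  gz <- gzfile(tmp, encoding = "UTF-8")
  M <- wordspace::read.dsm.matrix(gz, format = "word2vec")
  print(dim(M))
}
```

The same pattern should apply to bzfile and xzfile connections.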

Formats

Currently, the only supported file format is word2vec.

word2vec

This widely used text format for word embeddings is only suitable for a dense matrix. Row labels must be unique and may not contain whitespace. Values are usually rounded to a few decimal digits in order to keep file size manageable.

The first line of the file lists the matrix dimensions (rows, columns) separated by a single blank. It is followed by one text line for each matrix row, starting with the row label. The label and the cells are separated by single blanks, so row labels cannot contain whitespace.
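To make the layout concrete, here is a small base R sketch that writes a toy matrix in this format and parses it back by hand. It only illustrates the file layout described above and is not how read.dsm.matrix is implemented; the matrix values and the variable names are made up for the example.

```r
# Toy 2 x 3 dense matrix with unique, whitespace-free row labels
M <- matrix(c(0.12, -0.5, 1,
              0.3, 0.25, -0.75),
            nrow = 2, byrow = TRUE,
            dimnames = list(c("cat", "dog"), NULL))

# Header line: number of rows and columns, separated by a single blank
header <- paste(nrow(M), ncol(M))

# One line per row: label followed by the cell values, single blanks between
rows <- vapply(seq_len(nrow(M)), function(i) {
  paste(c(rownames(M)[i], as.character(M[i, ])), collapse = " ")
}, character(1))

tmp <- tempfile(fileext = ".txt")
writeLines(c(header, rows), tmp)

# Parse the file back by hand
txt    <- readLines(tmp)
dims   <- as.integer(strsplit(txt[1], " ", fixed = TRUE)[[1]])
fields <- strsplit(txt[-1], " ", fixed = TRUE)
labels <- vapply(fields, `[`, character(1), 1)

# vapply returns one column per line, so transpose to get rows x columns
values <- t(vapply(fields, function(f) as.numeric(f[-1]), numeric(dims[2])))
rownames(values) <- labels
```

Here txt[1] holds the dimensions line and each remaining element of txt holds one labelled matrix row.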

Author(s)

Stephanie Evert (https://purl.org/stephanie.evert)

See Also

write.dsm.matrix, read.dsm.triplet, read.dsm.ucs

Examples

fn <- system.file("extdata", "word2vec_hiero.txt", package="wordspace")
read.dsm.matrix(fn, format="word2vec")

wordspace documentation built on Aug. 23, 2022, 1:06 a.m.