write_dsm_matrix: Export DSM Matrix to File (wordspace)

Description

This function exports a DSM matrix to a disk file in the specified format (see section ‘Formats’ for details).

Usage

write.dsm.matrix(x, file, format = c("word2vec"), round=FALSE,
                 encoding = "UTF-8", batchsize = 1e6, verbose=FALSE)

Arguments

x

a dense or sparse matrix representing a DSM, or an object of class dsm

file

either a character string naming a file or a connection open for writing (in text mode)

format

desired output file format. See section ‘Formats’ for a list of available formats and their limitations.

round

for some output formats, numbers can be rounded to the specified number of decimal digits in order to reduce file size

encoding

character encoding of the output file (ignored if file is a connection)

batchsize

for certain output formats, the matrix is written in batches of batchsize cells each in order to limit memory overhead

verbose

if TRUE, show progress bar when writing in batches

Details

In order to save text formats to a compressed file, pass a gzfile, bzfile or xzfile connection with appropriate encoding in the argument file. Make sure not to open the connection before passing it to write.dsm.matrix. See section ‘Examples’ below.
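The difference between a connection that has been created and one that has been opened can be checked in base R. The following is a minimal sketch that does not call wordspace at all: writeLines() stands in for write.dsm.matrix as the function that opens the connection for the duration of the write, and the three text lines are made-up sample data.

```r
# Base R only; writeLines() is a stand-in for write.dsm.matrix() here.
fn <- tempfile(fileext = ".txt.bz2")
fh <- bzfile(fn, encoding = "UTF-8")   # created, but not opened yet
stopifnot(!isOpen(fh))

lines <- c("2 2", "cat 0.5 1.25", "dog -2 3")   # made-up sample content
writeLines(lines, fh)                  # opens the connection, writes, closes it again
stopifnot(!isOpen(fh))
close(fh)                              # release the connection object

# read the compressed file back through a fresh connection
con <- bzfile(fn, encoding = "UTF-8")
back <- readLines(con)
close(con)
```

The same pattern applies to gzfile and xzfile; only the constructor changes.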

Formats

Currently, the only supported file format is word2vec.

word2vec

This widely used text format for word embeddings is only suitable for a dense matrix. Row labels must be unique and may not contain whitespace. Values are usually rounded to a few decimal digits in order to keep file size manageable.

The first line of the file lists the matrix dimensions (rows, columns) separated by a single blank. It is followed by one text line for each matrix row, starting with the row label. The label and the cell values are separated by single blanks, which is why row labels cannot contain whitespace.
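The layout described above can be reproduced by hand for a small matrix. This is only a sketch in base R: format_word2vec is a hypothetical helper written for illustration, not part of wordspace, and the exact number formatting produced by write.dsm.matrix may differ.

```r
# Hypothetical helper (not part of wordspace): build the text lines of a
# word2vec-style file from a dense numeric matrix with row labels.
format_word2vec <- function(M, digits = 3) {
  stopifnot(is.matrix(M), is.numeric(M), !is.null(rownames(M)))
  labels <- rownames(M)
  # row labels must be unique and free of whitespace
  stopifnot(anyDuplicated(labels) == 0, !grepl("[[:space:]]", labels))
  header <- paste(nrow(M), ncol(M))            # "rows columns"
  body <- apply(round(M, digits), 1, paste, collapse = " ")
  c(header, paste(labels, body))               # label, then cell values
}

M <- matrix(c(0.5, -2, 1.25, 3), nrow = 2,
            dimnames = list(c("cat", "dog"), c("f1", "f2")))
lines <- format_word2vec(M)
cat(lines, sep = "\n")
# 2 2
# cat 0.5 1.25
# dog -2 3
```

Column labels do not appear in the output, since the format only stores the dimensions and one labelled line per row.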

Author(s)

Stefan Evert (http://purl.org/stefan.evert)

See Also

read.dsm.matrix

Examples

model <- dsm.score(DSM_TermTerm, score="MI", normalize=TRUE) # a typical DSM

# save in word2vec text format (rounded to 3 digits)
fn <- tempfile(fileext=".txt")
write.dsm.matrix(model, fn, format="word2vec", round=3)
cat(readLines(fn), sep="\n")

# save as compressed file in word2vec format
fn <- tempfile(fileext=".txt.gz")
fh <- gzfile(fn, encoding="UTF-8") # need to set file encoding here
write.dsm.matrix(model, fh, format="word2vec", round=3)
# write.dsm.matrix() automatically opens and closes the connection
cat(readLines(gzfile(fn)), sep="\n")

wordspace documentation built on Jan. 9, 2020, 1:08 a.m.