read_dsm_ucs: Load Raw DSM Data from Disk Files in UCS Export Format...

read.dsm.ucs    R Documentation

Load Raw DSM Data from Disk Files in UCS Export Format (wordspace)

Description

This function loads raw DSM data – a cooccurrence frequency matrix and tables of marginal frequencies – in UCS export format. The data are read from a directory containing several text files with predefined names, which can optionally be compressed (see ‘File Format’ below for details).

Usage


read.dsm.ucs(filename, encoding = getOption("encoding"), verbose = FALSE)

Arguments

filename

the name of a directory containing files with the raw DSM data.

encoding

character encoding of the input files, which will automatically be converted to R's internal representation if possible. See the ‘Encoding’ section of the help page for file for details.

verbose

if TRUE, a few progress and information messages are shown.

Value

An object of class dsm containing a dense or sparse DSM.

Note that the information tables for target terms (field rows) and feature terms (field cols) include the correct marginal frequencies from the UCS export files. Nonzero counts for rows and columns are added automatically unless they are already present in the disk files. Additional fields from the information tables as well as all global variables are preserved with their original names.
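A minimal usage sketch may help to illustrate the returned object. The directory path below is a placeholder, and the component names M, rows, cols and globals are assumed from the documentation of the dsm class rather than stated on this page, so treat them as assumptions.

```r
# Assumption: "path/to/ucs_export" is a placeholder; point it at a real
# directory produced by the UCS toolkit.
ucs_dir <- "path/to/ucs_export"

if (requireNamespace("wordspace", quietly = TRUE) && dir.exists(ucs_dir)) {
  model <- wordspace::read.dsm.ucs(ucs_dir, encoding = "UTF-8", verbose = TRUE)

  stopifnot(inherits(model, "dsm"))
  # Component names below are assumed from the dsm class documentation
  print(dim(model$M))      # cooccurrence matrix (dense or sparse)
  print(head(model$rows))  # target terms with marginal frequencies f
  print(head(model$cols))  # feature terms with marginal frequencies f
  print(model$globals$N)   # sample size read from globals.tbl
} else {
  message("wordspace not installed or directory not found; nothing loaded")
}
```

The guard keeps the snippet harmless when the package or the directory is missing; in an interactive session the call to read.dsm.ucs alone is usually all that is needed.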

File Format

The UCS export format is a directory containing the following files with the specified names:

  • M or M.mtx

    cooccurrence matrix, either dense in plain text (file M) or sparse in MatrixMarket format (file M.mtx)

  • rows.tbl

    row information (labels term, marginal frequencies f)

  • cols.tbl

    column information (labels term, marginal frequencies f)

  • globals.tbl

    table with single row containing global variables; must include variable N specifying sample size

Each individual file may be compressed with an additional filename extension .gz, .bz2 or .xz; read.dsm.ucs automatically decompresses such files when loading them.
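The file layout described above can be checked before loading. The following base R sketch uses a hypothetical helper, check_ucs_dir, which is not part of wordspace; it only tests for the listed file names with or without the compression extensions and does not validate file contents.

```r
# Hypothetical helper (not part of wordspace): report which of the expected
# UCS export files are present in a directory, allowing the optional
# compression extensions .gz, .bz2 and .xz.
check_ucs_dir <- function(dir) {
  exts <- c("", ".gz", ".bz2", ".xz")
  # TRUE if `name` exists in `dir`, either plain or compressed
  has_file <- function(name) {
    any(file.exists(file.path(dir, paste0(name, exts))))
  }
  found <- vapply(c("M", "M.mtx", "rows.tbl", "cols.tbl", "globals.tbl"),
                  has_file, logical(1))
  # Assumption: a usable export needs one matrix file plus all three tables
  ok <- (found[["M"]] || found[["M.mtx"]]) &&
    all(found[c("rows.tbl", "cols.tbl", "globals.tbl")])
  list(found = found, complete = ok)
}

# Demonstration on a temporary directory with empty placeholder files
demo_dir <- file.path(tempdir(), "ucs_demo")
dir.create(demo_dir, showWarnings = FALSE)
file.create(file.path(demo_dir, c("M.mtx.gz", "rows.tbl", "cols.tbl.xz")))
res <- check_ucs_dir(demo_dir)
print(res$found)
print(res$complete)  # globals.tbl is missing in this demonstration
```

Running such a check first gives a clearer message about a missing file than an error raised part-way through loading.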

Author(s)

Stephanie Evert (https://purl.org/stephanie.evert)

References

The UCS toolkit is a software package for collecting and manipulating co-occurrence data available from http://www.collocations.de/software.html.

UCS relies on compressed text files as its main storage format. They can be exported as a DSM with ucs-tool export-dsm-matrix.

See Also

dsm, read.dsm.triplet


wordspace documentation built on Sept. 9, 2022, 3:04 p.m.