foreign: Read Document-Term Matrices

Description


Read document-term matrices stored in special file formats.


Usage

read_dtm_Blei_et_al(file, vocab = NULL)
read_dtm_MC(file, scalingtype = NULL)



Arguments

file: a character string with the name of the file to read.

vocab: a character string with the name of a vocabulary file (giving the terms, one per line), or NULL.

scalingtype: a character string specifying the type of scaling to be used, or NULL (default), in which case the scaling is inferred from the names of the files with non-zero entries found (see Details).


Details

read_dtm_Blei_et_al reads the (List of Lists type sparse matrix) format employed by the Latent Dirichlet Allocation and Correlated Topic Model C codes by Blei et al (

MC is a toolkit for creating vector models from text documents (see It employs a variant of Compressed Column Storage (CCS) sparse matrix format, writing data into several files with suitable names: e.g., a file with ‘_dim’ appended to the base file name stores the matrix dimensions. The non-zero entries are stored in a file the name of which indicates the scaling type used: e.g., ‘_tfx_nz’ indicates scaling by term frequency (t), inverse document frequency (f) and no normalization (x). See ‘README’ in the MC sources for more information.

read_dtm_MC reads such sparse matrix information with argument file giving the path with the base file name.
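The file-naming convention described above can be illustrated with a small base-R sketch. It is not the code used by read_dtm_MC: the helper name infer_scalingtype and the base name "corpus" are hypothetical, and the sketch assumes that the scaling type is simply the part of the file name between the base name and the ‘_nz’ suffix.

```r
# Sketch only: a hypothetical helper, not tm's implementation.
# Assumption: the file with the non-zero entries is named
# <base>_<scalingtype>_nz, e.g. "corpus_tfx_nz".
infer_scalingtype <- function(base, files) {
  prefix <- paste0(base, "_")
  suffix <- "_nz"
  # Keep names that start with "<base>_", end with "_nz",
  # and have at least one character in between.
  hits <- files[startsWith(files, prefix) &
                endsWith(files, suffix) &
                nchar(files) > nchar(prefix) + nchar(suffix)]
  # The scaling type is what remains between prefix and suffix.
  substring(hits, nchar(prefix) + 1L, nchar(hits) - nchar(suffix))
}

# Illustrative file names following the convention from the text.
files <- c("corpus_dim", "corpus_tfx_nz")
scaling <- infer_scalingtype("corpus", files)
print(scaling)
```

If several files with non-zero entries exist for the same base name, this helper returns all candidate scaling types; in that situation passing scalingtype explicitly avoids the ambiguity.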


Value

A document-term matrix.
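To make the relation between the Blei et al. file format and the resulting matrix more concrete, here is a minimal base-R sketch. It is an illustration under stated assumptions, not the implementation of read_dtm_Blei_et_al: it assumes each line has the form "N term:count term:count ..." with zero-based term indices (the layout commonly used by the LDA C code), it builds a dense matrix instead of a sparse one, and the helper name parse_ldac is hypothetical.

```r
# Sketch only: illustrates the assumed on-disk layout, not tm's code.
# Write a tiny example file with three documents.
path <- tempfile(fileext = ".dat")
writeLines(c("2 0:1 2:3",
             "1 1:2",
             "3 0:1 1:1 2:1"), path)

parse_ldac <- function(path) {   # hypothetical helper
  lines <- trimws(readLines(path))
  lines <- lines[nzchar(lines)]
  entries <- lapply(strsplit(lines, "[[:space:]]+"), function(tokens) {
    n <- as.integer(tokens[1])                  # number of distinct terms
    pairs <- strsplit(tokens[-1], ":", fixed = TRUE)
    stopifnot(length(pairs) == n)
    list(j = as.integer(vapply(pairs, `[`, "", 1)) + 1L,  # 0-based -> 1-based
         v = as.numeric(vapply(pairs, `[`, "", 2)))       # term counts
  })
  nterms <- max(unlist(lapply(entries, `[[`, "j")))
  # Rows are documents, columns are terms.
  m <- matrix(0, nrow = length(entries), ncol = nterms)
  for (i in seq_along(entries)) m[i, entries[[i]]$j] <- entries[[i]]$v
  m
}

dtm <- parse_ldac(path)
print(dtm)
```

With tm installed, the same file would normally be read with read_dtm_Blei_et_al(path), optionally passing a vocabulary file via vocab so that the columns carry term names.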

See Also

read_stm_MC in package slam.

tm documentation built on July 12, 2020, 3 p.m.