instances_Matrix: Extract term-document matrix from instances

instances_Matrix    R Documentation

Extract term-document matrix from instances

Description

Given an instance list, returns a term-document matrix (sparse format).

Usage

instances_Matrix(instances, verbose = getOption("dfrtopics.verbose"))

Arguments

instances

the name of a file holding MALLET instances, or an rJava reference to a MALLET InstanceList object (e.g. as returned by read_instances)

verbose

if TRUE, print progress messages

Details

If the matrix is m, then m[i, j] gives the weight of word i in document j. If another term-weighting is desired, this matrix is convenient to operate on.
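
As an illustration of re-weighting, the following is a minimal sketch of a tf-idf transformation on a matrix of this shape. It is not part of dfrtopics: the toy matrix m stands in for the result of instances_Matrix, the names df, idf, and m_tfidf are made up for the example, and the idf formula shown is only one common variant.

```r
library(Matrix)

# Toy stand-in for the result of instances_Matrix():
# 3 words (rows) x 4 documents (columns), holding raw counts.
m <- sparseMatrix(
    i = c(1, 1, 2, 3, 3, 1),
    j = c(1, 2, 2, 3, 4, 4),
    x = c(2, 1, 3, 1, 4, 1),
    dims = c(3, 4)
)

# Document frequency: the number of documents each word occurs in.
df <- rowSums(m > 0)

# Inverse document frequency (one common variant). A word with df == 0
# would give Inf here, so filter such rows first if they can occur.
idf <- log(ncol(m) / df)

# Scale each row (word) by its idf. Multiplying by a sparse diagonal
# matrix keeps the result sparse.
m_tfidf <- Diagonal(x = idf) %*% m
```

Working row-wise through a diagonal multiplication avoids converting the matrix to dense form, which matters for a realistically sized vocabulary.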

For the idea of going sparse, h/t Ben Marwick. The conversion is fairly slow because it copies all the corpus data from Java to R and then commits the Ultimate Sin of using a for loop. Pass verbose=TRUE for progress reports. TODO: make smarter.

Value

a sparseMatrix with documents in columns and words in rows. The ordering of the words is as in the vocabulary (instances_vocabulary), and the ordering of documents is as in the instance list (instances_ids).
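
Because the rows and columns follow the vocabulary and document-id orderings, the matrix can be labeled for lookup by name. The sketch below uses toy data so that it runs without MALLET: vocab and ids are hypothetical stand-ins for what instances_vocabulary and instances_ids would supply (assuming each returns a character vector in the order described above), and m stands in for the result of instances_Matrix.

```r
library(Matrix)

# Toy stand-in for the result of instances_Matrix():
# 3 words (rows) x 2 documents (columns).
m <- sparseMatrix(
    i = c(1, 2, 2, 3),
    j = c(1, 1, 2, 2),
    x = c(5, 1, 2, 7),
    dims = c(3, 2)
)

# Hypothetical stand-ins for the vocabulary and the document ids.
vocab <- c("whale", "ship", "sea")
ids <- c("doc_a", "doc_b")

# Attach the labels: words name the rows, document ids name the columns.
dimnames(m) <- list(vocab, ids)

# Look up the weight of one word in one document by name.
ship_in_b <- m["ship", "doc_b"]

# Total weight per document and per word.
doc_totals <- colSums(m)
word_totals <- rowSums(m)
```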

See Also

sparseMatrix, instances_vocabulary, instances_ids, read_wordcounts for access to unprocessed wordcounts data (i.e. before stopword removal, etc.).


agoldst/dfrtopics documentation built on July 15, 2022, 4:13 p.m.