sparse_to_stm: Convert sparse Matrix to format required by stm for modelling
In manuelbickel/textility: Utility functions for text mining

sparse_to_stm

R Documentation

Convert sparse Matrix to format required by stm for modelling

Description

stm provides a readCorpus function that performs the same conversion, but it may choke on large matrices. This function is a more memory-efficient alternative for sparseMatrix input: it uses text2vec::as.lda_c for the conversion, with slight adaptations so that the output meets stm's requirements for document indices.

Usage

sparse_to_stm(x, keep_rownames = TRUE)

Arguments

`x`	A `sparseMatrix`.
`keep_rownames`	If TRUE (the default), documents are named after the rownames of `x`. If FALSE, document names are `NULL`.

Value

A list y of 2 items: y$documents, the documents in a format similar to lda_c but with vocabulary indices starting at 1 instead of 0, and y$vocab, the vocabulary (i.e. the original colnames of x).
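
To make the return structure concrete, the following is a minimal sketch that builds the same kind of list by hand from a toy sparse matrix. It is not the textility implementation: the helper name to_stm_sketch and the toy data are hypothetical, and only the Matrix package is assumed. Each document becomes a two-row integer matrix, with 1-based vocabulary indices in the first row and the corresponding counts in the second.

```r
library(Matrix)

# Toy document-term matrix: 2 documents, 3 terms (illustrative data only).
dtm <- sparseMatrix(
  i = c(1, 1, 2),
  j = c(1, 3, 2),
  x = c(2, 1, 4),
  dims = c(2, 3),
  dimnames = list(c("doc1", "doc2"), c("apple", "banana", "cherry"))
)

# Hypothetical helper for illustration; not the function documented here.
to_stm_sketch <- function(x) {
  nz <- summary(x)  # triplets (i, j, x) of the non-zero entries, 1-based
  docs <- lapply(seq_len(nrow(x)), function(d) {
    keep <- which(nz$i == d)          # entries belonging to document d
    keep <- keep[order(nz$j[keep])]   # sort by vocabulary index
    # Row 1: 1-based vocabulary indices; row 2: term counts.
    rbind(as.integer(nz$j[keep]), as.integer(nz$x[keep]))
  })
  names(docs) <- rownames(x)
  list(documents = docs, vocab = colnames(x))
}

y <- to_stm_sketch(dtm)
y$documents$doc1  # columns (1, 2) and (3, 1): "apple" twice, "cherry" once
y$vocab
```

The sketch loops over documents for readability, so it does not reflect the memory or speed characteristics of the actual conversion.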

Examples


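library(text2vec)
library(stm)
data("movie_review")
# Build the vocabulary from the first 50 characters of three reviews
it = itoken(substr(movie_review$review[1:3], 1, 50), preprocess_function = tolower,
           tokenizer = word_tokenizer)
v = create_vocabulary(it)
vectorizer = vocab_vectorizer(v)
# Build the document-term matrix from the full text of the same reviews;
# only terms present in the vocabulary are counted
it = itoken(movie_review$review[1:3], preprocess_function = tolower,
           tokenizer = word_tokenizer)
dtm = create_dtm(it, vectorizer)
# Both conversions should yield the same result
all.equal(textility::sparse_to_stm(dtm), stm::readCorpus(dtm))
#[1] TRUE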

manuelbickel/textility documentation built on Nov. 25, 2022, 9:07 p.m.