dtm_resampler: Resamples an input DTM to generate new DTMs

View source: R/utils-dtm.R

dtm_resampler R Documentation

Resamples an input DTM to generate new DTMs

Description

Takes any DTM and randomly resamples from each row, creating a new DTM.

Usage

dtm_resampler(dtm, alpha = NULL, n = NULL)

Arguments

dtm

Document-term matrix with terms as columns. Works with DTMs produced by any popular text analysis package, or you can use the dtm_builder() function.

alpha

Number indicating the proportion of each document's length to return, e.g., alpha = 1 returns resampled rows that are the same lengths as the rows of the original DTM.

n

Integer indicating the length of documents to be returned, e.g., n = 100L will bring documents shorter than 100 tokens up to 100, while bringing documents longer than 100 tokens down to 100.

Details

Using the row counts as probabilities, each document's tokens are resampled with replacement up to a certain proportion of the row count (set by alpha). This function can be used with iteration to "bootstrap" a DTM without returning to the raw text. It does not iterate, however, so operations can be performed on one DTM at a time without storing multiple DTMs in memory.
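The resampling step described above can be sketched in base R. This is only a minimal illustration, not the text2map implementation: resample_row() and toy_dtm are hypothetical names, the sketch uses a small dense matrix rather than a sparse "dgCMatrix", and it assumes every row has at least one token. The loop shows the bootstrap pattern of holding one resampled DTM in memory at a time and storing only a summary statistic.

```r
# Hypothetical helper (not part of text2map): resample one row of counts.
resample_row <- function(counts, size) {
  # Draw term indices with replacement, weighted by the observed counts.
  draws <- sample.int(length(counts), size = size, replace = TRUE, prob = counts)
  # Count how often each term was drawn.
  tabulate(draws, nbins = length(counts))
}

set.seed(42)
toy_dtm <- matrix(c(4, 0, 6,
                    2, 8, 0),
                  nrow = 2, byrow = TRUE,
                  dimnames = list(c("doc1", "doc2"), c("a", "b", "c")))

# Bootstrap: build one resampled DTM per iteration and keep only a statistic
# (here, the share of term "a" in doc1), so earlier DTMs can be discarded.
n_iter <- 100
share_a <- numeric(n_iter)
for (b in seq_len(n_iter)) {
  # size = sum(r) corresponds to alpha = 1: same length as the original row.
  boot <- t(apply(toy_dtm, 1, function(r) resample_row(r, size = sum(r))))
  share_a[b] <- boot[1, 1] / sum(boot[1, ])
}
```

Because terms with a count of zero have zero sampling probability, they remain zero in every resampled row; only the relative frequencies of observed terms vary between iterations.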

If alpha is less than 1, then a proportion of each document's length is returned. For example, alpha = 0.50 will return a resampled DTM where each row has half the tokens of the original DTM. If alpha = 2, then each row in the resampled DTM has twice the number of tokens of the original DTM.

If an integer is provided to n, then all documents will be resampled to that length. For example, n = 2000L will resample each document until it is 2000 tokens long, meaning documents shorter than 2000 tokens will be increased in length, while those longer than 2000 tokens will be decreased in length. alpha and n should not be specified at the same time.
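The arithmetic behind alpha and n can be worked through with a small sketch. target_lengths() is a hypothetical helper written for this illustration, not a text2map function; rounding the scaled lengths and raising an error when both arguments are supplied are assumptions, not documented behavior of dtm_resampler().

```r
# Hypothetical helper (not part of text2map): target length per document.
target_lengths <- function(doc_lengths, alpha = NULL, n = NULL) {
  # Assumption: exactly one of alpha or n must be supplied.
  stopifnot(xor(is.null(alpha), is.null(n)))
  if (!is.null(alpha)) {
    # Scale each document's length by alpha (rounding is an assumption).
    round(alpha * doc_lengths)
  } else {
    # Every document is brought to the same fixed length n.
    rep(as.integer(n), length(doc_lengths))
  }
}

doc_lengths <- c(doc1 = 100, doc2 = 3000)

half    <- target_lengths(doc_lengths, alpha = 0.5)  # 50 and 1500 tokens
doubled <- target_lengths(doc_lengths, alpha = 2)    # 200 and 6000 tokens
fixed   <- target_lengths(doc_lengths, n = 2000L)    # 2000 tokens each
```

With n = 2000L, the 100-token document is sampled up and the 3000-token document is sampled down, matching the description above.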

Value

Returns a document-term matrix of class "dgCMatrix".


text2map documentation built on July 9, 2023, 6:35 p.m.