Plot documents, words and tokens removed at various word thresholds

Description

Plots how different lower thresholds in prepDocuments affect the size of the corpus.

Usage

plotRemoved(documents, lower.thresh)

Arguments

documents

The documents to be used for the stm model

lower.thresh

A vector of integers, each of which will be tested as a lower threshold for the prepDocuments function.

Details

For a given lower threshold, prepDocuments drops words that appear in fewer than that number of documents, and then removes any documents left with no words. This function lets the user pass a vector of lower thresholds and observe how prepDocuments would handle each one. It produces three plots, showing the number of words, the number of documents, and the total number of tokens removed as a function of the threshold value. In each plot, a dashed red line marks the total number of documents, words, or tokens in the corpus, respectively.
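The counting described above can be sketched in base R on a toy corpus. This is an illustration only, not the package's implementation: count_removed is a hypothetical helper, the toy documents assume the stm layout of one two-row integer matrix per document (row 1 holds vocabulary indices, row 2 holds counts), and the strict "fewer than" comparison follows the wording above. How prepDocuments treats a word whose document count exactly equals the threshold should be checked against its own documentation.

```r
# Toy corpus in the assumed stm layout: one 2-row integer matrix per document.
# Row 1 = vocabulary index, row 2 = count of that word in the document.
docs <- list(
  rbind(c(1L, 2L, 3L), c(2L, 1L, 1L)),
  rbind(c(1L, 2L),     c(1L, 3L)),
  rbind(c(1L, 4L),     c(1L, 1L)),
  rbind(c(5L),         c(2L))
)

# Hypothetical helper (not part of stm): for each threshold, count the
# documents, vocabulary entries, and tokens that would be removed.
count_removed <- function(docs, lower.thresh) {
  lower.thresh <- sort(lower.thresh)
  ids    <- unlist(lapply(docs, function(d) d[1, ]))  # word index of each entry
  counts <- unlist(lapply(docs, function(d) d[2, ]))  # token count of each entry
  doc_id <- rep(seq_along(docs), times = vapply(docs, ncol, integer(1)))
  # Number of documents containing each word (a word occurs once per document)
  doc_freq <- tapply(ids, ids, length)
  res <- lapply(lower.thresh, function(t) {
    # Assumption: strict "fewer than", following the Details text
    dropped_words <- as.integer(names(doc_freq)[doc_freq < t])
    drop_entry <- ids %in% dropped_words
    kept_docs <- unique(doc_id[!drop_entry])  # documents that keep some word
    c(ndocs   = length(docs) - length(kept_docs),
      nwords  = length(dropped_words),
      ntokens = sum(counts[drop_entry]))
  })
  out <- do.call(rbind, res)
  list(lower.thresh = lower.thresh,
       ndocs   = unname(out[, "ndocs"]),
       nwords  = unname(out[, "nwords"]),
       ntokens = unname(out[, "ntokens"]))
}

res <- count_removed(docs, lower.thresh = c(1, 2, 3, 4))
str(res)
```

The returned list mirrors the element names in the Value section, so the three vectors can be plotted against res$lower.thresh to approximate the three panels on a small example.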

Value

Invisibly returns a list with the following elements:

lower.thresh

The sorted threshold values.

ndocs

The number of documents dropped for each value of the lower threshold.

nwords

The number of entries of the vocab dropped for each value of the lower threshold.

ntokens

The number of tokens dropped for each value of the lower threshold.

See Also

prepDocuments

Examples

## Not run: 
plotRemoved(poliblog5k.docs, lower.thresh=seq(from = 10, to = 1000, by = 10))

## End(Not run)
