data_store: Data Structure for 'hclusttext'
In trinker/clustext: Consistent Clustering for Text Data


A data structure that stores the text, the DocumentTermMatrix, and information about removed text elements that cannot be handled by the hierarchical_cluster function. This structure is required because it records important meta information, including the removed elements, that other clustext functions need. If the user wishes to combine documents (say, by a common grouping variable), it is recommended that this be handled by combine prior to using data_store.

data_store(text, doc.names, min.term.freq = 1, min.doc.len = 1,
  stopwords = tm::stopwords("english"), min.char = 3, max.char = NULL,
  stem = FALSE, denumber = TRUE)

`text`	A character vector.
`doc.names`	An optional vector of document names corresponding to the length of `text`.
`min.term.freq`	The minimum times a term must appear to be included in the `DocumentTermMatrix`.
`min.doc.len`	The minimum number of words a document must contain to be included in the data structure (otherwise it is stored as a `removed` element).
`stopwords`	A vector of stopwords to remove.
`min.char`	The minimum number of characters a word must have to be retained.
`max.char`	The maximum number of characters a word may have to be retained.
`stem`	Logical. If `TRUE` the `stopwords` will be stemmed.
`denumber`	Logical. If `TRUE` numbers will be excluded.
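
The effect of `min.doc.len` can be pictured with a small base R sketch. This is not the package's implementation: `filter_short_docs` is a hypothetical helper written only for illustration, and it assumes words are delimited by whitespace and that empty or missing documents count as zero words.

```r
# Hypothetical helper, not part of clustext: mimics the idea behind `min.doc.len`
filter_short_docs <- function(text, min.doc.len = 1) {
    # Split each document on whitespace; empty strings yield no words
    words <- strsplit(trimws(text), "\\s+")
    # Count the non-missing, non-empty words in each document
    n_words <- vapply(words, function(w) sum(!is.na(w) & nzchar(w)), integer(1))
    # Indices of documents that are too short to keep
    removed <- which(n_words < min.doc.len)
    list(
        text = if (length(removed) > 0) text[-removed] else text,
        removed = removed
    )
}

docs <- c("the economy is growing", "", "jobs", NA, "taxes and the deficit")
out <- filter_short_docs(docs, min.doc.len = 2)
out$text     # the documents that were kept
out$removed  # the indices of the documents that were dropped
```

Keeping the dropped indices alongside the retained text is what lets later steps map results back to the original input positions.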

Returns a list containing:

dtm: A tf-idf weighted DocumentTermMatrix
text: The text vector with unanalyzable elements removed
removed: The indices of the removed text elements, i.e., documents not meeting min.doc.len
n.nonsparse: The number of non-zero elements
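
To make "tf-idf weighted" and the non-zero count concrete, here is a minimal base R sketch on made-up counts. It assumes a common tf-idf variant (term frequency normalized by document length, idf as log2 of the number of documents over the number of documents containing the term); the exact weighting used inside `data_store` is not stated here, and the names `counts`, `tfidf`, and `n_nonsparse` are illustrative only.

```r
# Toy document-term counts (rows = documents, columns = terms); made-up data
counts <- matrix(
    c(2, 1, 0,
      0, 1, 3,
      1, 1, 0),
    nrow = 3, byrow = TRUE,
    dimnames = list(c("doc1", "doc2", "doc3"), c("jobs", "the", "taxes"))
)

# Term frequency: each count divided by its document's total word count
tf <- counts / rowSums(counts)

# Inverse document frequency: log2(number of documents / documents containing the term)
idf <- log2(nrow(counts) / colSums(counts > 0))

# Multiply every column of tf by the idf of its term
tfidf <- sweep(tf, 2, idf, `*`)

# Count of non-zero weights, analogous to what `n.nonsparse` describes
n_nonsparse <- sum(tfidf != 0)

tfidf
n_nonsparse
```

Under this variant, a term that appears in every document (here "the") receives a weight of zero, which is why tf-idf tends to downplay very common words.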

library(clustext)

data_store(presidential_debates_2012[["dialogue"]])

## Use `combine` to merge text prior to `data_store`
library(textshape)
library(dplyr)

dat <- presidential_debates_2012 %>%
    dplyr::select(person, time, dialogue) %>%
    textshape::combine()

## Elements in `ds` correspond to `dat` grouping vars
(ds <- with(dat, data_store(dialogue)))
dplyr::select(dat, -3)

## Add row names
(ds2 <- with(dat, data_store(dialogue, paste(person, time, sep = "_"))))
rownames(ds2[["dtm"]])

## Get a DocumentTermMatrix
get_dtm(ds2)