data_store: Data Structure for 'hclusttext'


Description

A data structure that stores the text, a DocumentTermMatrix, and information about removed text elements that cannot be handled by the hierarchical_cluster function. This structure is required because it documents important meta information, including removed elements, needed by other clustext functions. If the user wishes to combine documents (say, by a common grouping variable), it is recommended that this be handled by combine prior to using data_store.

Usage

data_store(text, doc.names, min.term.freq = 1, min.doc.len = 1,
  stopwords = tm::stopwords("english"), min.char = 3, max.char = NULL,
  stem = FALSE, denumber = TRUE)

Arguments

text

A character vector.

doc.names

An optional vector of document names, the same length as text.

min.term.freq

The minimum number of times a term must appear to be included in the DocumentTermMatrix.

min.doc.len

The minimum number of words a document must contain to be included in the data structure (otherwise it is stored as a removed element).

stopwords

A vector of stopwords to remove.

min.char

The minimum number of characters a word must have to be retained.

max.char

The maximum number of characters a word may have to be retained.

stem

Logical. If TRUE the stopwords will be stemmed.

denumber

Logical. If TRUE numbers will be excluded.
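
To make the filtering arguments concrete, the following is a rough base-R sketch of what min.doc.len, stopwords, min.char, max.char, and denumber describe. It is an illustration under stated assumptions, not the clustext implementation: the tokenizer, the tiny stopword list, the sample text, the helper keep_term, and the order in which the filters run are all hypothetical, and min.term.freq and stem are not covered.

```r
# Illustrative only: a base-R approximation of the filtering the arguments
# describe. This is NOT how clustext implements data_store.
text <- c(
  "The quick brown fox jumps over 2 lazy dogs",
  "",
  "Ok",
  "Data structures store text"
)

min.doc.len <- 1
min.char    <- 3
max.char    <- NULL
stopwords   <- c("the", "over")  # tiny hypothetical stopword list
denumber    <- TRUE

# Split each document into lower-case word tokens (assumed tokenizer)
tokens <- strsplit(tolower(text), "[^[:alnum:]]+")
tokens <- lapply(tokens, function(x) x[nzchar(x)])

# min.doc.len: documents with too few words are recorded as removed
doc_len <- lengths(tokens)
removed <- which(doc_len < min.doc.len)

# Term-level filters: numbers, stopwords, then word length bounds
keep_term <- function(x) {
  if (denumber) x <- x[!grepl("^[0-9]+$", x)]
  x <- x[!x %in% stopwords]
  x <- x[nchar(x) >= min.char]
  if (!is.null(max.char)) x <- x[nchar(x) <= max.char]
  x
}

kept_idx <- setdiff(seq_along(tokens), removed)
kept     <- lapply(tokens[kept_idx], keep_term)

removed  # index of the empty document
kept     # surviving terms for each retained document
```

In this sketch the empty second document falls below min.doc.len and is recorded in removed, while "Ok" stays as a document but loses its only term to min.char. Whether data_store treats such a document the same way is not stated here and should be checked against the package.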

Value

Returns a list containing:

dtm

A tf-idf weighted DocumentTermMatrix

text

The text vector with unanalyzable elements removed

removed

The indices of the removed text elements, i.e., documents not meeting min.doc.len

n.nonsparse

The number of non-zero elements
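
The elements listed above can be inspected by name. This short sketch assumes the clustext package is installed and that presidential_debates_2012 is available once it is loaded, as the Examples on this page imply; the requireNamespace guard is only there so the snippet does nothing when the package is missing.

```r
# Sketch: inspect each element of the list returned by data_store.
# Assumes clustext is installed (it is distributed via GitHub: trinker/clustext).
if (requireNamespace("clustext", quietly = TRUE)) {
  library(clustext)

  ds <- data_store(presidential_debates_2012[["dialogue"]])

  ds[["dtm"]]          # the tf-idf weighted DocumentTermMatrix
  length(ds[["text"]]) # number of documents kept after removals
  ds[["removed"]]      # indices of the removed text elements
  ds[["n.nonsparse"]]  # count of non-zero elements
}
```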

Examples

data_store(presidential_debates_2012[["dialogue"]])

## Use `combine` to merge text prior to `data_store`
library(textshape)
library(dplyr)

dat <- presidential_debates_2012 %>%
    dplyr::select(person, time, dialogue) %>%
    textshape::combine()

## Elements in `ds` correspond to `dat` grouping vars
(ds <- with(dat, data_store(dialogue)))
dplyr::select(dat, -3)

## Add row names
(ds2 <- with(dat, data_store(dialogue, paste(person, time, sep = "_"))))
rownames(ds2[["dtm"]])

## Get a DocumentTermMatrix
get_dtm(ds2)

trinker/clustext documentation built on May 31, 2019, 8:41 p.m.