q_dtm: Quick DocumentTermMatrix


Description

Make a DocumentTermMatrix from a vector of text and an optional vector of documents. To stem the terms as well, use q_dtm_stem, a version of q_dtm that stems with SnowballC's wordStem.

Usage

q_dtm(text, docs = seq_along(text), to = "tm", keep.hyphen = FALSE,
  ngrams = NULL, ...)

q_dtm_stem(text, docs = seq_along(text), to = "tm", keep.hyphen = FALSE,
  ngrams = NULL, ...)

Arguments

text

A vector of strings.

docs

A vector of document names.

to

Target conversion format, given as the name of the package into whose document-term matrix representation the dfm will be converted:

"lda"

a list with components "documents" and "vocab" as needed by lda.collapsed.gibbs.sampler from the lda package

"tm"

a DocumentTermMatrix from the tm package

"stm"

the format for the stm package

"austin"

the wfm format from the austin package

"topicmodels"

the "dtm" format as used by the topicmodels package

keep.hyphen

Logical. If TRUE, hyphens are retained in the terms (e.g., "math-like" is kept as "math-like"); otherwise hyphens split terms (e.g., "math-like" becomes "math" and "like").

ngrams

A vector of ngrams (multiple words separated by spaces). The ngrams supplied here are retained as terms in the matrix.

...

Additional arguments passed to dfm.
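
The effect of keep.hyphen and ngrams on the resulting terms can be sketched in base R. This is only an illustration of the behavior described above, not gofastr's implementation: sketch_tokens is a hypothetical helper, and the lowercasing and the underscore trick used to protect ngrams are assumptions made for the sketch.

```r
# Hypothetical helper (not part of gofastr): illustrates the two options only.
sketch_tokens <- function(text, keep.hyphen = FALSE, ngrams = NULL) {
    # Assumption for this sketch: terms are lowercased
    text <- tolower(text)
    # Protect each requested ngram by joining its words with an underscore
    for (ng in ngrams) {
        text <- gsub(ng, gsub(" ", "_", ng, fixed = TRUE), text, fixed = TRUE)
    }
    # Without keep.hyphen, a hyphen acts as a term boundary
    if (!keep.hyphen) text <- gsub("-", " ", text, fixed = TRUE)
    # Split on anything that is not a letter, underscore, or hyphen
    tokens <- strsplit(text, "[^a-z_-]+")
    # Drop empty strings and restore the spaces inside protected ngrams
    lapply(tokens, function(x) gsub("_", " ", x[nzchar(x)], fixed = TRUE))
}

txt <- "A math-like proof for the middle class."
sketch_tokens(txt)                            # "math" and "like" are separate terms
sketch_tokens(txt, keep.hyphen = TRUE)        # "math-like" stays one term
sketch_tokens(txt, ngrams = "middle class")   # "middle class" stays one term
```

In the sketch, terms that contain a space appear only when ngrams is supplied, which mirrors the grep(" ", ...) checks in the Examples section.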

Value

Returns a DocumentTermMatrix.

See Also

dfm, convert

Examples

(x <- with(presidential_debates_2012, q_dtm(dialogue, paste(time, tot, sep = "_"))))
tm::weightTfIdf(x)

(x2 <- with(presidential_debates_2012, q_dtm_stem(dialogue, paste(time, tot, sep = "_"))))
remove_stopwords(x2, stem=TRUE)

bigrams <- c('make sure', 'governor romney', 'mister president',
    'united states', 'middle class', 'middle east', 'health care',
    'american people', 'dodd frank', 'wall street', 'small business')

grep(" ", x$dimnames$Terms, value = TRUE) #no ngrams

(x3 <- with(presidential_debates_2012,
    q_dtm(dialogue, paste(time, tot, sep = "_"), ngrams = bigrams)
))

grep(" ", x3$dimnames$Terms, value = TRUE) #ngrams

gofastr documentation built on May 2, 2019, 5:39 a.m.