convert a dfm to a non-quanteda format

Description

Convert a quanteda dfm-class object to a format usable by other text analysis packages. The general function convert provides conversion from a dfm to the document-term representation used by each text analysis package for which a conversion is defined. To keep usage as consistent as possible with those packages, quanteda also provides direct conversion functions in their own idiom, for example as.wfm to coerce a dfm into the wfm format from the austin package, and quantedaformat2dtm for using a dfm with the topicmodels package.

Usage

convert(x, to, ...)

## S3 method for class 'dfm'
convert(x, to = c("lda", "tm", "stm", "austin", "topicmodels"),
  docvars = NULL, ...)

as.wfm(x)

## S3 method for class 'dfm'
as.wfm(x)

as.DocumentTermMatrix(x, ...)

## S3 method for class 'dfm'
as.DocumentTermMatrix(x, ...)

dfm2ldaformat(x)

## S3 method for class 'dfm'
dfm2ldaformat(x)

quantedaformat2dtm(x)

## S3 method for class 'dfm'
quantedaformat2dtm(x)

Arguments

x

dfm to be converted

to

target conversion format, consisting of the name of the package into whose document-term matrix representation the dfm will be converted:

"lda"

a list with components "documents" and "vocab" as needed by lda.collapsed.gibbs.sampler from the lda package

"tm"

a DocumentTermMatrix from the tm package

"stm"

the format for the stm package

"austin"

the wfm format from the austin package

"topicmodels"

the "dtm" format as used by the topicmodels package

...

not used here

docvars

optional data.frame of document variables used as the meta information in conversion to the stm package format. Supplying it allows the conversion to keep only the document variables that correspond to documents with non-zero counts.
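The role of docvars can be illustrated with a minimal base R sketch. It does not use quanteda or stm, and the objects counts and meta are hypothetical toy data, not part of either package. It only shows the general idea of dropping documents with no counted features and subsetting the document variables with the same index, so the two stay aligned.

```r
# Toy document-feature counts (hypothetical data): doc2 has no features.
counts <- matrix(c(2, 0, 1,
                   0, 0, 0,
                   1, 3, 0),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2", "doc3"),
                                 c("alpha", "beta", "gamma")))

# Hypothetical document variables, one row per document.
meta <- data.frame(year = c(1973, 1977, 1981),
                   row.names = rownames(counts))

# Keep only documents with at least one counted feature, and
# subset the metadata with the same logical index.
nonempty <- rowSums(counts) > 0
counts_kept <- counts[nonempty, , drop = FALSE]
meta_kept <- meta[nonempty, , drop = FALSE]

rownames(meta_kept)
```

Whether convert() performs the subsetting in exactly this way is an assumption; the sketch is meant only to show why passing docvars alongside a dfm with empty documents is useful.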

Details

We recommend using convert() rather than the package-specific functions. In fact, we are considering whether to remove the specific functions altogether and support these conversions only through convert().

We may also use this function, eventually, for converting other classes of objects such as a 'corpus' or 'tokenizedList'.

as.wfm converts a quanteda dfm into the wfm format used by the austin package.

as.DocumentTermMatrix converts a quanteda dfm into the tm package's DocumentTermMatrix format.

dfm2ldaformat converts a dfm into the list representation of terms in documents used by the lda package.

quantedaformat2dtm converts a dfm into the sparse simple triplet matrix representation of terms in documents used by the topicmodels package.

Value

A converted object determined by the value of to (see above). See conversion target package documentation for more detailed descriptions of the return formats.

dfm2ldaformat returns a list with components "documents" and "vocab" as needed by lda.collapsed.gibbs.sampler.

quantedaformat2dtm returns a "dtm" sparse matrix object for use with the topicmodels package.
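To make the "documents" and "vocab" structure more concrete, here is a minimal base R sketch of the layout that the lda package expects: each document is a two-row integer matrix whose first row holds 0-based indices into vocab and whose second row holds the counts. The helper to_lda_sketch and the toy matrix are hypothetical and written for illustration only; this is not quanteda's implementation of dfm2ldaformat.

```r
# Toy document-feature counts (hypothetical data).
counts <- matrix(c(2L, 0L, 1L,
                   0L, 3L, 0L),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("alpha", "beta", "gamma")))

# Hypothetical helper: build an lda-style list from a dense count matrix.
to_lda_sketch <- function(m) {
  documents <- lapply(seq_len(nrow(m)), function(i) {
    present <- which(m[i, ] > 0)
    # Row 1: 0-based indices into vocab; row 2: the matching counts.
    rbind(as.integer(present - 1L), as.integer(m[i, present]))
  })
  names(documents) <- rownames(m)
  list(documents = documents, vocab = colnames(m))
}

ldalike <- to_lda_sketch(counts)
str(ldalike)
```

Features that do not occur in a document are simply absent from that document's matrix, which is what makes the representation sparse.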

Note

The tm package version of as.TermDocumentMatrix accepts a weighting argument, which supplies a weighting function for the TermDocumentMatrix. Here the default is term frequency weighting. If you want a different weighting, apply it after converting, using one of the tm weighting functions.

Examples

mycorpus <- subset(inaugCorpus, Year > 1970)
quantdfm <- dfm(mycorpus, verbose = FALSE)

# austin's wfm format
austindfm <- as.wfm(quantdfm)
identical(austindfm, convert(quantdfm, to = "austin"))

# tm's DocumentTermMatrix format
tmdfm <- as.DocumentTermMatrix(quantdfm)
str(tmdfm)

# stm package format
stmdfm <- convert(quantdfm, to = "stm")
str(stmdfm)
# illustrate what happens with zero-length documents
quantdfm2 <- dfm(c(punctOnly = "!!!", mycorpus[-1]), verbose = FALSE)
rowSums(quantdfm2)
stmdfm2 <- convert(quantdfm2, to = "stm", docvars = docvars(mycorpus))
str(stmdfm2)
 
# topicmodels package format
topicmodelsdfm <- quantedaformat2dtm(quantdfm)
identical(topicmodelsdfm, convert(quantdfm, to = "topicmodels"))

# lda package format
ldadfm <- convert(quantdfm, to = "lda")
str(ldadfm)
identical(ldadfm[1], stmdfm[1])

# calling dfm2ldaformat directly
ldadfm <- dfm2ldaformat(quantdfm)
str(ldadfm)
