convert: Convert a dfm to a non-quanteda format


View source: R/convert.R

Description

Convert a quanteda dfm object to a format usable by other text analysis packages. The general function convert converts a dfm to the document-term representation used by each text analysis package for which a conversion is defined.

Usage

convert(x, to = c("lda", "tm", "stm", "austin", "topicmodels", "lsa",
  "matrix", "data.frame"), docvars = NULL)

Arguments

x

a dfm to be converted

to

target conversion format, consisting of the name of the package into whose document-term matrix representation the dfm will be converted:

"lda"

a list with components "documents" and "vocab" as needed by the function lda.collapsed.gibbs.sampler from the lda package

"tm"

a DocumentTermMatrix from the tm package

"stm"

the format for the stm package

"austin"

the wfm format from the austin package

"topicmodels"

the "dtm" format as used by the topicmodels package

"lsa"

the "textmatrix" format as used by the lsa package
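
The list-based targets can be hard to picture from the description alone. The following is a minimal base R sketch, not quanteda's implementation, of the index/count layout that appears in the example output for the "lda" and "stm" targets: each document becomes a two-row integer matrix whose first row holds term indices and whose second row holds counts. The toy matrix counts and the helper to_index_count are hypothetical names introduced for illustration. The zero-based indexing for "lda" and one-based indexing for "stm" are read off the example output on this page.

```r
# Toy document-term counts (hypothetical data, not produced by quanteda)
counts <- matrix(c(2L, 0L, 1L,
                   0L, 3L, 1L),
                 nrow = 2, byrow = TRUE,
                 dimnames = list(c("doc1", "doc2"),
                                 c("apple", "banana", "cherry")))

# Hypothetical helper: for each document, build a 2-row integer matrix
# with term indices in row 1 and the matching counts in row 2.
to_index_count <- function(m, zero_based = TRUE) {
  offset <- if (zero_based) 1L else 0L
  docs <- lapply(seq_len(nrow(m)), function(i) {
    nz <- which(m[i, ] > 0)            # columns with non-zero counts
    rbind(as.integer(nz - offset),     # term indices
          as.integer(m[i, nz]))        # term counts
  })
  names(docs) <- rownames(m)
  list(documents = docs, vocab = colnames(m))
}

lda_like <- to_index_count(counts, zero_based = TRUE)   # indices start at 0
stm_like <- to_index_count(counts, zero_based = FALSE)  # indices start at 1
str(lda_like)
```

Comparing lda_like and stm_like shows that only the first row of each document matrix differs, by exactly one.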

docvars

optional data.frame of document variables used as the meta information in conversion to the stm package format. Supplying it allows the conversion to select only the document variables that correspond to documents with non-zero counts.
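
The alignment problem that docvars addresses can be sketched in base R. This is an illustration under the assumption that documents with a total count of zero are dropped during conversion; it does not call convert, and the objects counts and meta are hypothetical toy data.

```r
# Toy counts with one all-zero document (hypothetical data)
counts <- matrix(c(0L, 0L,
                   1L, 2L,
                   3L, 0L),
                 nrow = 3, byrow = TRUE,
                 dimnames = list(c("empty", "docA", "docB"),
                                 c("alpha", "beta")))

# One row of document variables per document, in the same order
meta <- data.frame(Year = c(2001, 2005, 2009),
                   row.names = rownames(counts))

# Keep only documents with non-zero total counts, and subset the
# document variables with the same logical index so the rows still match.
keep <- rowSums(counts) > 0
counts_kept <- counts[keep, , drop = FALSE]
meta_kept <- meta[keep, , drop = FALSE]

rownames(counts_kept)
rownames(meta_kept)
```

Subsetting both objects with the same logical vector keeps each remaining document paired with its own metadata row.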

Value

A converted object whose format is determined by the value of to (see above). See the documentation of the target package for more detailed descriptions of the return formats.

Examples

mycorpus <- corpus_subset(data_corpus_inaugural, Year > 1970)
quantdfm <- dfm(mycorpus, verbose = FALSE)

# austin's wfm format
identical(dim(quantdfm), dim(convert(quantdfm, to = "austin")))

# stm package format
stmdfm <- convert(quantdfm, to = "stm")
str(stmdfm)
# illustrate what happens with zero-length documents
quantdfm2 <- dfm(c(punctOnly = "!!!", mycorpus[-1]), verbose = FALSE)
rowSums(quantdfm2)
stmdfm2 <- convert(quantdfm2, to = "stm", docvars = docvars(mycorpus))
str(stmdfm2)
 
## Not run: 
# tm's DocumentTermMatrix format
tmdfm <- convert(quantdfm, to = "tm")
str(tmdfm)

# topicmodels package format
str(convert(quantdfm, to = "topicmodels"))

# lda package format
ldadfm <- convert(quantdfm, to = "lda")
str(ldadfm)

## End(Not run)

Example output

quanteda version 0.99
Using 2 of 1 threads for parallel computing

Attaching package: 'quanteda'

The following object is masked from 'package:utils':

    View

[1] TRUE
List of 3
 $ documents:List of 12
  ..$ 1973-Nixon  : int [1:2, 1:515] 2 2 6 96 7 34 8 69 16 1 ...
  ..$ 1977-Carter : int [1:2, 1:501] 2 4 4 1 5 1 6 65 7 18 ...
  ..$ 1981-Reagan : int [1:2, 1:850] 2 20 6 174 7 19 8 130 15 1 ...
  ..$ 1985-Reagan : int [1:2, 1:876] 2 6 3 1 4 1 5 1 6 177 ...
  ..$ 1989-Bush   : int [1:2, 1:756] 2 6 6 166 7 15 8 142 21 2 ...
  ..$ 1993-Clinton: int [1:2, 1:605] 2 4 6 139 8 81 25 1 36 5 ...
  ..$ 1997-Clinton: int [1:2, 1:726] 2 4 6 131 7 26 8 108 14 1 ...
  ..$ 2001-Bush   : int [1:2, 1:592] 2 2 6 110 7 4 8 96 26 1 ...
  ..$ 2005-Bush   : int [1:2, 1:735] 2 6 3 1 6 120 7 2 8 98 ...
  ..$ 2009-Obama  : int [1:2, 1:900] 1 1 2 2 6 130 7 44 8 118 ...
  ..$ 2013-Obama  : int [1:2, 1:786] 6 99 7 13 8 89 10 1 21 1 ...
  ..$ 2017-Trump  : int [1:2, 1:547] 2 2 3 1 6 96 7 11 8 88 ...
 $ vocab    : chr [1:3462] "!" "\"" "'" "(" ...
 $ meta     :'data.frame':	12 obs. of  3 variables:
  ..$ Year     : num [1:12] 1973 1977 1981 1985 1989 ...
  ..$ President: chr [1:12] "Nixon" "Carter" "Reagan" "Reagan" ...
  ..$ FirstName: chr [1:12] "Richard Milhous" "Jimmy" "Ronald" "Ronald" ...
   punctOnly  1977-Carter  1981-Reagan  1985-Reagan    1989-Bush 1993-Clinton 
           3         1376         2790         2921         2681         1833 
1997-Clinton    2001-Bush    2005-Bush   2009-Obama   2013-Obama   2017-Trump 
        2449         1808         2319         2711         2317         1660 
List of 3
 $ documents:List of 12
  ..$ punctOnly   : int [1:2, 1] 1 3
  ..$ 1977-Carter : int [1:2, 1:501] 2 4 4 1 5 1 6 65 7 18 ...
  ..$ 1981-Reagan : int [1:2, 1:850] 2 20 6 174 7 19 8 130 15 1 ...
  ..$ 1985-Reagan : int [1:2, 1:876] 2 6 3 1 4 1 5 1 6 177 ...
  ..$ 1989-Bush   : int [1:2, 1:756] 2 6 6 166 7 15 8 142 20 2 ...
  ..$ 1993-Clinton: int [1:2, 1:605] 2 4 6 139 8 81 23 1 34 5 ...
  ..$ 1997-Clinton: int [1:2, 1:726] 2 4 6 131 7 26 8 108 14 1 ...
  ..$ 2001-Bush   : int [1:2, 1:592] 2 2 6 110 7 4 8 96 24 1 ...
  ..$ 2005-Bush   : int [1:2, 1:735] 2 6 3 1 6 120 7 2 8 98 ...
  ..$ 2009-Obama  : int [1:2, 1:900] 1 1 2 2 6 130 7 44 8 118 ...
  ..$ 2013-Obama  : int [1:2, 1:786] 6 99 7 13 8 89 10 1 20 1 ...
  ..$ 2017-Trump  : int [1:2, 1:547] 2 2 3 1 6 96 7 11 8 88 ...
 $ vocab    : chr [1:3376] "!" "\"" "'" "(" ...
 $ meta     :'data.frame':	12 obs. of  3 variables:
  ..$ Year     : num [1:12] 1973 1977 1981 1985 1989 ...
  ..$ President: chr [1:12] "Nixon" "Carter" "Reagan" "Reagan" ...
  ..$ FirstName: chr [1:12] "Richard Milhous" "Jimmy" "Ronald" "Ronald" ...
List of 6
 $ i       : int [1:8389] 1 3 5 9 11 1 2 3 4 5 ...
 $ j       : int [1:8389] 1 1 1 1 1 2 2 2 2 2 ...
 $ v       : num [1:8389] 3 3 6 1 1 69 52 130 124 142 ...
 $ nrow    : int 12
 $ ncol    : int 3462
 $ dimnames:List of 2
  ..$ Docs : chr [1:12] "1973-Nixon" "1977-Carter" "1981-Reagan" "1985-Reagan" ...
  ..$ Terms: chr [1:3462] "mr" "." "vice" "president" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
List of 6
 $ i       : int [1:8389] 1 1 1 1 1 1 1 1 1 1 ...
 $ j       : int [1:8389] 1 2 3 4 5 6 7 8 9 10 ...
 $ v       : int [1:8389] 3 69 1 1 96 1 1 1 1 1 ...
 $ nrow    : int 12
 $ ncol    : int 3462
 $ dimnames:List of 2
  ..$ Docs : chr [1:12] "1973-Nixon" "1977-Carter" "1981-Reagan" "1985-Reagan" ...
  ..$ Terms: chr [1:3462] "mr" "." "vice" "president" ...
 - attr(*, "class")= chr [1:2] "DocumentTermMatrix" "simple_triplet_matrix"
 - attr(*, "weighting")= chr [1:2] "term frequency" "tf"
List of 2
 $ documents:List of 12
  ..$ 1973-Nixon  : int [1:2, 1:515] 0 3 1 69 2 1 3 1 4 96 ...
  ..$ 1977-Carter : int [1:2, 1:501] 1 52 3 3 4 65 7 1 12 48 ...
  ..$ 1981-Reagan : int [1:2, 1:850] 0 3 1 130 2 2 3 5 4 174 ...
  ..$ 1985-Reagan : int [1:2, 1:876] 1 124 2 1 3 3 4 177 5 1 ...
  ..$ 1989-Bush   : int [1:2, 1:756] 0 6 1 142 2 1 3 6 4 166 ...
  ..$ 1993-Clinton: int [1:2, 1:605] 1 81 3 2 4 139 12 66 13 7 ...
  ..$ 1997-Clinton: int [1:2, 1:726] 1 108 3 1 4 131 7 1 12 94 ...
  ..$ 2001-Bush   : int [1:2, 1:592] 1 96 2 1 3 3 4 110 7 3 ...
  ..$ 2005-Bush   : int [1:2, 1:735] 0 1 1 98 2 1 3 4 4 120 ...
  ..$ 2009-Obama  : int [1:2, 1:900] 1 118 3 1 4 130 12 111 13 2 ...
  ..$ 2013-Obama  : int [1:2, 1:786] 0 1 1 89 2 1 3 2 4 99 ...
  ..$ 2017-Trump  : int [1:2, 1:547] 1 88 3 5 4 96 6 1 7 1 ...
 $ vocab    : chr [1:3462] "mr" "." "vice" "president" ...

quanteda documentation built on April 16, 2018, 1:04 a.m.