convert: Convert quanteda objects to non-quanteda formats
In quanteda: Quantitative Analysis of Textual Data


Description

Convert a quanteda dfm or corpus object to a format usable by other packages. For a dfm, convert provides easy conversion to the document-term representations used by the other text analysis packages for which conversions are defined. For a corpus, convert provides an easy way to turn the corpus and its document variables into a data.frame or JSON.

Usage

convert(x, to, ...)

## S3 method for class 'dfm'
convert(
  x,
  to = c("lda", "tm", "stm", "austin", "topicmodels", "lsa", "matrix", "data.frame",
    "tripletlist"),
  docvars = NULL,
  omit_empty = TRUE,
  docid_field = "doc_id",
  ...
)

## S3 method for class 'corpus'
convert(x, to = c("data.frame", "json"), pretty = FALSE, ...)

Arguments

`x`	a dfm or corpus to be converted
`to`	target conversion format, one of:
	`"lda"`	a list with components "documents" and "vocab" as needed by the function lda.collapsed.gibbs.sampler from the lda package
	`"tm"`	a DocumentTermMatrix from the tm package. Note: The tm package version of `as.TermDocumentMatrix()` allows a `weighting` argument, which supplies a weighting function for `TermDocumentMatrix()`. Here the default is term frequency weighting. If you want a different weighting, apply the weights after converting, using one of the tm functions. For other available weighting functions from the tm package, see TermDocumentMatrix.
	`"stm"`	the format for the stm package
	`"austin"`	the `wfm` format from the austin package
	`"topicmodels"`	the "dtm" format as used by the topicmodels package
	`"lsa"`	the "textmatrix" format as used by the lsa package
	`"data.frame"`	a data.frame without row.names, in which documents are rows and each feature is a variable (for a dfm), or each text and its document variables form a row (for a corpus)
	`"json"`	(corpus only) convert a corpus and its document variables into JSON format, using the format described in jsonlite::toJSON()
	`"tripletlist"`	a named "triplet" format list consisting of `document`, `feature`, and `frequency`
`...`	unused directly
`docvars`	optional data.frame of document variables used as the `meta` information in conversion to the stm package format. This aids in selecting the document variables only corresponding to the documents with non-zero counts. Only affects the "stm" format.
`omit_empty`	logical; if `TRUE`, omit empty documents and features from the converted dfm. This is required for some formats (such as STM) that do not accept empty documents. Only used when `to = "lda"` or `to = "topicmodels"`. For `to = "stm"` format, `omit_empty` is always `TRUE`.
`docid_field`	character; the name of the column containing document names used when `to = "data.frame"`. Unused for other conversions.
`pretty`	adds indentation whitespace to JSON output. Can be TRUE/FALSE or a number specifying the number of spaces to indent (default is 2). Use a negative number for tabs instead of spaces.
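
As an illustration of the `docid_field` argument, here is a minimal sketch using a small made-up dfm (the object names `dfmat`, `df_default`, and `df_custom` are hypothetical and not part of the package). It assumes the document names end up in the first column of the returned data.frame.

```r
library(quanteda)

# Two tiny documents; the names d1 and d2 become the document identifiers
toks <- tokens(c(d1 = "a b c", d2 = "a a b"))
dfmat <- dfm(toks)

# Default: document names are stored in a column called "doc_id"
df_default <- convert(dfmat, to = "data.frame")
names(df_default)

# Custom name for the identifier column
df_custom <- convert(dfmat, to = "data.frame", docid_field = "document")
names(df_custom)
```

Renaming the identifier column can help when a feature in the dfm would otherwise clash with the default column name.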

Value

A converted object determined by the value of `to` (see above). See the documentation of the conversion target package for more detailed descriptions of the return formats.
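
To make the dependence of the return value on `to` concrete, the following sketch inspects two targets that need no additional packages, `"matrix"` and `"tripletlist"`. The input dfm and the names `mat` and `trip` are made up for illustration.

```r
library(quanteda)

# Small illustrative dfm: 2 documents, 3 features
dfmat <- dfm(tokens(c(d1 = "a b c", d2 = "a a b")))

# "matrix" gives a base R matrix with the same dimensions as the dfm
mat <- convert(dfmat, to = "matrix")
is.matrix(mat)
dim(mat)

# "tripletlist" gives a named list with one entry per non-zero cell
trip <- convert(dfmat, to = "tripletlist")
names(trip)
str(trip)
```

The triplet form stores only the non-zero counts, so its vectors are as long as the number of non-zero cells in the dfm.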

Examples

## convert a dfm

toks <- corpus_subset(data_corpus_inaugural, Year > 1970) |>
    tokens()
dfmat1 <- dfm(toks)

# austin's wfm format
identical(dim(dfmat1), dim(convert(dfmat1, to = "austin")))

# stm package format
stmmat <- convert(dfmat1, to = "stm")
str(stmmat)

# triplet
tripletmat <- convert(dfmat1, to = "tripletlist")
str(tripletmat)

## Not run: 
# tm's DocumentTermMatrix format
tmdfm <- convert(dfmat1, to = "tm")
str(tmdfm)

# topicmodels package format
str(convert(dfmat1, to = "topicmodels"))

# lda package format
str(convert(dfmat1, to = "lda"))

## End(Not run)

## convert a corpus into a data.frame

corp <- corpus(c(d1 = "Text one.", d2 = "Text two."),
               docvars = data.frame(dvar1 = 1:2, dvar2 = c("one", "two"),
                                    stringsAsFactors = FALSE))
convert(corp, to = "data.frame")
convert(corp, to = "json")
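
The `pretty` argument can be sketched along the same lines. This example assumes the jsonlite package is installed. Reading the result back with `jsonlite::fromJSON()` is an illustrative check, not something the function does itself, and the names `js` and `back` are hypothetical.

```r
library(quanteda)

corp <- corpus(c(d1 = "Text one.", d2 = "Text two."),
               docvars = data.frame(dvar1 = 1:2, dvar2 = c("one", "two"),
                                    stringsAsFactors = FALSE))

# Indented JSON output, easier to read than the default compact form
js <- convert(corp, to = "json", pretty = TRUE)
cat(js, "\n")

# Parse the JSON back into a data.frame to inspect its structure
back <- jsonlite::fromJSON(js)
str(back)
```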

quanteda documentation built on June 8, 2025, 9:41 p.m.