dfm_group: Combine documents in a dfm by a grouping variable
In quanteda: Quantitative Analysis of Textual Data

Description

Combine documents in a dfm by a grouping variable, summing the cell frequencies within each group and creating new "documents" named by the group labels.
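
As a minimal sketch of what summing the cell frequencies within a group means, the following builds a small dfm and inspects two of the grouped counts. The object names (`corp`, `dfmat`, `grouped`, `m`) are illustrative only.

```r
library(quanteda)

# Four short documents; the first two belong to "grp1", the last two to "grp2"
corp <- corpus(c("a a b", "a b c c", "a c d d", "a c c d"),
               docvars = data.frame(grp = c("grp1", "grp1", "grp2", "grp2")))
dfmat <- dfm(tokens(corp))

# One row per group; each cell is the sum of that feature over the group's documents
grouped <- dfm_group(dfmat, groups = grp)

# Convert to a dense matrix to inspect individual summed counts
m <- as.matrix(grouped)
m["grp1", "a"]  # 3: "a a b" contributes 2 and "a b c c" contributes 1
m["grp2", "d"]  # 3: "a c d d" contributes 2 and "a c c d" contributes 1
```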

Usage

dfm_group(
  x,
  groups = docid(x),
  fill = FALSE,
  force = FALSE,
  verbose = quanteda_options("verbose")
)

Arguments

`x`	a dfm
`groups`	grouping variable, equal in length to the number of documents. This will be evaluated in the docvars data.frame, so that docvars may be referred to by name without quoting. This differs from how `groups` behaved in earlier versions. See `news(Version >= "3.0", package = "quanteda")` for details.
`fill`	logical; if `TRUE` and `groups` is a factor, then use all levels of the factor when forming the new documents of the grouped object. This creates a new "document" with empty content for each level that was not observed but for which an empty document may still be needed. If `groups` is a factor of dates, for instance, then `fill = TRUE` ensures that the new object will consist of one new "document" per date, regardless of whether any documents previously existed with that date. Has no effect if the `groups` variable(s) are not factors.
`force`	logical; if `TRUE`, group by summing existing counts, even if the dfm has been weighted. This can result in invalid sums, such as adding log counts (when a dfm has been weighted by `"logcount"`, for instance, using `dfm_weight()`). Not needed when the term weighting scheme is `"count"` or `"prop"`.
`verbose`	logical; if `TRUE`, print the number of tokens and documents before and after the function is applied. The number of tokens does not include padding.
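
The role of `force` can be illustrated with a log-weighted dfm. This is only a sketch: the object names are illustrative, and the default call is wrapped in `tryCatch()` so that the script keeps running whether or not that call is refused.

```r
library(quanteda)

corp <- corpus(c("a a b", "a b c c", "a c d d", "a c c d"),
               docvars = data.frame(grp = c("grp1", "grp1", "grp2", "grp2")))
dfmat <- dfm(tokens(corp))

# Log-scaled counts are not additive, so summing them by group is questionable
dfmat_log <- dfm_weight(dfmat, scheme = "logcount")

# Try the default call; if an error is raised, keep its message instead of stopping
attempt <- tryCatch(dfm_group(dfmat_log, groups = grp),
                    error = function(e) conditionMessage(e))

# force = TRUE sums the weighted values as they are; interpret the result with care
forced <- dfm_group(dfmat_log, groups = grp, force = TRUE)
ndoc(forced)  # 2, one row per group
```

Where additive counts are wanted, it is usually safer to group the unweighted dfm first and apply `dfm_weight()` afterwards.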

Value

`dfm_group` returns a dfm whose documents are equal to the unique group combinations, and whose cell values are the previous values summed by group. Document-level variables that have no variation within groups are retained in the docvars. Document-level variables that are lists are dropped during grouping, even when these exhibit no variation within groups.
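
The handling of document-level variables described above can be checked with a small sketch. The docvars `year` and `page` are hypothetical, chosen so that one is constant within each group and the other is not.

```r
library(quanteda)

corp <- corpus(c("a a b", "a b c c", "a c d d", "a c c d"),
               docvars = data.frame(
                   grp  = c("grp1", "grp1", "grp2", "grp2"),
                   year = c(2001, 2001, 2002, 2002),  # constant within each group
                   page = 1:4                         # varies within each group
               ))
grouped <- dfm_group(dfm(tokens(corp)), groups = grp)

# The new documents take the group labels as their names
docnames(grouped)

# Per the description above, "year" should be kept and "page" should be dropped
names(docvars(grouped))
```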

Examples

corp <- corpus(c("a a b", "a b c c", "a c d d", "a c c d"),
               docvars = data.frame(grp = c("grp1", "grp1", "grp2", "grp2")))
dfmat <- dfm(tokens(corp))
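# group by a docvar, referred to by name without quoting
dfm_group(dfmat, groups = grp)
# group by an external vector with one value per document
dfm_group(dfmat, groups = c(1, 1, 2, 2))

# with fill = TRUE, the unobserved factor level "D" becomes an empty document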
dfm_group(dfmat, fill = TRUE,
          groups = factor(c("A", "A", "B", "C"), levels = LETTERS[1:4]))

quanteda documentation built on June 8, 2025, 9:41 p.m.