Getting started with quanteda.tidy

Introduction

quanteda.tidy extends the quanteda package with dplyr-style verbs for manipulating corpus objects. These functions operate on document variables (docvars) while preserving the text content and structure of quanteda objects.

Note that quanteda.tidy is very different from tidytext. Whereas tidytext converts text to data frames with one token per row, quanteda.tidy keeps your corpus intact and extends dplyr functions to work directly on quanteda objects.

library(quanteda.tidy)
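
As a quick check of the point about keeping the corpus intact, the following sketch applies one verb and inspects what comes back. It assumes the quanteda and quanteda.tidy packages are installed and uses only the built-in data_corpus_inaugural; the object name recent is illustrative.

```r
library(quanteda)
library(quanteda.tidy)

# Apply a dplyr-style verb directly to a corpus
recent <- filter(data_corpus_inaugural, Year >= 2001)

# The result should still be a quanteda corpus, not a token-per-row data frame
is.corpus(recent)

# Texts and docvars travel together in the subset
ndoc(recent)
names(docvars(recent))
```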

Overview of Functions

The functions in quanteda.tidy are organized into four categories, following the dplyr documentation:

func_table <- data.frame(
  Category = c(
    rep("Rows", 5),
    rep("Columns", 6),
    rep("Groups of rows", 2),
    "Pairs of data frames"
  ),
  Function = c(
    # Rows
    "`filter()`", "`slice()`, `slice_head()`, `slice_tail()`",
    "`slice_sample()`", "`slice_min()`, `slice_max()`", "`arrange()`, `distinct()`",
    # Columns
    "`select()`", "`rename()`, `rename_with()`", "`relocate()`",
    "`mutate()`, `transmute()`", "`pull()`", "`glimpse()`",
    # Groups
    "`add_count()`", "`add_tally()`",
    # Pairs
    "`left_join()`"
  ),
  Description = c(
    # Rows
    "Subset documents based on docvar conditions",
    "Subset documents by position",
    "Randomly sample documents",
    "Select documents with min/max docvar values",
    "Reorder documents; keep unique documents",
    # Columns
    "Keep or drop docvars by name",
    "Rename docvars",
    "Change docvar column order",
    "Create or modify docvars",
    "Extract a single docvar as a vector",
    "Get a quick overview of the corpus",
    # Groups
    "Add count by group as a docvar",
    "Add total count as a docvar",
    # Pairs
    "Join corpus with external data frame"
  )
)
knitr::kable(func_table, caption = "quanteda.tidy functions by category")

Verbs That Operate on Rows

These functions subset, reorder, or select documents based on their document variables or positions.

Filtering documents

Use filter() to keep documents that match specified conditions:

# Keep speeches by presidents named Roosevelt (Theodore and Franklin D.)
data_corpus_inaugural %>%
  filter(President == "Roosevelt") %>%
  summary()
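
For readers coming from base quanteda, the sketch below compares filter() with corpus_subset(), which takes the same kind of condition. It is an illustration rather than a statement about how filter() is implemented; the names tidy_way and base_way are made up for the example.

```r
library(quanteda)
library(quanteda.tidy)

tidy_way <- filter(data_corpus_inaugural, President == "Roosevelt")
base_way <- corpus_subset(data_corpus_inaugural, President == "Roosevelt")

# Both approaches should select the same documents
identical(docnames(tidy_way), docnames(base_way))

# President stores last names only, so both Roosevelts match;
# the FirstName docvar tells them apart
table(docvars(tidy_way, "FirstName"))
```

To keep only one of the two presidents, add a second condition on FirstName to the filter() call.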

Slicing documents by position

Use slice() and its variants to select documents by position:

# First 3 documents
slice(data_corpus_inaugural, 1:3)

# First 10%
slice_head(data_corpus_inaugural, prop = 0.10)

# Last 3 documents
slice_tail(data_corpus_inaugural, n = 3)

Use slice_sample() to draw a random sample of documents:

set.seed(42)
slice_sample(data_corpus_inaugural, n = 5)

Use slice_min() and slice_max() to select the documents with the smallest or largest values of a docvar:

# Add token counts first
corp <- data_corpus_inaugural %>%
  mutate(n_tokens = ntoken(data_corpus_inaugural))

# Shortest speeches
slice_min(corp, n_tokens, n = 3)

# Longest speeches
slice_max(corp, n_tokens, n = 3)
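
One way to sanity-check the selection is to compare it with the raw token counts. The sketch below does this for the single longest speech; the names lens and longest are illustrative, and the comparison allows for ties because dplyr's slice_max() keeps tied rows by default.

```r
library(quanteda)
library(quanteda.tidy)

# Named vector of token counts, one element per document
lens <- ntoken(data_corpus_inaugural)

corp <- mutate(data_corpus_inaugural, n_tokens = lens)
longest <- slice_max(corp, n_tokens, n = 1)

# Document(s) chosen by slice_max()
docnames(longest)

# Document(s) with the maximum count, computed directly
names(lens)[lens == max(lens)]
```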

Arranging documents

Use arrange() to reorder documents:

# Sort alphabetically by president
data_corpus_inaugural[1:5] %>%
  arrange(President)

# Sort by year descending
data_corpus_inaugural[1:5] %>%
  arrange(desc(Year))

Keeping distinct documents

Use distinct() to keep only unique combinations of docvar values:

# Keep the first document for each value of President (a last name)
data_corpus_inaugural %>%
  distinct(President, .keep_all = TRUE) %>%
  summary(n = 10)
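
Because President holds last names only, presidents who share a last name are treated as one value. A minimal sketch of the difference, assuming distinct() accepts several docvars in the same way as the dplyr data frame method (object names are illustrative):

```r
library(quanteda)
library(quanteda.tidy)

# One document per last name
by_last_name <- distinct(data_corpus_inaugural, President, .keep_all = TRUE)

# One document per (last name, first name) combination
by_full_name <- distinct(data_corpus_inaugural, President, FirstName,
                         .keep_all = TRUE)

ndoc(by_last_name)
ndoc(by_full_name)
```

The second count should be larger, since pairs such as the two Roosevelts are kept separately.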

Verbs That Operate on Columns

These functions create, modify, rename, reorder, or select document variables.

Selecting docvars

Use select() to keep or drop docvars:

data_corpus_inaugural %>%
  select(President, Year) %>%
  summary(n = 5)
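
The example above keeps docvars by name. Dropping works the other way round; the sketch below assumes that select() on a corpus accepts the same negative selection syntax as dplyr's data frame method, and the name no_party is illustrative.

```r
library(quanteda)
library(quanteda.tidy)

# Drop Party and keep every other docvar
no_party <- select(data_corpus_inaugural, -Party)

names(docvars(no_party))

# Only docvars are affected; the documents themselves are untouched
ndoc(no_party) == ndoc(data_corpus_inaugural)
```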

Renaming docvars

Use rename() for direct renaming:

data_corpus_inaugural %>%
  rename(LastName = President, Given = FirstName) %>%
  summary(n = 5)

Use rename_with() to rename using a function:

data_corpus_inaugural %>%
  rename_with(toupper) %>%
  summary(n = 5)

Relocating docvars

Use relocate() to change column order:

data_corpus_inaugural %>%
  relocate(Party, President) %>%
  summary(n = 5)

Creating and modifying docvars

Use mutate() to add new docvars or modify existing ones:

data_corpus_inaugural %>%
  mutate(
    fullname = paste(FirstName, President, sep = " "),
    century = floor(Year / 100) + 1
  ) %>%
  summary(n = 5)
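
For comparison, here is a sketch of the same century variable created with base quanteda's docvars() replacement function. It is meant to show what mutate() saves you from writing, not how mutate() works internally; tidy_corp and base_corp are illustrative names.

```r
library(quanteda)
library(quanteda.tidy)

# dplyr-style: docvars are referred to by bare name
tidy_corp <- mutate(data_corpus_inaugural, century = floor(Year / 100) + 1)

# Base quanteda: extract, compute, and assign back explicitly
base_corp <- data_corpus_inaugural
docvars(base_corp, "century") <- floor(docvars(base_corp, "Year") / 100) + 1

# Both routes should produce the same values
all.equal(docvars(tidy_corp, "century"), docvars(base_corp, "century"))
```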

Use transmute() to create new docvars and drop all others:

data_corpus_inaugural %>%
  transmute(
    speech_id = paste(Year, President, sep = "-"),
    party = Party
  ) %>%
  summary(n = 5)

Extracting docvars

Use pull() to extract a single docvar as a vector:

data_corpus_inaugural %>%
  filter(Year >= 2000) %>%
  pull(President)
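
The result of pull() is a plain vector rather than a corpus, so it usually ends a pipeline. As a hedged cross-check, the sketch below compares it with base quanteda's docvars() called with a single field name; the object names are illustrative.

```r
library(quanteda)
library(quanteda.tidy)

recent <- filter(data_corpus_inaugural, Year >= 2000)

from_pull <- pull(recent, President)
from_docvars <- docvars(recent, "President")

# A plain vector with one element per document
is.corpus(from_pull)
length(from_pull) == ndoc(recent)

# Same values as the base quanteda accessor
all(from_pull == from_docvars)
```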

Getting an overview

Use glimpse() (from tibble) to see a compact summary:

glimpse(data_corpus_inaugural)

Verbs That Operate on Groups of Rows

These functions add docvars whose values are computed over groups of documents.

Counting observations

Use add_count() to add a count variable by group:

# Count speeches per president
data_corpus_inaugural %>%
  add_count(President, name = "n_speeches") %>%
  filter(n_speeches > 1) %>%
  summary(n = 10)
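
Unlike a grouped summary, add_count() keeps every document and repeats the group size on each one. The sketch below checks this against base R's table(); the names counted, dv, and tab are illustrative.

```r
library(quanteda)
library(quanteda.tidy)

counted <- add_count(data_corpus_inaugural, President, name = "n_speeches")
dv <- docvars(counted)

# Group sizes computed independently with base R
tab <- table(dv$President)

# Each document should carry the size of its own group
all(dv$n_speeches == as.integer(tab[as.character(dv$President)]))

# No documents are dropped by add_count() itself
ndoc(counted) == ndoc(data_corpus_inaugural)
```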

Use add_tally() to add the total count:

data_corpus_inaugural %>%
  slice(1:5) %>%
  add_tally() %>%
  summary()

Verbs That Operate on Pairs of Data Frames

These functions combine a corpus with an external data frame.

Joining with external data

Use left_join() to add columns from a data frame to your corpus:

# Create some external data
party_colors <- data.frame(
  Party = c("Democratic", "Republican", "none", "Federalist",
            "Democratic-Republican", "Whig"),
  color = c("blue", "red", "gray", "purple", "green", "orange")
)

# Join to corpus
data_corpus_inaugural %>%
  left_join(party_colors, by = "Party") %>%
  summary(n = 10)
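
The lookup table above covers every party in the corpus. The sketch below uses a deliberately incomplete table to show the usual left join behaviour, assuming the corpus method follows dplyr here: documents without a match are kept and receive NA. The name partial_colors is made up for the example.

```r
library(quanteda)
library(quanteda.tidy)

# Lookup table that covers only two of the parties
partial_colors <- data.frame(
  Party = c("Democratic", "Republican"),
  color = c("blue", "red")
)

joined <- left_join(data_corpus_inaugural, partial_colors, by = "Party")

# Every document is retained
ndoc(joined) == ndoc(data_corpus_inaugural)

# Unmatched parties show up as NA in the new docvar
table(docvars(joined, "color"), useNA = "ifany")
```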

Special handling of document names

left_join() provides special handling for joining on document names. Use "docname" in the by argument to match on document names even when "docname" is not a docvar:

# Create data with document name as key
doc_metadata <- data.frame(
  docname = c("1789-Washington", "1793-Washington", "1797-Adams"),
  notes = c("First inaugural", "Second inaugural", "First Adams speech")
)

# Join using docname
data_corpus_inaugural[1:5] %>%
  left_join(doc_metadata, by = "docname") %>%
  summary()

You can also match document names against a column with a different name:

doc_metadata2 <- data.frame(
  doc_id = c("1789-Washington", "1793-Washington"),
  rating = c(5, 4)
)

data_corpus_inaugural[1:5] %>%
  left_join(doc_metadata2, by = c("docname" = "doc_id")) %>%
  summary()
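
Matching is done on the document names themselves, not on row position. To illustrate, the sketch below builds a lookup table in reverse order and checks that each value still lands on the right document; the column names doc_id and position are hypothetical.

```r
library(quanteda)
library(quanteda.tidy)

small <- data_corpus_inaugural[1:5]

# Lookup rows are deliberately listed in reverse document order
lookup <- data.frame(
  doc_id = rev(docnames(small)),
  position = rev(seq_len(ndoc(small)))
)

joined <- left_join(small, lookup, by = c("docname" = "doc_id"))

# Values follow the document names, so this should read 1 to 5
docvars(joined, "position")

# Document names and order are unchanged by the join
identical(docnames(joined), docnames(small))
```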

Piping Operations

All quanteda.tidy functions work seamlessly with the pipe operator, allowing you to chain multiple operations:

data_corpus_inaugural %>%
  # Add metadata
  mutate(
    decade = floor(Year / 10) * 10,
    n_tokens = ntoken(data_corpus_inaugural)
  ) %>%
  # Filter to 20th century
  filter(Year >= 1900, Year < 2000) %>%
  # Keep only relevant columns
  select(President, Party, decade, n_tokens) %>%
  # Sort by speech length
  arrange(desc(n_tokens)) %>%
  summary(n = 10)
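
The pipeline above computes n_tokens from the full corpus, which lines up only because mutate() runs before filter(): the vector must have one element per document of the object being mutated. If you prefer to filter first, one option is to store the intermediate corpus and count its tokens directly, as in this sketch (c20 and result are illustrative names):

```r
library(quanteda)
library(quanteda.tidy)

# Filter first, keeping the intermediate corpus
c20 <- filter(data_corpus_inaugural, Year >= 1900, Year < 2000)

# Count tokens on the filtered corpus so the lengths match
c20 <- mutate(c20, n_tokens = ntoken(c20))

# Longest speeches first
result <- arrange(c20, desc(n_tokens))
summary(result, n = 10)
```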


quanteda.tidy documentation built on Dec. 17, 2025, 5:09 p.m.