quanteda.tidy extends the quanteda package with dplyr-style verbs for manipulating corpus objects. These functions operate on document variables (docvars) while preserving the text content and structure of quanteda objects.
Note that quanteda.tidy is very different from tidytext. While tidytext converts text to data frames with one token per row, quanteda.tidy keeps your corpus intact and extends dplyr functions to work directly with quanteda objects.
library(quanteda.tidy)
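As a minimal sketch of what "keeps your corpus intact" means in practice, the snippet below applies one dplyr verb and then inspects the result with ordinary quanteda accessors. The object name `recent` is illustrative only, and quanteda is attached explicitly here on the assumption that it may not be attached automatically.

```r
library(quanteda)
library(quanteda.tidy)

# Subset documents with a dplyr verb (recent is a hypothetical name)
recent <- filter(data_corpus_inaugural, Year >= 2000)

is.corpus(recent)        # is the result still a quanteda corpus?
ndoc(recent)             # number of documents kept
names(docvars(recent))   # docvars remain attached to the documents
```

Because the result is still a corpus, it can be passed on to `tokens()` or any other quanteda function without a conversion step.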
The functions in quanteda.tidy are organized into four categories, following the dplyr documentation:
func_table <- data.frame(
  Category = c(
    rep("Rows", 5),
    rep("Columns", 6),
    rep("Groups of rows", 2),
    "Pairs of data frames"
  ),
  Function = c(
    # Rows
    "`filter()`",
    "`slice()`, `slice_head()`, `slice_tail()`",
    "`slice_sample()`",
    "`slice_min()`, `slice_max()`",
    "`arrange()`, `distinct()`",
    # Columns
    "`select()`",
    "`rename()`, `rename_with()`",
    "`relocate()`",
    "`mutate()`, `transmute()`",
    "`pull()`",
    "`glimpse()`",
    # Groups
    "`add_count()`",
    "`add_tally()`",
    # Pairs
    "`left_join()`"
  ),
  Description = c(
    # Rows
    "Subset documents based on docvar conditions",
    "Subset documents by position",
    "Randomly sample documents",
    "Select documents with min/max docvar values",
    "Reorder documents; keep unique documents",
    # Columns
    "Keep or drop docvars by name",
    "Rename docvars",
    "Change docvar column order",
    "Create or modify docvars",
    "Extract a single docvar as a vector",
    "Get a quick overview of the corpus",
    # Groups
    "Add count by group as a docvar",
    "Add total count as a docvar",
    # Pairs
    "Join corpus with external data frame"
  )
)
knitr::kable(func_table, caption = "quanteda.tidy functions by category")
These functions subset, reorder, or select documents based on their document variables or positions.
Use filter() to keep documents that match specified conditions:
# Keep only Roosevelt's speeches
data_corpus_inaugural %>%
  filter(President == "Roosevelt") %>%
  summary()
Use slice() and its variants to select documents by position:
# First 3 documents
slice(data_corpus_inaugural, 1:3)

# First 10%
slice_head(data_corpus_inaugural, prop = 0.10)

# Last 3 documents
slice_tail(data_corpus_inaugural, n = 3)
Random sampling:
set.seed(42)
slice_sample(data_corpus_inaugural, n = 5)
Select by minimum or maximum values of a docvar:
# Add token counts first
corp <- data_corpus_inaugural %>%
  mutate(n_tokens = ntoken(data_corpus_inaugural))

# Shortest speeches
slice_min(corp, n_tokens, n = 3)

# Longest speeches
slice_max(corp, n_tokens, n = 3)
Use arrange() to reorder documents:
# Sort alphabetically by president
data_corpus_inaugural[1:5] %>%
  arrange(President)

# Sort by year descending
data_corpus_inaugural[1:5] %>%
  arrange(desc(Year))
Use distinct() to keep only unique combinations of docvar values:
# Keep first document for each president
data_corpus_inaugural %>%
  distinct(President, .keep_all = TRUE) %>%
  summary(n = 10)
These functions create, modify, rename, reorder, or select document variables.
Use select() to keep or drop docvars:
data_corpus_inaugural %>% select(President, Year) %>% summary(n = 5)
Use rename() for direct renaming:
data_corpus_inaugural %>% rename(LastName = President, Given = FirstName) %>% summary(n = 5)
Use rename_with() to rename using a function:
data_corpus_inaugural %>% rename_with(toupper) %>% summary(n = 5)
Use relocate() to change column order:
data_corpus_inaugural %>% relocate(Party, President) %>% summary(n = 5)
Use mutate() to add new docvars or modify existing ones:
data_corpus_inaugural %>%
  mutate(
    fullname = paste(FirstName, President, sep = " "),
    century = floor(Year / 100) + 1
  ) %>%
  summary(n = 5)
Use transmute() to create new docvars and drop all others:
data_corpus_inaugural %>%
  transmute(
    speech_id = paste(Year, President, sep = "-"),
    party = Party
  ) %>%
  summary(n = 5)
Use pull() to extract a single docvar as a vector:
data_corpus_inaugural %>% filter(Year >= 2000) %>% pull(President)
Use glimpse() (from tibble) to see a compact summary:
glimpse(data_corpus_inaugural)
These functions compute summaries or add variables based on groups.
Use add_count() to add a count variable by group:
# Count speeches per president
data_corpus_inaugural %>%
  add_count(President, name = "n_speeches") %>%
  filter(n_speeches > 1) %>%
  summary(n = 10)
Use add_tally() to add the total count:
data_corpus_inaugural %>% slice(1:5) %>% add_tally() %>% summary()
These functions combine a corpus with an external data frame.
Use left_join() to add columns from a data frame to your corpus:
# Create some external data
party_colors <- data.frame(
  Party = c("Democratic", "Republican", "none", "Federalist",
            "Democratic-Republican", "Whig"),
  color = c("blue", "red", "gray", "purple", "green", "orange")
)

# Join to corpus
data_corpus_inaugural %>%
  left_join(party_colors, by = "Party") %>%
  summary(n = 10)
left_join() provides special handling for joining on document names. Use
"docname" in the by argument to match on document names even when
"docname" is not a docvar:
# Create data with document name as key
doc_metadata <- data.frame(
  docname = c("1789-Washington", "1793-Washington", "1797-Adams"),
  notes = c("First inaugural", "Second inaugural", "First Adams speech")
)

# Join using docname
data_corpus_inaugural[1:5] %>%
  left_join(doc_metadata, by = "docname") %>%
  summary()
You can also match document names to a differently-named column:
doc_metadata2 <- data.frame(
  doc_id = c("1789-Washington", "1793-Washington"),
  rating = c(5, 4)
)

data_corpus_inaugural[1:5] %>%
  left_join(doc_metadata2, by = c("docname" = "doc_id")) %>%
  summary()
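When the external data frame covers only some documents, it is worth checking which documents went unmatched. The sketch below reuses the join on document names and inspects the result. The names `ratings` and `joined` are illustrative, and it assumes the usual left-join behaviour, in which every document is kept and unmatched documents receive `NA` in the new docvar. Verify this against your installed version rather than taking it as given.

```r
library(quanteda)
library(quanteda.tidy)

# Hypothetical lookup table covering only two of the five documents
ratings <- data.frame(
  doc_id = c("1789-Washington", "1793-Washington"),
  rating = c(5, 4)
)

joined <- left_join(
  data_corpus_inaugural[1:5],
  ratings,
  by = c("docname" = "doc_id")
)

# Compare the document count with the five input documents
ndoc(joined)

# List documents whose rating is missing after the join
docnames(joined)[is.na(docvars(joined, "rating"))]
```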
All quanteda.tidy functions work seamlessly with the pipe operator, allowing you to chain multiple operations:
data_corpus_inaugural %>%
  # Add metadata
  mutate(
    decade = floor(Year / 10) * 10,
    n_tokens = ntoken(data_corpus_inaugural)
  ) %>%
  # Filter to 20th century
  filter(Year >= 1900, Year < 2000) %>%
  # Keep only relevant columns
  select(President, Party, decade, n_tokens) %>%
  # Sort by speech length
  arrange(desc(n_tokens)) %>%
  summary(n = 10)
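Since the output of such a chain is still a corpus, it can feed a regular quanteda analysis. The following is a rough sketch of one possible continuation, not a recommended workflow. The object names `corp20` and `dfmat` are illustrative, and the preprocessing choices (removing punctuation and English stopwords) are assumptions made for the example.

```r
library(quanteda)
library(quanteda.tidy)

# Subset and trim docvars with dplyr verbs (corp20 is a hypothetical name)
corp20 <- data_corpus_inaugural %>%
  filter(Year >= 1900, Year < 2000) %>%
  select(President, Party)

# Continue with standard quanteda steps: tokenize, then build a dfm
dfmat <- dfm(tokens(corp20, remove_punct = TRUE))
dfmat <- dfm_remove(dfmat, stopwords("en"))

# Most frequent remaining features across the selected speeches
topfeatures(dfmat, n = 10)
```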