left_join.corpus: Join corpus with a data frame

View source: R/left_join.R

left_join.corpus    R Documentation

Description

left_join() adds columns from the data frame y to the document variables of the corpus x, matching rows of y to documents by document variables or, optionally, by document name. It is a mutating join: every document in x is kept, and matching values from y are added as new document variables. Documents in x with no match in y get NA in the new columns.

Usage

## S3 method for class 'corpus'
left_join(
  x,
  y,
  by = NULL,
  copy = FALSE,
  suffix = c(".x", ".y"),
  ...,
  keep = NULL
)

Arguments

x

a quanteda corpus object

y

a data frame or tibble to join

by

a join specification. See dplyr::left_join() for details. Defaults to NULL, which performs a natural join on all variables with common names in x and y. Use "docname" to join on document names (see Special handling of "docname").

copy

logical; if y is not a data frame or tibble, should it be copied so that it can be joined? See the copy argument of dplyr::left_join().

suffix

a character vector of length 2; if x and y contain non-joined variables with the same name, these suffixes are appended to the names to disambiguate them

...

other arguments passed to dplyr::left_join()

keep

should the join keys from both x and y be preserved in the output? See the keep argument of dplyr::left_join().
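The effect of suffix and keep is easiest to see on plain data frames. The sketch below uses dplyr::left_join() directly with hypothetical tables x_df and y_df (these names are not part of the package); since the corpus method is documented as passing its arguments on to dplyr::left_join(), the document variables of the result are assumed to be named in the same way.

```r
library(dplyr)

# Hypothetical data: both tables carry a non-key column called "score"
x_df <- data.frame(id = c("a", "b", "c"), score = c(1, 2, 3))
y_df <- data.frame(id = c("a", "b"), score = c(10, 20))

# Default suffixes disambiguate the clashing non-key column
res_suffix <- left_join(x_df, y_df, by = "id")
names(res_suffix)
# "id" "score.x" "score.y"

# Custom suffixes give more descriptive names
res_custom <- left_join(x_df, y_df, by = "id", suffix = c("_old", "_new"))
names(res_custom)
# "id" "score_old" "score_new"

# keep = TRUE retains the key column from both inputs (id.x and id.y)
res_keep <- left_join(x_df, y_df, by = "id", keep = TRUE)
names(res_keep)
```

With keep = TRUE, the y-side key is NA for rows of x that found no match, which makes unmatched rows easy to spot.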

Value

a corpus containing all documents of x, with document variables from both x and y

Special handling of "docname"

This function provides special handling for joining on document names:

  • If by = "docname" (or "docname" appears in the by vector), the function will use docnames(x) as the joining column from the corpus, even if "docname" is not a document variable.

  • If using join_by(docname == other_col), the function will match docnames(x) to other_col in y.

  • If "docname" exists as an actual document variable in x, that variable takes precedence and is used instead of docnames(x).
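The rules above amount to treating the document names as an ordinary key column. The following is a minimal sketch of that idea using only quanteda and dplyr on a plain data frame; it is an illustration under that assumption, not the package's actual implementation, and the names lookup, dv and joined are hypothetical.

```r
library(quanteda)
library(dplyr)

corp <- data_corpus_inaugural[1:3]

# Hypothetical lookup table keyed by a column that is not called "docname"
lookup <- data.frame(
  doc_id = c("1789-Washington", "1797-Adams"),
  rating = c(5, 4)
)

# Step 1: expose the document names as a column next to the docvars
dv <- data.frame(docname = docnames(corp), docvars(corp))

# Step 2: ordinary dplyr join, matching docname in dv to doc_id in lookup
joined <- left_join(dv, lookup, by = join_by(docname == doc_id))

# Step 3: write the new column back as a document variable;
# left_join() preserves the row order of dv, so rows still align with corp
docvars(corp, "rating") <- joined$rating
```

The second document ("1793-Washington") has no row in lookup, so its rating is NA after the join.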

Examples

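# Load the packages used below (dplyr is attached here to make %>% available)
library(quanteda)
library(quanteda.tidy)
library(dplyr)

# Create example corpus and data
corp <- data_corpus_inaugural[1:5]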

# Create data to join with document names
doc_data <- data.frame(
  docname = c("1789-Washington", "1793-Washington", "1797-Adams"),
  century = c(18, 18, 18),
  speech_number = c(1, 2, 1)
)

# Join using docname - matches docnames(corp) to doc_data$docname
left_join(corp, doc_data, by = "docname") %>%
  summary()

# Join using different column names with named vector
doc_data2 <- data.frame(
  doc_id = c("1789-Washington", "1793-Washington"),
  rating = c(5, 4)
)
left_join(corp, doc_data2, by = c("docname" = "doc_id")) %>%
  summary()

# Regular join on existing docvars
year_info <- data.frame(
  Year = c(1789, 1793, 1797, 1801, 1805),
  decade = c("1780s", "1790s", "1790s", "1800s", "1800s")
)
left_join(corp, year_info, by = "Year") %>%
  summary()
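
Because unmatched documents are kept with NA in the new columns, it can be useful to list them after a join. The sketch below assumes the behaviour described in the Description and uses quanteda's docvars() and docnames() to find documents without a match; the object names joined and unmatched are illustrative only.

```r
library(quanteda)
library(quanteda.tidy)

corp <- data_corpus_inaugural[1:5]
doc_data <- data.frame(
  docname = c("1789-Washington", "1793-Washington", "1797-Adams"),
  century = c(18, 18, 18)
)

joined <- left_join(corp, doc_data, by = "docname")

# The join keeps every document of the original corpus
ndoc(joined)

# Documents whose new column is NA had no matching row in doc_data
unmatched <- docnames(joined)[is.na(docvars(joined, "century"))]
unmatched
```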


quanteda.tidy documentation built on Dec. 17, 2025, 5:09 p.m.