Text Mining with gutenbergr and tidytext

#| label: setup
#| include: false
knitr::opts_chunk$set(
  collapse = FALSE,
  comment = "#>",
  fig.width = 7,
  fig.height = 6,
  fig.path = "../man/figures/",
  warning = FALSE,
  message = FALSE
)

This vignette demonstrates a complete text mining workflow using gutenbergr and tidytext. We'll perform an in-depth analysis of Jane Austen's Persuasion, exploring its vocabulary, sentiment, structure, and themes. See Text Mining with R for a great introduction to text mining.

Required Libraries

#| label: windows-check
#| include: false
tryCatch(
  library(gutenbergr),
  error = function(e) {
    # Fallback for Windows check environments
    devtools::load_all("..")
  }
)

#| label: packages
library(dplyr)
library(tidytext)
library(ggplot2)
library(tidyr)
library(stringr)

Download the Book

First, let's find and download Jane Austen's Persuasion:

#| label: find-book
gutenberg_works(str_detect(title, "Persuasion"))

The search returns several works. The novel itself has Gutenberg ID 105:

#| label: download
#| eval: false
persuasion <- gutenberg_download(105, meta_fields = "title")

#| label: download-sample
#| echo: false
# For vignette building, use sample data
persuasion <- gutenbergr::sample_books |>
  filter(gutenberg_id == 105) |>
  select(gutenberg_id, text, title)

#| label: show-book
persuasion

Structural Analysis: Adding Chapters

Project Gutenberg texts are processed into tibbles with one row per line of text. To analyze the book's progression, we'll use gutenberg_add_sections(). This function identifies header lines and fills them down to create a structural column.

#| label: sections
persuasion <- persuasion |>
  gutenberg_add_sections(
    pattern = "^Chapter [IVXLCDM]+",
    section_col = "chapter",
    format_fn = function(x) {
      x |>
        # Strip the prefix regardless of case ("Chapter" or "CHAPTER")
        str_remove(regex("^chapter\\s+", ignore_case = TRUE)) |>
        str_remove("\\.$") |>
        as.roman() |>
        as.numeric()
    }
  )

# Preview the new structure
persuasion |>
  filter(!is.na(chapter)) |>
  head()
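
To make "identify headers and fill them down" concrete, here is a minimal base-R sketch on a made-up vector of lines. It is not how gutenberg_add_sections() is implemented. The `lines` vector and the helper `header_to_number()` are hypothetical and exist only for illustration.

```r
# Toy input: made-up lines standing in for a book's text column
lines <- c("Front matter", "CHAPTER I.", "First line.", "Second line.",
           "CHAPTER II.", "Third line.")

# Hypothetical helper: strip the "CHAPTER " prefix and the trailing period,
# then convert the Roman numeral to an integer
header_to_number <- function(x) {
  numeral <- sub("\\.$", "", sub("^chapter\\s+", "", x, ignore.case = TRUE))
  as.integer(utils::as.roman(numeral))
}

# Flag the lines that look like chapter headers
is_header <- grepl("^chapter [ivxlcdm]+", lines, ignore.case = TRUE)

# cumsum() over the flags gives every line the index of the most recent
# header; lines before the first header keep index 0 and become NA
header_index <- cumsum(is_header)
header_values <- header_to_number(lines[is_header])
chapter <- ifelse(
  header_index == 0,
  NA_integer_,
  header_values[pmax(header_index, 1)]
)

data.frame(text = lines, chapter = chapter)
```

The front-matter line ends up with a missing chapter, which is why the preview above filters on !is.na(chapter).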

Tokenization

We need to move from a one-row-per-line format to a one-row-per-token format. We'll use tidytext::unnest_tokens() to split the text into individual words, then remove stop words using tidytext::stop_words.

#| label: tokenize
words <- persuasion |>
  unnest_tokens(word, text) |>
  anti_join(stop_words, by = "word")
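
As a rough illustration of what these two steps do, the base-R sketch below tokenizes two made-up sentences and drops stop words. It is a simplification, not the tokenizer tidytext uses, and `toy_stop_words` is a tiny hypothetical list standing in for the much larger stop_words table.

```r
# Toy input: two made-up lines of text
text <- c("Anne Elliot was walking.", "She was, of course, persuaded!")

# A tiny, hypothetical stop-word list (tidytext::stop_words is far larger)
toy_stop_words <- c("was", "she", "of", "the")

# Lowercase, split on runs of non-letters, and drop any empty strings
tokens <- unlist(strsplit(tolower(text), "[^a-z]+"))
tokens <- tokens[nzchar(tokens)]

# Keep only tokens that are not stop words (the role anti_join() plays above)
content_words <- tokens[!tokens %in% toy_stop_words]
content_words
```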

Word Frequency Analysis

Tokenization makes it trivial to find the most frequent words in the text:

#| label: word-frequency
word_counts <- words |>
  count(word, sort = TRUE)

word_counts

Let's visualize the top 20 words:

#| label: top-words
#| fig.alt: "Horizontal bar chart of the 20 most frequent non-stop words in
#|   Persuasion."
word_counts |>
  slice_max(n, n = 20, with_ties = FALSE) |>
  mutate(word = reorder(word, n)) |>
  ggplot(aes(x = n, y = word, fill = word)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = expression(paste("Most Common Words in ", italic("Persuasion"))),
    x = "Frequency",
    y = NULL
  ) +
  theme_minimal()

Character names (Anne, Captain, Elliot, Wentworth) dominate the most frequent words, which makes sense for a character-driven novel.

Sentiment Analysis

Sentiment analysis is a natural language processing technique for identifying the emotional or affective content of a text.

Overall Sentiment

Let's use the NRC sentiment lexicon, which classifies words into categories like "joy", "trust", "fear", and "sadness". This will allow us to view the overall sentiment of the book.

Note: The NRC lexicon is downloaded on first use and requires accepting a license agreement at that point; it is free for non-commercial use only. The code below shows the analysis workflow, with pre-computed results displayed.

#| label: sentiment-nrc
#| eval: false
nrc_sentiments <- get_sentiments("nrc")

word_sentiments <- words |>
  inner_join(nrc_sentiments, by = "word", relationship = "many-to-many") |>
  count(sentiment, sort = TRUE)
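
The relationship = "many-to-many" argument is needed because a single word can carry several sentiments, and the same word also occurs many times in the book. The base-R sketch below shows the effect on toy data. The lexicon rows are made up for illustration and are not taken from the NRC lexicon.

```r
# Toy tokens and a made-up lexicon (illustrative rows only)
toy_words <- data.frame(word = c("happy", "storm", "happy", "table"))
toy_lexicon <- data.frame(
  word = c("happy", "happy", "storm"),
  sentiment = c("joy", "trust", "fear")
)

# merge() with its default all = FALSE behaves like an inner join:
# "table" is not in the lexicon and drops out, while each occurrence of
# "happy" matches two lexicon rows
joined <- merge(toy_words, toy_lexicon, by = "word")

# Tally rows per sentiment, most frequent first
sentiment_counts <- sort(table(joined$sentiment), decreasing = TRUE)
sentiment_counts
```

Four tokens become five joined rows here, so the totals count word-sentiment pairs rather than words.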

Visualize the distribution of sentiments:

#| label: sentiment-plot
#| eval: false
word_sentiments |>
  mutate(sentiment = reorder(sentiment, n)) |>
  ggplot(aes(x = n, y = sentiment, fill = sentiment)) +
  geom_col(show.legend = FALSE) +
  labs(
    title = expression(paste(
      "Sentiment Distribution in ",
      italic("Persuasion")
    )),
    x = "Word Count",
    y = NULL
  ) +
  theme_minimal()

Bar chart showing the distribution of NRC sentiment categories across all words in Persuasion.

By Chapter

We can aggregate these sentiments by the chapter structure we created earlier.

#| label: nrc-chapters
#| eval: false
nrc_by_chapter <- words |>
  inner_join(nrc_sentiments, by = "word", relationship = "many-to-many") |>
  count(chapter, sentiment) |>
  filter(!is.na(chapter))

nrc_by_chapter |>
  filter(sentiment %in% c("joy", "sadness", "anger", "fear")) |>
  ggplot(aes(x = chapter, y = n, fill = factor(sentiment))) +
  geom_col(show.legend = FALSE) +
  facet_wrap(~sentiment, ncol = 2, scales = "free_y") +
  labs(
    title = expression(paste("Sentiment by Chapter in ", italic("Persuasion"))),
    x = "Chapter",
    y = "Word Count"
  ) +
  theme_minimal() +
  theme(
    axis.text.x = element_text(angle = 45, hjust = 1),
    strip.text = element_text(face = "bold")
  )

Four faceted bar charts showing counts of joy, sadness, anger, and fear words per chapter in Persuasion.

Sentiment Progression

We can also track how specific emotions fluctuate across the narrative arc. To do this, we divide the tokenized text (after stop-word removal) into consecutive bins of 500 words and count the sentiment words in each bin.

For good measure, let's add a secondary x-axis with chapter labels so we can relate the sentiment to portions of the narrative.

#| label: nrc-bins
#| eval: false
# Add a running index to preserve order and calculate bins
words_with_index <- words |>
  mutate(word_index = row_number()) |>
  mutate(bin = (word_index - 1) %/% 500 + 1)

nrc_binned <- words_with_index |>
  inner_join(nrc_sentiments, by = "word", relationship = "many-to-many") |>
  count(bin, sentiment)

# Add labels for chapters, reusing the running index from above so the
# chapter positions line up with the bins (front-matter words included)
chapter_breaks <- words_with_index |>
  filter(!is.na(chapter)) |>
  group_by(chapter) |>
  slice_min(word_index, n = 1) |>
  ungroup() |>
  filter(chapter %% 2 == 0)

nrc_binned |>
  filter(sentiment %in% c("joy", "sadness", "anger", "fear")) |>
  ggplot(aes(x = bin, y = n, color = sentiment)) +
  geom_line(linewidth = 1, show.legend = FALSE) +
  facet_wrap(~sentiment, ncol = 2, scales = "free_y") +
  scale_x_continuous(
    name = "Word Bin (500 words)",
    sec.axis = sec_axis(
      ~.,
      breaks = chapter_breaks$bin,
      labels = chapter_breaks$chapter,
      name = "Chapter"
    )
  ) +
  labs(
    title = expression(paste(
      "Sentiment Progression in ",
      italic("Persuasion")
    )),
    subtitle = "NRC sentiments by word bin with chapter reference",
    y = "Word Count"
  ) +
  theme_minimal()

Four line charts tracking joy, sadness, anger, and fear word counts across 500-word bins in Persuasion, with a secondary x-axis showing even chapter numbers.
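
The bin formula used above, (word_index - 1) %/% 500 + 1, is easy to get wrong by one. This short sketch evaluates it at a few hand-picked positions around the bin boundaries; the positions are arbitrary and chosen only for illustration.

```r
bin_size <- 500

# A few positions on either side of the bin boundaries
word_index <- c(1, 499, 500, 501, 1000, 1001)

# Subtracting 1 before the integer division keeps indices 1..500 together;
# adding 1 afterwards makes the bins start at 1 rather than 0
bin <- (word_index - 1) %/% bin_size + 1

data.frame(word_index, bin)
```

Without the subtraction, word 500 would fall into the second bin and the first bin would hold only 499 words.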

TF-IDF: Finding Unique Chapter Words

While simple frequency tells us who the main characters are, TF-IDF, or term frequency–inverse document frequency, tells us which words are most important to a specific chapter relative to the rest of the corpus. This is excellent for identifying specific plot points or settings (like the move to Bath or the trip to Lyme).

#| label: tf-idf
#| fig.alt: "Four faceted bar charts showing the highest TF-IDF words for
#|   chapters 10 through 13 of Persuasion."
#| fig.cap: "Highest TF-IDF words for chapters 10–13, showing the most
#|   chapter-distinctive terms relative to the rest of the novel."
chapter_words <- persuasion |>
  unnest_tokens(word, text) |>
  count(chapter, word, sort = TRUE) |>
  bind_tf_idf(word, chapter, n)

# Look at the most "important" words for chapters 10 through 13
chapter_words |>
  filter(chapter %in% 10:13) |>
  group_by(chapter) |>
  slice_max(tf_idf, n = 5) |>
  ungroup() |>
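  # Order words within each facet, since a word can rank in several chapters
  mutate(word = reorder_within(word, tf_idf, chapter)) |>
  ggplot(aes(tf_idf, word, fill = factor(chapter))) +
  geom_col(show.legend = FALSE) +
  scale_y_reordered() +
  facet_wrap(~chapter, scales = "free") +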
  labs(
    title = "Highest TF-IDF words in Chapters 10-13",
    x = "TF-IDF",
    y = NULL
  ) +
  theme_minimal()
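
To see why TF-IDF pushes down words that appear everywhere, here is a hand calculation in base R on made-up counts. It uses the standard definition with a natural logarithm, which to our understanding matches what bind_tf_idf() computes. The `toy_counts` data are hypothetical.

```r
# Made-up counts: one row per (chapter, word) pair
toy_counts <- data.frame(
  chapter = c(1, 1, 2, 2, 3, 3),
  word    = c("anne", "lyme", "anne", "bath", "anne", "bath"),
  n       = c(8, 2, 5, 5, 6, 4)
)

# Term frequency: a word's share of all tokens in its chapter
chapter_totals <- tapply(toy_counts$n, toy_counts$chapter, sum)
toy_counts$tf <- toy_counts$n /
  as.numeric(chapter_totals[as.character(toy_counts$chapter)])

# Inverse document frequency: natural log of
# (number of chapters) / (number of chapters containing the word)
n_chapters <- length(unique(toy_counts$chapter))
chapters_with_word <- table(toy_counts$word)
toy_counts$idf <- log(
  n_chapters / as.numeric(chapters_with_word[toy_counts$word])
)

toy_counts$tf_idf <- toy_counts$tf * toy_counts$idf
toy_counts
```

In this toy corpus "anne" occurs in every chapter, so its idf and tf_idf are zero despite its high counts, while "lyme", which occurs in only one chapter, scores highest. The same effect is why stop words do not need to be removed before the TF-IDF step.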


gutenbergr documentation built on March 15, 2026, 9:06 a.m.