gutenberg_strip: Strip header and footer content from a Project Gutenberg book

View source: R/gutenberg_strip.R

gutenberg_strip    R Documentation

Description

Removes the header and footer that Project Gutenberg adds to a book's text. Detection relies on formatting heuristics (regular expressions), so the result may not be perfect.

Usage

gutenberg_strip(text)

Arguments

text

A character vector where each element is a line of a book.
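
If the book is held as a single string rather than one element per line, it needs splitting first. The following is a minimal base R sketch; the sample string and the variable names `raw` and `lines` are illustrative assumptions, not part of the package.

```r
# Illustrative sample, not real book text: one string holding several lines
raw <- "First line\nSecond line\r\nThird line"

# Split on Unix (\n) or Windows (\r\n) line endings,
# giving a character vector with one element per line
lines <- strsplit(raw, "\r?\n")[[1]]

length(lines)
```

A vector shaped like `lines` is the form the `text` argument expects.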

Details

This function identifies the Project Gutenberg "start" and "end" markers. It also attempts to strip out initial metadata paragraphs (such as "Produced by...", "Transcribed from...", etc.).

Note that this will not strip:

  • Tables of contents

  • Prologues or introductions

  • Other author-written text that appears at the start of a book
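
To make the marker idea concrete, here is a rough base R sketch of keeping only the lines between the "start" and "end" markers. It is not the package's implementation: the helper name `strip_markers_sketch`, the regular expressions, and the sample text are all assumptions for illustration, and the metadata-paragraph heuristics are left out entirely.

```r
# Hypothetical helper, not part of gutenbergr: keep only the lines between
# the "*** START OF" and "*** END OF" marker lines (assumed marker format).
strip_markers_sketch <- function(text) {
  start <- grep("^\\*\\*\\* ?START OF", text)
  end <- grep("^\\*\\*\\* ?END OF", text)

  # Fall back to the whole vector when a marker is missing
  first <- if (length(start) > 0) start[1] + 1 else 1
  last <- if (length(end) > 0) end[length(end)] - 1 else length(text)

  if (first > last) return(character(0))
  text[first:last]
}

# Made-up sample standing in for a downloaded book
sample_text <- c(
  "The Project Gutenberg eBook of An Example",
  "*** START OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***",
  "Contents",
  "Chapter 1",
  "It was a dark and stormy night.",
  "*** END OF THE PROJECT GUTENBERG EBOOK AN EXAMPLE ***",
  "Licence text would follow here."
)

stripped <- strip_markers_sketch(sample_text)
stripped
```

In this sketch the "Contents" line survives, since marker-based stripping alone cannot distinguish a table of contents from the body of the book.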

Value

A character vector with Project Gutenberg headers and footers removed.

Examples


library(dplyr)
library(gutenbergr)

# Download a book without stripping to see the headers
book <- gutenberg_works(title == "Pride and Prejudice") |>
  gutenberg_download(strip = FALSE)

# Look at the raw header and footer
head(book$text, 20)
tail(book$text, 20)

# Manually strip the text
text_stripped <- gutenberg_strip(book$text)

# Check the cleaned results
head(text_stripped, 10)
tail(text_stripped, 10)


gutenbergr documentation built on Jan. 19, 2026, 9:07 a.m.