gutenbergr: Search and download public domain texts from Project Gutenberg

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  message = FALSE,
  warning = FALSE
)

The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. It provides both tools for downloading books (and stripping header/footer information) and a complete dataset of Project Gutenberg metadata that can be used to find works of interest. It includes:

A function gutenberg_download() that downloads one or more works from Project Gutenberg by ID: e.g., gutenberg_download(84) downloads the text of Frankenstein.
Metadata for all Project Gutenberg works as R datasets, so that they can be searched and filtered:
gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc
gutenberg_authors contains information about each author, such as aliases and birth/death year
gutenberg_subjects contains pairings of works with Library of Congress subjects and topics

Project Gutenberg Metadata

This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.

The dataset gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc:

library(gutenbergr)
library(dplyr)
gutenberg_metadata

For example, you could find the Gutenberg ID(s) of Jane Austen's Persuasion by doing:

gutenberg_metadata %>%
  filter(title == "Persuasion")

In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The gutenberg_works() function does this pre-filtering:

gutenberg_works()

It also accepts filter conditions as arguments:

gutenberg_works(author == "Austen, Jane")

# or with a regular expression

library(stringr)
gutenberg_works(str_detect(author, "Austen"))
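Because gutenberg_works() pre-filters to English by default, works in other languages need to be requested explicitly. The following is a minimal sketch, assuming the installed version of gutenbergr exposes a languages argument on gutenberg_works() (check ?gutenberg_works to confirm); the variable name french_works is only illustrative.

```r
library(gutenbergr)
library(dplyr)

# Assumption: gutenberg_works() takes a `languages` argument
# (default "en"); here we ask for French works instead
french_works <- gutenberg_works(languages = "fr")

# Count how many of the pre-filtered French works each author has
french_works %>%
  count(author, sort = TRUE)
```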

The metadata currently in the package records the date it was last updated. You can check it with attr(gutenberg_metadata, "date_updated").

Downloading books by ID

The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, we saw earlier that one version of Persuasion has ID 105, so gutenberg_download(105) downloads this text.

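# Offline variant: read the copy of book 105 bundled with the package
# instead of fetching it from a mirror
f105 <- system.file("extdata", "105.zip", package = "gutenbergr")
persuasion <- gutenberg_download(105,
  files = f105,
  mirror = "http://aleph.gutenberg.org"
)

# Typical usage: download book 105 from Project Gutenberg
persuasion <- gutenberg_download(105)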

persuasion

The result is a tbl_df (a type of data frame) with two columns: gutenberg_id (useful when multiple books are returned) and text, a character vector with one row per line of the book.

You can also pass gutenberg_download() a vector of IDs to download multiple books. For example, to download Renascence, and Other Poems (book 109) along with Persuasion, do:

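# Offline variant: read the bundled copies of books 109 and 105
f109 <- system.file("extdata", "109.zip", package = "gutenbergr")
books <- gutenberg_download(c(109, 105),
  meta_fields = "title",
  files = c(f109, f105),
  mirror = "http://aleph.gutenberg.org"
)

# Typical usage: download both books and attach each one's title
books <- gutenberg_download(c(109, 105), meta_fields = "title")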

books

Notice that the meta_fields argument lets us add one or more fields from gutenberg_metadata, such as title or author, to the downloaded text.

books %>%
  count(title)

Other meta-datasets

You may want to select books based on information other than their title or author, such as their genre or topic. gutenberg_subjects contains pairings of works with Library of Congress subjects and topics. "lcc" means Library of Congress Classification, while "lcsh" means Library of Congress Subject Headings:

gutenberg_subjects

This is useful for finding texts on a particular topic or in a particular genre, such as detective stories, or about a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link with other metadata.

gutenberg_subjects %>%
  filter(subject == "Detective and mystery stories")

gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject))
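As a rough sketch of linking subjects with other metadata, the matching IDs can be used to filter the output of gutenberg_works(). The variable names holmes_ids and holmes_works are illustrative, and the sketch assumes both tables share the gutenberg_id column described above.

```r
library(gutenbergr)
library(dplyr)

# IDs of works with a subject heading that mentions Sherlock Holmes
holmes_ids <- gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject)) %>%
  distinct(gutenberg_id)

# Keep only the pre-filtered works whose ID appears in holmes_ids
holmes_works <- gutenberg_works() %>%
  semi_join(holmes_ids, by = "gutenberg_id") %>%
  select(gutenberg_id, title, author)

holmes_works
```

A semi_join() is used rather than an inner_join() so that a work tagged with several matching subjects still appears only once. The resulting gutenberg_id values could then be passed to gutenberg_download().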

gutenberg_authors contains information about each author, such as aliases and birth/death year:

gutenberg_authors
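Author details can be attached to works in a similar way. This is a sketch only: it assumes that gutenberg_metadata and gutenberg_authors share a gutenberg_author_id column and that gutenberg_authors has birthdate and deathdate columns, which you should confirm with names(gutenberg_authors) for your installed version.

```r
library(gutenbergr)
library(dplyr)

# Assumption: both tables carry `gutenberg_author_id`, and
# gutenberg_authors has `birthdate` and `deathdate` columns
austen_works <- gutenberg_works(author == "Austen, Jane") %>%
  inner_join(
    gutenberg_authors,
    by = "gutenberg_author_id",
    suffix = c("", "_authors") # both tables have an `author` column
  ) %>%
  select(gutenberg_id, title, author, birthdate, deathdate)

austen_works
```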

Analysis

What's next after retrieving a book's text? Having the book as a data frame is especially useful for text analysis with the tidytext package.

library(tidytext)

words <- books %>%
  unnest_tokens(word, text)

words

word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)

word_counts
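A common next step is to pull out the most frequent non-stop-word in each book. The sketch below runs the same pipeline on a small made-up data frame, so it works without downloading anything. toy_books and its text are hypothetical stand-ins for books, not real book content.

```r
library(dplyr)
library(tidytext)

# Hypothetical stand-in for `books`: same columns, invented text
toy_books <- tibble(
  gutenberg_id = c(1L, 1L, 2L, 2L),
  title = c("Book A", "Book A", "Book B", "Book B"),
  text = c(
    "the sea was calm", "the sea was grey",
    "a garden in spring", "the garden wall"
  )
)

# Same steps as above: tokenize, drop stop words, count per title
toy_counts <- toy_books %>%
  unnest_tokens(word, text) %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)

# Keep the single most frequent remaining word for each title
top_words <- toy_counts %>%
  group_by(title) %>%
  slice_max(n, n = 1, with_ties = FALSE) %>%
  ungroup()

top_words
```

Replacing toy_books with books applies the same steps to the downloaded texts.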

You may also find these resources useful:

The Natural Language Processing CRAN View suggests many R packages related to text mining, especially around the tm package
You could match the wikipedia column in gutenberg_authors to Wikipedia content with the WikipediR package or to pageview statistics with the wikipediatrend package
If you're considering an analysis based on author name, you may find the humaniformat (extraction of first names) and gender (prediction of gender from first names) packages useful. (Note that humaniformat has a format_reverse function for reversing "Last, First" names.)
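To illustrate what reversing "Last, First" names involves, here is a minimal base R sketch. It is not humaniformat's implementation: reverse_name is a hypothetical helper that assumes each name contains at most one comma, and it leaves names without a comma unchanged.

```r
# Hypothetical helper: turn "Last, First" into "First Last".
# Names with no comma are returned as they are.
reverse_name <- function(x) {
  sub("^([^,]+),[[:space:]]*(.+)$", "\\2 \\1", x)
}

# Illustrative input in the "Last, First" style used by the author column
author_names <- c("Austen, Jane", "Doyle, Arthur Conan", "Various")
reverse_name(author_names)
```

For real analyses, a dedicated package is likely to handle suffixes, titles, and multi-part surnames better than this regular expression.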