gutenbergr: Search and download public domain texts from Project Gutenberg

knitr::opts_chunk$set(message = FALSE, warning = FALSE)

The gutenbergr package helps you download and process public domain works from the Project Gutenberg collection. This includes both tools for downloading books (and stripping header/footer information), and a complete dataset of Project Gutenberg metadata that can be used to find words of interest. Includes:

Project Gutenberg Metadata

This package contains metadata for all Project Gutenberg works as R datasets, so that you can search and filter for particular works before downloading.

The dataset gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc:

library(gutenbergr)
gutenberg_metadata

For example, you could find the Gutenberg ID of Wuthering Heights by doing:

library(dplyr)

gutenberg_metadata %>%
  filter(title == "Wuthering Heights")

In many analyses, you may want to filter just for English works, avoid duplicates, and include only books that have text that can be downloaded. The gutenberg_works() function does this pre-filtering:

gutenberg_works()

It also allows you to perform filtering as an argument:

gutenberg_works(author == "Austen, Jane")

# or with a regular expression

library(stringr)
gutenberg_works(str_detect(author, "Austen"))

The meta-data currently in the package was last updated on r format(attr(gutenberg_metadata, "date_updated"), '%d %B %Y').

Downloading books by ID

The function gutenberg_download() downloads one or more works from Project Gutenberg based on their ID. For example, we earlier saw that "Wuthering Heights" has ID 768 (see the URL here), so gutenberg_download(768) downloads this text.

wuthering_heights <- gutenberg_download(768)

wuthering_heights

Notice it is returned as a tbl_df (a type of data frame) including two variables: gutenberg_id (useful if multiple books are returned), and a character vector of the text, one row per line. Notice that the header and footer added by Project Gutenberg (visible here) have been stripped away.

Provide a vector of IDs to download multiple books. For example, to download Jane Eyre (book 1260) along with Wuthering Heights, do:

books <- gutenberg_download(c(768, 1260), meta_fields = "title")

books

Notice that the meta_fields argument allows us to add one or more additional fields from the gutenberg_metadata to the downloaded text, such as title or author.

books %>%
  count(title)

Other meta-datasets

You may want to select books based on information other than their title or author, such as their genre or topic. gutenberg_subjects contains pairings of works with Library of Congress subjects and topics. "lcc" means Library of Congress Classification, while "lcsh" means Library of Congress subject headings:

gutenberg_subjects

This is useful for extracting texts from a particular topic or genre, such as detective stories, or a particular character, such as Sherlock Holmes. The gutenberg_id column can then be used to download these texts or to link with other metadata.

gutenberg_subjects %>%
  filter(subject == "Detective and mystery stories")

gutenberg_subjects %>%
  filter(grepl("Holmes, Sherlock", subject))

gutenberg_authors contains information about each author, such as aliases and birth/death year:

gutenberg_authors

Analysis

What's next after retrieving a book's text? Well, having the book as a data frame is especially useful for working with the tidytext package for text analysis.

library(tidytext)

words <- books %>%
  unnest_tokens(word, text)

words

word_counts <- words %>%
  anti_join(stop_words, by = "word") %>%
  count(title, word, sort = TRUE)

word_counts

You may also find these resources useful:



Try the gutenbergr package in your browser

Any scripts or data that you put into this service are public.

gutenbergr documentation built on Jan. 26, 2018, 5:03 p.m.