knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "README-",
  message = FALSE,
  warning = FALSE
)

gutenbergr: R package to search and download public domain texts from Project Gutenberg

Authors: David Robinson
License: GPL-2


Download and process public domain works from the Project Gutenberg collection.

Installation

Install the package with:

install.packages("gutenbergr")

Or install the development version using devtools with:

devtools::install_github("ropensci/gutenbergr")

Examples

By default, the gutenberg_works() function retrieves a table of metadata for all unique English-language Project Gutenberg works that have text associated with them. (The gutenberg_metadata dataset has all Gutenberg works, unfiltered.)
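To see how much the default filtering narrows things down, the two tables can be compared directly. This is only a minimal sketch: it relies on the bundled metadata and needs no network access, and the exact row counts depend on the installed package version, so none are shown here.

```r
library(gutenbergr)

# Default filters: unique English-language works that have text
filtered <- gutenberg_works()

# Compare against the unfiltered metadata table shipped with the package
nrow(filtered)
nrow(gutenberg_metadata)
```

The filtered table should never have more rows than gutenberg_metadata, since it is a subset of it.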

options(dplyr.width = 140)
options(width = 100)

Suppose we wanted to download Emily Brontë's "Wuthering Heights." We could find the book's ID by filtering:

library(dplyr)
library(gutenbergr)

gutenberg_works() %>%
  filter(title == "Wuthering Heights")

# or just:
gutenberg_works(title == "Wuthering Heights")

Since we see that it has gutenberg_id 768, we can download it with the gutenberg_download() function:

wuthering_heights <- gutenberg_download(768)
wuthering_heights

gutenberg_download() can download multiple books when given multiple IDs. It also takes a meta_fields argument that adds variables from the metadata.

# 1260 is the ID of Jane Eyre
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books

books %>%
  count(title)
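The variables that meta_fields can add come from the metadata table. As a rough sketch, assuming the columns of the bundled gutenberg_metadata dataset are the candidates, you can list them and preview the values for the two IDs used above without downloading anything:

```r
library(dplyr)
library(gutenbergr)

# Column names of the bundled metadata table
names(gutenberg_metadata)

# Metadata rows for the two IDs used in the download above
book_meta <- gutenberg_metadata %>%
  filter(gutenberg_id %in% c(768, 1260)) %>%
  select(gutenberg_id, title, author)

book_meta
```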

gutenberg_download() can also take the output of gutenberg_works() directly. For example, we could get the text of all of Aristotle's works, each annotated with both gutenberg_id and title, using:

aristotle_books <- gutenberg_works(author == "Aristotle") %>%
  gutenberg_download(meta_fields = "title")

aristotle_books

FAQ

What do I do with the text once I have it?
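One simple possibility is a word count, sketched below in base R. It is an illustration only: the small data frame is a made-up stand-in with the same shape as the output of gutenberg_download() (a gutenberg_id column and a text column, one row per line), so the example runs without downloading anything. Dedicated text-mining packages would be a more natural choice for real analyses.

```r
# Stand-in for the output of gutenberg_download(): one row per line of text.
# These two lines are invented for illustration, not real book text.
books <- data.frame(
  gutenberg_id = c(1L, 1L),
  text = c("The quick brown fox", "jumps over the lazy dog. The end."),
  stringsAsFactors = FALSE
)

# Split each line into lowercase words, dropping punctuation and empty strings
words <- unlist(strsplit(tolower(books$text), "[^a-z']+"))
words <- words[words != ""]

# Tabulate the words, most frequent first
word_counts <- sort(table(words), decreasing = TRUE)
head(word_counts)
```

With real data, replace books with the table returned by gutenberg_download().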

How were the metadata R files generated?

See the data-raw directory for the scripts that generate these datasets. The current datasets were generated from the Project Gutenberg catalog on the date recorded in the date_updated attribute of gutenberg_metadata.
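The date is stored as an attribute on the metadata table, so you can check how recent your installed copy is:

```r
library(gutenbergr)

# Date on which the bundled metadata was generated from the catalog
updated <- attr(gutenberg_metadata, "date_updated")

# Print it as day, month name, year
format(updated, "%d %B %Y")
```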

Do you respect the rules regarding robot access to Project Gutenberg?

Yes! The package respects these rules and complies to the best of our ability.

Still, this package is not the right way to download the entire Project Gutenberg corpus (or all works in a particular language). For that, follow Project Gutenberg's recommendation to use wget or set up a mirror. This package is intended for downloading a single work, or the works of a particular author or on a particular topic.

Code of Conduct

Please note that the gutenbergr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.




dgrtwo/gutenbergr documentation built on Jan. 4, 2024, 2:08 p.m.