README.md
In dgrtwo/gutenbergr: Download and Process Public Domain Works from Project Gutenberg

gutenbergr

Download and process public domain works from the Project Gutenberg collection. The package includes:

- A function gutenberg_download() that downloads one or more works from Project Gutenberg by ID: e.g., gutenberg_download(84) downloads the text of Frankenstein.
- Metadata for all Project Gutenberg works as R datasets, so that they can be searched and filtered:
  - gutenberg_metadata contains information about each work, pairing Gutenberg ID with title, author, language, etc.
  - gutenberg_authors contains information about each author, such as aliases and birth/death year
  - gutenberg_subjects contains pairings of works with Library of Congress subjects and topics

Installation

Install the released version of gutenbergr from [CRAN](https://cran.r-project.org/) with install.packages("gutenbergr").

Install the development version of gutenbergr from [GitHub](https://github.com/).
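The original install command for the development version did not survive in this copy. One common way to install an R package from GitHub is sketched below. It assumes the remotes package as the installer and takes the repository path dgrtwo/gutenbergr from the page title; the README may have recommended a different tool.

```r
# Assumption: remotes is used as the installer; uncomment if it is missing.
# install.packages("remotes")

# Repository path taken from the page title (dgrtwo/gutenbergr).
remotes::install_github("dgrtwo/gutenbergr")
```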

Examples

The gutenberg_works() function retrieves, by default, a table of metadata for all unique English-language Project Gutenberg works that have text associated with them. (The gutenberg_metadata dataset has all Gutenberg works, unfiltered.)
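The default filtering can be adjusted through arguments of gutenberg_works(). The sketch below assumes the languages argument accepts a language code such as "fr", as described in the function's help page. The row counts depend on the installed metadata snapshot, so none are shown.

```r
library(gutenbergr)

# Default: unique English-language works that have text
english <- gutenberg_works()

# Assumption: `languages` selects works by language code
french <- gutenberg_works(languages = "fr")

# Compare the filtered tables with the unfiltered metadata
nrow(gutenberg_metadata)
nrow(english)
nrow(french)
```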

Suppose we wanted to download Emily Brontë’s “Wuthering Heights.” We could find the book’s ID by filtering:

library(dplyr)
library(gutenbergr)

gutenberg_works() |>
  filter(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#>   gutenberg_id title             author        gutenberg_author_id language
#>          <int> <chr>             <chr>                       <int> <fct>   
#> 1          768 Wuthering Heights Brontë, Emily                 405 en      
#>   gutenberg_bookshelf                                                                rights has_text
#>   <chr>                                                                              <fct>  <lgl>   
#> 1 Best Books Ever Listings/Gothic Fiction/Movie Books/Browsing: Literature/Browsing… Publi… TRUE

# or just:
gutenberg_works(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#>   gutenberg_id title             author        gutenberg_author_id language
#>          <int> <chr>             <chr>                       <int> <fct>   
#> 1          768 Wuthering Heights Brontë, Emily                 405 en      
#>   gutenberg_bookshelf                                                                rights has_text
#>   <chr>                                                                              <fct>  <lgl>   
#> 1 Best Books Ever Listings/Gothic Fiction/Movie Books/Browsing: Literature/Browsing… Publi… TRUE

Since we see that it has gutenberg_id 768, we can download it with the gutenberg_download() function:

wuthering_heights <- gutenberg_download(768)
wuthering_heights
#> # A tibble: 12,342 × 2
#>    gutenberg_id text               
#>           <int> <chr>              
#>  1          768 "Wuthering Heights"
#>  2          768 ""                 
#>  3          768 "by Emily Brontë"  
#>  4          768 ""                 
#>  5          768 ""                 
#>  6          768 ""                 
#>  7          768 ""                 
#>  8          768 "CHAPTER I"        
#>  9          768 ""                 
#> 10          768 ""                 
#> # ℹ 12,332 more rows

gutenberg_download() can download multiple books when given multiple IDs. It also takes a meta_fields argument that adds variables from the metadata to the result.

# 1260 is the ID of Jane Eyre
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books
#> # A tibble: 33,343 × 3
#>    gutenberg_id text                title            
#>           <int> <chr>               <chr>            
#>  1          768 "Wuthering Heights" Wuthering Heights
#>  2          768 ""                  Wuthering Heights
#>  3          768 "by Emily Brontë"   Wuthering Heights
#>  4          768 ""                  Wuthering Heights
#>  5          768 ""                  Wuthering Heights
#>  6          768 ""                  Wuthering Heights
#>  7          768 ""                  Wuthering Heights
#>  8          768 "CHAPTER I"         Wuthering Heights
#>  9          768 ""                  Wuthering Heights
#> 10          768 ""                  Wuthering Heights
#> # ℹ 33,333 more rows

books |>
  count(title)
#> # A tibble: 2 × 2
#>   title                           n
#>   <chr>                       <int>
#> 1 Jane Eyre: An Autobiography 21001
#> 2 Wuthering Heights           12342

gutenberg_download() can also take the output of gutenberg_works() directly. For example, we could get the text of all of Aristotle’s works, each annotated with both gutenberg_id and title, using:

aristotle_books <- gutenberg_works(author == "Aristotle") |>
  gutenberg_download(meta_fields = "title")

aristotle_books
#> # A tibble: 43,801 × 3
#>    gutenberg_id text                                                                    
#>           <int> <chr>                                                                   
#>  1         1974 "THE POETICS OF ARISTOTLE"                                              
#>  2         1974 ""                                                                      
#>  3         1974 "By Aristotle"                                                          
#>  4         1974 ""                                                                      
#>  5         1974 "A Translation By S. H. Butcher"                                        
#>  6         1974 ""                                                                      
#>  7         1974 ""                                                                      
#>  8         1974 "[Transcriber's Annotations and Conventions: the translator left"       
#>  9         1974 "intact some Greek words to illustrate a specific point of the original"
#> 10         1974 "discourse. In this transcription, in order to retain the accuracy of"  
#>    title                   
#>    <chr>                   
#>  1 The Poetics of Aristotle
#>  2 The Poetics of Aristotle
#>  3 The Poetics of Aristotle
#>  4 The Poetics of Aristotle
#>  5 The Poetics of Aristotle
#>  6 The Poetics of Aristotle
#>  7 The Poetics of Aristotle
#>  8 The Poetics of Aristotle
#>  9 The Poetics of Aristotle
#> 10 The Poetics of Aristotle
#> # ℹ 43,791 more rows

FAQ

What do I do with the text once I have it?

- The Natural Language Processing CRAN View suggests many R packages related to text mining, especially around the tm package.
- The tidytext package is useful for tokenization and analysis, especially since gutenbergr downloads books as a data frame already.
- You could match the wikipedia column in gutenberg_authors to Wikipedia content with the WikipediR package or to pageview statistics with the wikipediatrend package.
- If you’re considering an analysis based on author name, you may find the humaniformat (for extraction of first names) and gender (prediction of gender from first names) packages useful. (Note that humaniformat has a format_reverse function for reversing “Last, First” names.)
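As a rough illustration of the tidytext suggestion, the sketch below tokenizes a small stand-in table shaped like the output of gutenberg_download(). The sample lines are invented for the example so that it runs offline; with a real download you would pass the downloaded tibble instead.

```r
library(dplyr)
library(tidytext)

# Stand-in for gutenberg_download() output, so the sketch runs offline.
# The lines are illustrative, not quoted from a Project Gutenberg file.
book <- tibble(
  gutenberg_id = 768L,
  text = c("CHAPTER I", "", "The moors were quiet and the house was quiet")
)

word_counts <- book |>
  filter(text != "") |>                  # drop blank lines
  unnest_tokens(word, text) |>           # one lowercase word per row
  anti_join(stop_words, by = "word") |>  # drop common English stop words
  count(word, sort = TRUE)

word_counts
```

Because the download is already one row per line of text, no reshaping is needed before tokenizing.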

How were the metadata R files generated?

See the data-raw directory for the scripts that generate these datasets. The current versions were generated from the Project Gutenberg catalog on 27 May 2025.

Do you respect the rules regarding robot access to Project Gutenberg?

Yes! The package respects these rules and complies with them to the best of our ability. Namely:

- Project Gutenberg allows harvesting with automated software using this list of links. The gutenbergr package visits that page once to find the recommended mirror for the user’s location.
- We retrieve the book text directly from that mirror using links in the same format. For example, Frankenstein (book 84) is retrieved from https://www.gutenberg.lib.md.us/8/84/84.zip.
- We give priority to retrieving the .zip file to minimize bandwidth on the mirror. .txt files are only retrieved if there is no .zip.
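The Frankenstein URL above appears to follow the Project Gutenberg mirror layout, in which every digit of the ID except the last becomes a directory. The base R sketch below builds a .zip URL under that assumption. mirror_zip_url() is a hypothetical helper written for illustration, not a gutenbergr function. Placing single-digit IDs under a "0" directory is also an assumption, and the default mirror is simply the one from the example.

```r
# Hypothetical helper, not part of gutenbergr's exported API.
mirror_zip_url <- function(gutenberg_id,
                           mirror = "https://www.gutenberg.lib.md.us") {
  id <- as.character(as.integer(gutenberg_id))

  # Directory path: every digit except the last, one directory per digit.
  lead <- substr(id, 1, nchar(id) - 1)
  dirs <- vapply(strsplit(lead, ""), paste, character(1), collapse = "/")

  # Assumption: single-digit IDs are stored under a "0" directory.
  dirs[nchar(id) == 1] <- "0"

  paste(mirror, dirs, id, paste0(id, ".zip"), sep = "/")
}

mirror_zip_url(84)
#> [1] "https://www.gutenberg.lib.md.us/8/84/84.zip"
```

The function is vectorized, so mirror_zip_url(c(768, 1260)) returns one URL per ID.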

Still, this package is not the right way to download the entire Project Gutenberg corpus (or every work in a particular language). For that, follow Project Gutenberg’s recommendation to set up a mirror. This package is intended for downloading a single work, or the works of a particular author or topic. See the Project Gutenberg Terms of Service for details.

Code of Conduct

Please note that the gutenbergr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

dgrtwo/gutenbergr documentation built on June 12, 2025, 2:33 a.m.
