README.md

gutenbergr: R package to search and download public domain texts from Project Gutenberg

Authors: David Robinson

License: GPL-2

Project Status: Active. The project has reached a stable, usable state and is being actively developed.

Download and process public domain works from the Project Gutenberg collection. Includes

Installation

Install the package with:

install.packages("gutenbergr")

Or install the development version using devtools with:

devtools::install_github("ropensci/gutenbergr")

Examples

By default, the gutenberg_works() function returns a table of metadata for all unique English-language Project Gutenberg works that have text associated with them. (The gutenberg_metadata dataset contains all Gutenberg works, unfiltered.)
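To make the difference between the two tables concrete, the sketch below compares the unfiltered gutenberg_metadata dataset with the default gutenberg_works() result. It assumes that `languages` is the argument that selects works in another language; check `?gutenberg_works` before relying on it. The row counts depend on the catalog snapshot shipped with the installed version, so none are shown here.

```r
library(dplyr)
library(gutenbergr)

# Full, unfiltered catalog bundled with the package
n_all <- nrow(gutenberg_metadata)

# Default view: unique English-language works that have text
works_en <- gutenberg_works()

# Assumption: `languages` selects works in another language
works_fr <- gutenberg_works(languages = "fr")

n_all
nrow(works_en)
count(works_fr, language)
```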

Suppose we want to download Emily Brontë’s “Wuthering Heights.” We can find the book’s ID by filtering on the title:

library(dplyr)
library(gutenbergr)

gutenberg_works() %>%
  filter(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#>   gutenberg_id title             author        gutenberg_author_id language
#>          <int> <chr>             <chr>                       <int> <chr>   
#> 1          768 Wuthering Heights Brontë, Emily                 405 en      
#>   gutenberg_bookshelf                                 rights                    has_text
#>   <chr>                                               <chr>                     <lgl>   
#> 1 Best Books Ever Listings/Gothic Fiction/Movie Books Public domain in the USA. TRUE

# or just:
gutenberg_works(title == "Wuthering Heights")
#> # A tibble: 1 × 8
#>   gutenberg_id title             author        gutenberg_author_id language
#>          <int> <chr>             <chr>                       <int> <chr>   
#> 1          768 Wuthering Heights Brontë, Emily                 405 en      
#>   gutenberg_bookshelf                                 rights                    has_text
#>   <chr>                                               <chr>                     <lgl>   
#> 1 Best Books Ever Listings/Gothic Fiction/Movie Books Public domain in the USA. TRUE
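An exact title match fails when the catalog title differs slightly from what you typed. A minimal sketch of a looser search, assuming the stringr package is installed, passes a str_detect() condition to gutenberg_works() in the same way as the equality filter above. The variable name `brontes` is only illustrative.

```r
library(dplyr)
library(stringr)
library(gutenbergr)

# Match any title containing "Wuthering", not only an exact string
gutenberg_works(str_detect(title, "Wuthering"))

# Match on the author column; names are stored as "Surname, Given name"
brontes <- gutenberg_works(str_detect(author, "Bront"))

brontes %>%
  select(gutenberg_id, title, author)
```

Rows where the author or title is missing are dropped, because filter conditions that evaluate to NA exclude the row.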

Since we see that it has gutenberg_id 768, we can download it with the gutenberg_download() function:

wuthering_heights <- gutenberg_download(768)
wuthering_heights
#> # A tibble: 12,342 × 2
#>    gutenberg_id text               
#>           <int> <chr>              
#>  1          768 "Wuthering Heights"
#>  2          768 ""                 
#>  3          768 "by Emily Brontë"  
#>  4          768 ""                 
#>  5          768 ""                 
#>  6          768 ""                 
#>  7          768 ""                 
#>  8          768 "CHAPTER I"        
#>  9          768 ""                 
#> 10          768 ""                 
#> # ℹ 12,332 more rows

gutenberg_download() can download multiple books when given a vector of IDs. It also takes a meta_fields argument that adds columns from the metadata to the result.

# 1260 is the ID of Jane Eyre
books <- gutenberg_download(c(768, 1260), meta_fields = "title")
books
#> # A tibble: 33,343 × 3
#>    gutenberg_id text                title            
#>           <int> <chr>               <chr>            
#>  1          768 "Wuthering Heights" Wuthering Heights
#>  2          768 ""                  Wuthering Heights
#>  3          768 "by Emily Brontë"   Wuthering Heights
#>  4          768 ""                  Wuthering Heights
#>  5          768 ""                  Wuthering Heights
#>  6          768 ""                  Wuthering Heights
#>  7          768 ""                  Wuthering Heights
#>  8          768 "CHAPTER I"         Wuthering Heights
#>  9          768 ""                  Wuthering Heights
#> 10          768 ""                  Wuthering Heights
#> # ℹ 33,333 more rows

books %>%
  count(title)
#> # A tibble: 2 × 2
#>   title                           n
#>   <chr>                       <int>
#> 1 Jane Eyre: An Autobiography 21001
#> 2 Wuthering Heights           12342
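The meta_fields argument is not limited to one column. The sketch below requests both title and author, assuming a character vector of gutenberg_metadata column names is accepted. It needs a network connection, and the line counts may differ from those above if the texts on Project Gutenberg have changed.

```r
library(dplyr)
library(gutenbergr)

# Request two metadata columns at once (768: Wuthering Heights, 1260: Jane Eyre)
books <- gutenberg_download(c(768, 1260), meta_fields = c("title", "author"))

# One row per book, with the number of lines downloaded for each
books %>%
  count(author, title)
```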

gutenberg_download() can also take the output of gutenberg_works() directly. For example, we could get the text of all of Aristotle’s works, each annotated with both gutenberg_id and title, using:

aristotle_books <- gutenberg_works(author == "Aristotle") %>%
  gutenberg_download(meta_fields = "title")

aristotle_books
#> # A tibble: 17,147 × 3
#>    gutenberg_id text                                                                    
#>           <int> <chr>                                                                   
#>  1         1974 "THE POETICS OF ARISTOTLE"                                              
#>  2         1974 ""                                                                      
#>  3         1974 "By Aristotle"                                                          
#>  4         1974 ""                                                                      
#>  5         1974 "A Translation By S. H. Butcher"                                        
#>  6         1974 ""                                                                      
#>  7         1974 ""                                                                      
#>  8         1974 "[Transcriber's Annotations and Conventions: the translator left"       
#>  9         1974 "intact some Greek words to illustrate a specific point of the original"
#> 10         1974 "discourse. In this transcription, in order to retain the accuracy of"  
#>    title                   
#>    <chr>                   
#>  1 The Poetics of Aristotle
#>  2 The Poetics of Aristotle
#>  3 The Poetics of Aristotle
#>  4 The Poetics of Aristotle
#>  5 The Poetics of Aristotle
#>  6 The Poetics of Aristotle
#>  7 The Poetics of Aristotle
#>  8 The Poetics of Aristotle
#>  9 The Poetics of Aristotle
#> 10 The Poetics of Aristotle
#> # ℹ 17,137 more rows
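Because the pipeline downloads every matching work, it can help to inspect the selection first. The sketch below runs only the metadata query, which uses the dataset bundled with the package and needs no network access. The name `aristotle_ids` is illustrative.

```r
library(dplyr)
library(gutenbergr)

# List the works that the pipeline above would download
aristotle_ids <- gutenberg_works(author == "Aristotle") %>%
  select(gutenberg_id, title)

aristotle_ids

# How many separate downloads the request would trigger
nrow(aristotle_ids)
```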

FAQ

What do I do with the text once I have it?
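One possible starting point, offered as a sketch rather than the package's recommended workflow, is to treat the result as an ordinary data frame with one row per line of text. The example uses a small hand-made tibble (`sample_text`, hypothetical) shaped like gutenberg_download() output so that it runs offline. The chapter pattern is an assumption that fits headings such as "CHAPTER I" and will not suit every book.

```r
library(dplyr)
library(stringr)
library(tibble)

# Hypothetical stand-in shaped like gutenberg_download() output
sample_text <- tibble(
  gutenberg_id = 768L,
  text = c(
    "Wuthering Heights", "", "CHAPTER I", "", "First line.",
    "Second line.", "", "CHAPTER II", "", "Third line."
  )
)

# Number the lines, tag each with a running chapter count, drop blank lines
annotated <- sample_text %>%
  mutate(
    line = row_number(),
    chapter = cumsum(str_detect(text, "^CHAPTER [IVXLC]+"))
  ) %>%
  filter(text != "")

# Count the non-heading lines in each chapter
lines_per_chapter <- annotated %>%
  filter(chapter > 0, !str_detect(text, "^CHAPTER ")) %>%
  count(chapter)

lines_per_chapter
```

The same steps apply unchanged to a real download, since only the gutenberg_id and text columns are used.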

How were the metadata R files generated?

See the data-raw directory for the scripts that generate these datasets. The current datasets were generated from the Project Gutenberg catalog on 19 December 2022.

Do you respect the rules regarding robot access to Project Gutenberg?

Yes! The package respects these rules and complies to the best of our ability. Namely:

Still, this package is not the right way to download the entire Project Gutenberg corpus (or all works in a particular language). For that, follow their recommendation to use wget or set up a mirror. This package is recommended for downloading a single work, or the works of a particular author or on a particular topic.
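For the small, targeted requests described here, a minimal sketch looks like the following. It assumes that gutenberg_get_mirror() returns a mirror URL and that gutenberg_download() accepts it through a `mirror` argument; confirm both in the package help pages. It needs a network connection.

```r
library(gutenbergr)

# Look up a mirror once and reuse it for the whole session
# (assumption: the function and the `mirror` argument behave as documented)
mirror <- gutenberg_get_mirror()

# Keep the request to a short, explicit list of IDs
ids <- c(768, 1260)
books <- gutenberg_download(ids, mirror = mirror, meta_fields = "title")

# Confirm which works were actually retrieved
unique(books$title)
```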

Code of Conduct

Please note that the gutenbergr project is released with a Contributor Code of Conduct. By contributing to this project, you agree to abide by its terms.

dgrtwo/gutenbergr documentation built on Jan. 4, 2024, 2:08 p.m.