
hathiTools


This package allows you to interact with various free data resources made available by the Hathi Trust digital library, including the Hathi Trust Bookworm, a tool similar to the Google Ngram Viewer, and the Hathi Trust Workset Builder 2.0. It also allows you to download and process the Hathi Trust Extracted Features files, which contain per-page word counts and part-of-speech information for over 17 million digitised volumes, including many of those originally digitised by Google for its Google Books project.

Installation

This package is not on CRAN. Install from GitHub as follows:

if(!require(remotes)) { 
  install.packages("remotes") 
}
remotes::install_github("xmarquez/hathiTools")

Downloading word frequencies from the Hathi Trust Bookworm

The simplest use of the package is to download word frequencies from the Hathi Trust Bookworm:

library(hathiTools)
library(tidyverse)
library(slider) ## For rolling averages

result <- query_bookworm(word = c("democracy", "monarchy"), lims = c(1760, 2000),
                         counttype = c("WordsPerMillion"))

result

result %>%
  group_by(word, counttype) %>%
  mutate(rolling_avg = slide_dbl(value, mean, .before = 10, .after = 10)) %>%
  ggplot(aes(x = date_year, color = word)) +
  geom_line(aes(y = value), alpha = 0.3) +
  geom_line(aes(y = rolling_avg)) +
  facet_wrap(~counttype) +
  labs(x = "Year", y = "", subtitle = "10 year rolling average, books published between 1760-2000",
       title = "Frequency of 'democracy' and 'monarchy' in the HathiTrust corpus") +
  theme_bw()

There are more than 18 million texts in the latest version of the Bookworm database.

total_texts <- query_bookworm(counttype = c("TotalTexts"), groups = c("date_year", "languages"),
                              lims = c(0, 2022))

total_texts %>%
  summarise(value = sum(value))

library(ggrepel)

total_texts %>%
  filter(date_year > 1500, date_year < 2011) %>%
  mutate(languages = fct_lump_n(languages, 10, w = value)) %>%
  group_by(date_year, languages) %>%
  summarise(value = sum(value)) %>%
  group_by(languages) %>%
  mutate(label = ifelse(date_year == max(date_year), as.character(languages), NA_character_),
         rolling_avg = slider::slide_dbl(value, mean, .before = 10, .after = 10)) %>%
  ggplot() +
  geom_line(aes(x = date_year, y = rolling_avg, color = languages), show.legend = FALSE) +
  geom_line(aes(x = date_year, y = value, color = languages), show.legend = FALSE, alpha = 0.3) +
  geom_text_repel(aes(x = date_year, y = value, label = label, color = languages), show.legend = FALSE) +
  scale_y_log10() +
  theme_bw() +
  labs(title = "Total texts per language in the HathiTrust bookworm", 
       subtitle = "Log scale. Less common languages grouped as 'other'. 10 year rolling average.", 
       x = "Year", y = "")

See the article "Using the Hathi Bookworm" for more on how to query the bookworm to get word frequencies grouped by particular fields and/or limited to specific categories.

Creating Worksets of Hathi Trust IDs

We can also create worksets of Hathi Trust IDs for volumes in the digital library that meet specific criteria, such as all volumes that mention "liberal" and "democracy" on the same page, or all volumes with Alexis de Tocqueville in the "author" field.

result2 <- workset_builder("liberal democracy", volumes_only = FALSE)

result2

result3 <- workset_builder(name = "Alexis de Tocqueville")

result3

We can browse these volumes interactively in the Hathi Trust website:

browse_htids(result2)

See the article "Topic Models Using Hathi Extracted Features" for more on creating and using worksets for specific analysis purposes.

Downloading Extracted Features files for specific Hathi Trust volumes and caching them to other formats

We can download the Extracted Features file associated with any of these HathiTrust IDs:

tmp <- tempdir() 

extracted_features <- get_hathi_counts(result3$htid[2], dir = tmp)

extracted_features
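The result is a tibble of per-page token counts, which you can manipulate with ordinary tidyverse verbs. For example, here is a quick sketch of the most frequent nouns in the volume (this assumes the token, POS, and count columns of the Extracted Features schema):

extracted_features %>%
  filter(str_detect(POS, "^NN")) %>% ## keep tokens tagged as nouns
  count(token, wt = count, sort = TRUE) %>%
  head(10)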

And we can extract the metadata for any of them as well:

meta <- get_hathi_meta(result3$htid[2], dir = tmp)

meta

Including the page-level metadata for any volume:

page_meta <- get_hathi_page_meta(result3$htid[2], dir = tmp)

page_meta

We can also get the metadata for many or all of these books at the same time:

meta <- get_workset_meta(result3[1:5, ], metadata_dir = tmp)

meta

One can also turn a workset into a list of htids for downloading their extracted features via rsync:

tmp <- tempfile()

htid_to_rsync(result3$htid[1:5], file = tmp)
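The resulting file lists, one per line, the file paths that rsync expects:

readLines(tmp)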

The convenience function rsync_from_hathi() will attempt to do this for you in one command, if you have rsync installed.

tmpdir <- tempdir()
rsync_from_hathi(result3[1:5, ], dir = tmpdir)

You can also cache these files to csv (or another fast-loading format) in one command:

cache_htids(result3[1:5, ], dir = tmpdir)

And read them all into memory in one go:

tocqueville_ef <- read_cached_htids(result3[1:5, ], dir = tmpdir)
tocqueville_ef
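Because all five volumes come back in a single tibble, per-volume summaries are straightforward. For example, here is a sketch of page and word totals per volume (assuming htid, page, and count columns in the cached Extracted Features):

tocqueville_ef %>%
  group_by(htid) %>%
  summarise(pages = n_distinct(page),
            total_words = sum(count))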

See the articles "Topic Models Using Hathi Extracted Features" and "An Example Workflow" for more on rsyncing large numbers of Hathi Trust JSON extracted features files and caching them to other formats for analysis.

It is also possible to download the big "hathifile" to get basic metadata for ALL of the texts in the Hathi Trust digital library; this is useful for selecting random samples.
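For example, once the hathifile has been loaded into a tibble (called hathi_meta below purely for illustration; see the package reference for the downloading and loading helpers), drawing a random sample is a one-liner with dplyr:

set.seed(1234)

## hathi_meta is a placeholder for the loaded hathifile
hathi_sample <- hathi_meta %>%
  slice_sample(n = 100)

hathi_sample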

Credits

This package includes some code from the hathidy and edinburgh repos by @bmschmidt.


