This package allows you to interact with various free data resources made available by the Hathi Trust digital library, including the Hathi Trust Bookworm, a tool similar to the Google Ngram Viewer, and the Hathi Trust Workset Builder 2.0. It also allows you to download and process the Hathi Trust Extracted Features files, which contain per-page word counts and part-of-speech information for over 17 million digitised volumes, including many of those originally digitised by Google for its Google Books project.
This package is not on CRAN. Install from GitHub as follows:
if(!require(remotes)) {
  install.packages("remotes")
}

remotes::install_github("xmarquez/hathiTools")
The simplest use of the package is downloading word frequencies from the Hathi Trust Bookworm:
library(hathiTools)
library(tidyverse)
library(slider) ## For rolling averages

result <- query_bookworm(word = c("democracy", "monarchy"), lims = c(1760, 2000),
                         counttype = c("WordsPerMillion"))

result

result %>%
  group_by(word, counttype) %>%
  mutate(rolling_avg = slide_dbl(value, mean, .before = 10, .after = 10)) %>%
  ggplot(aes(x = date_year, color = word)) +
  geom_line(aes(y = value), alpha = 0.3) +
  geom_line(aes(x = date_year, y = rolling_avg)) +
  facet_wrap(~counttype) +
  labs(x = "Year", y = "",
       subtitle = "10 year rolling average, books published between 1760-2000",
       title = "Frequency of 'democracy' and 'monarchy' in the HathiTrust corpus") +
  theme_bw()
There are more than 18 million texts in the latest version of the Bookworm database.
total_texts <- query_bookworm(counttype = c("TotalTexts"),
                              groups = c("date_year", "languages"),
                              lims = c(0, 2022))

total_texts %>%
  summarise(value = sum(value))

library(ggrepel)

total_texts %>%
  filter(date_year > 1500, date_year < 2011) %>%
  mutate(languages = fct_lump_n(languages, 10, w = value)) %>%
  group_by(date_year, languages) %>%
  summarise(value = sum(value)) %>%
  group_by(languages) %>%
  mutate(label = ifelse(date_year == max(date_year), as.character(languages), NA_character_),
         rolling_avg = slider::slide_dbl(value, mean, .before = 10, .after = 10)) %>%
  ggplot() +
  geom_line(aes(x = date_year, y = rolling_avg, color = languages), show.legend = FALSE) +
  geom_line(aes(x = date_year, y = value, color = languages), show.legend = FALSE, alpha = 0.3) +
  geom_text_repel(aes(x = date_year, y = value, label = label, color = languages), show.legend = FALSE) +
  scale_y_log10() +
  theme_bw() +
  labs(title = "Total texts per language in the HathiTrust bookworm",
       subtitle = "Log scale. Less common languages grouped as 'other'. 10 year rolling average.",
       x = "Year", y = "")
See the article "Using the Hathi Bookworm" for more on how to query the bookworm to get word frequencies grouped by particular fields and/or limited to specific categories.
We can also create worksets of Hathi Trust IDs for volumes in the digital library that meet specific criteria, such as all volumes that mention "liberal" and "democracy" on the same page, or all volumes with Alexis de Tocqueville in the "author" field.
result2 <- workset_builder("liberal democracy", volumes_only = FALSE)

result2
result3 <- workset_builder(name = "Alexis de Tocqueville")

result3
We can browse these volumes interactively on the Hathi Trust website:
browse_htids(result2)
See the article "Topic Models Using Hathi Extracted Features" for more on creating and using worksets for specific analysis purposes.
We can download the Extracted Features file associated with any of these HathiTrust IDs:
tmp <- tempdir()

extracted_features <- get_hathi_counts(result3$htid[2], dir = tmp)

extracted_features
And we can extract the metadata for any of them as well:
meta <- get_hathi_meta(result3$htid[2], dir = tmp)

meta
Including the page-level metadata for any volume:
page_meta <- get_hathi_page_meta(result3$htid[2], dir = tmp)

page_meta
We can also get the metadata for many or all of these books at the same time:
meta <- get_workset_meta(result3[1:5, ], metadata_dir = tmp)

meta
One can also turn a workset into a list of htids for downloading their extracted features via rsync:
tmp <- tempfile()

htid_to_rsync(result3$htid[1:5], file = tmp)
There's a convenience function that will attempt to do this for you in one command, if you have rsync installed.
tmpdir <- tempdir()

rsync_from_hathi(result3[1:5, ], dir = tmpdir)
And you can cache these files to csv or some other fast-loading format, also in one command:
cache_htids(result3[1:5, ], dir = tmpdir)
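If csv is not convenient, you may be able to cache to another format; the cache_format argument and the "rds" value below are assumptions about the current interface, so check ?cache_htids for the formats your installed version supports:

## Cache to rds rather than the default csv. The cache_format argument is
## an assumption here -- see ?cache_htids for the supported formats.
cache_htids(result3[1:5, ], dir = tmpdir, cache_format = "rds")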
And read them all into memory in one go:
tocqueville_ef <- read_cached_htids(result3[1:5, ], dir = tmpdir)

tocqueville_ef
See the articles "Topic Models Using Hathi Extracted Features" and "An Example Workflow" for more on rsyncing large numbers of Hathi Trust JSON extracted features files and caching them to other formats for analysis.
It is also possible to download the big "hathifile" to get basic metadata for ALL of the texts in the Hathi Trust digital library; this is useful for selecting random samples.
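A minimal sketch of that workflow, assuming the helper functions download_hathifile() and load_raw_hathifile() are available in your installed version (check the package documentation for the exact interface):

## Both function names below are assumptions -- consult the package
## documentation before running this. The hathifile is large, so the
## download and load can take a while.
download_hathifile()                 # fetch the latest hathifile
hathifile <- load_raw_hathifile()    # read it into memory as a data frame

## With the full metadata in memory, drawing a random sample of volumes is easy:
set.seed(1234)
sampled_volumes <- hathifile %>%
  dplyr::slice_sample(n = 1000)

sampled_volumes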
This package includes some code from the hathidy and edinburgh repos by @bmschmidt.