query_bookworm: Queries the Hathi Trust Bookworm Server at...
In xmarquez/hathiTools: Access the Hathi Trust Bookworm and Extracted Features Files from R

query_bookworm

R Documentation

Queries the Hathi Trust Bookworm Server at https://bookworm.htrc.illinois.edu/develop/

Description

This function retrieves word frequency data from the Hathi Trust Bookworm Server at https://bookworm.htrc.illinois.edu/develop/, with options to group the results according to various forms of metadata and to limit according to that same metadata. It uses code authored by Ben Schmidt (from https://github.com/bmschmidt/edinburgh/).

Usage

query_bookworm(
  word,
  groups = "date_year",
  ignore_case = TRUE,
  counttype = "WordsPerMillion",
  method = c("data", "returnPossibleFields", "search_results"),
  format = c("json", "csv", "tsv", "feather"),
  lims = c(1920, 2000),
  compare_to,
  as_json = FALSE,
  verbose = FALSE,
  query,
  ...
)

Arguments

`word`	Term to get frequencies for. Can be a vector of strings. It can be left empty if one is interested primarily in statistics about the corpus as a whole.
`groups`	Category to group results by. The default is `date_year`, which groups results by year.
`ignore_case`	Whether to ignore case in the search. Default is `TRUE`.
`counttype`	The type of count to return. The default is words per million, `counttype = "WordsPerMillion"`. According to the API documentation, the following options are available:
  `WordCount`: The number of words matching the terms in `search_limits` for each group. (If no `words` key is specified, the sum of all the words in the book).
  `TextCount`: The number of texts matching the constraints on `search_limits` for each group.
  `WordsPerMillion`: The number of words in the `search_limits` per million words in the broader set. (Words per million, rather than percent, gives a more legible number).
  `TextPercent`: The percentage of texts in the broader group matching the search terms.
  `TotalTexts`: The number of texts matching the constraints on `compare_limits`. (By selecting `TextCount` and `TotalTexts`, you can derive `TextPercent` locally, if you prefer).
  `TotalWords`: The number of words in the larger set.
  `WordsRatio`: Equal to `WordCount/TotalWords`. Useful when `method = "search_results"`.
  `SumWords`: Equal to `TotalWords + WordCount`.
  `TextRatio`: Equal to `TextCount/TotalTexts`.
  `SumTexts`: Equal to `TextCount + TotalTexts`.
It is possible to combine some of these, e.g., `counttype = c("TextCount", "TextPercent")`, but it is not possible to combine `Text-` counts with `Word-` counts in this version of the API.
`method`	Type of results to return. Can be one of:
  `data` (the default): Automatically converted to a proper tibble when possible. According to the API documentation, the JSON is structured as "nested dicts for each grouping in `groups` pointing to an array consisting of the results for each count in `counttype`".
  `returnPossibleFields`: The metadata fields available to use in `groups`. When using this method, all other fields are ignored.
  `search_results`: A list of books and HathiTrust URLs matching a query. Note that `search_results` has a limit of 100 books at the moment, randomly selected.
Notes: When using `search_results`, only the first 100 results are returned, sorted by the percentage of hits in the text. That biases the list towards either texts that use the words a lot or texts that use them rarely. It is possible to use `counttype = "WordsRatio"` to return a list sorted randomly, weighted by the number of times the word appears in each text. The API documentation notes that "this means that a random word from the first text should represent a random usage from the overall sample. The current MySQL-python implementation uses an approximation for this: `LOG(1-RAND())/sum(main.count)` that should mimic a weighted random ordering for most distributions, but in some cases it may not behave as intended."
`format`	Format of returned results. In theory the Bookworm DB should be able to return results as "json", "tsv", "csv", or even "feather"; currently only "json" works (and it's the only supported format here).
`lims`	Min and max year as a two-element numeric vector. Default is `c(1920, 2000)`.
`compare_to`	A word to compare relative frequencies to. Currently this is most useful with `counttype = "WordsRatio"`; this compares the relative frequency of two words.
`as_json`	Whether to return the raw json. Useful for complex queries where the function does not know how to return a tibble, or when you want to use the raw json to produce a different data structure.
`verbose`	If `TRUE`, shows the JSON query once built.
`query`	You can directly pass on a query string (in JSON). This is useful for very complex queries, but there's no checking that the parameters are correct so you may encounter unexpected errors. See https://bookworm-project.github.io/Docs/query_structure.html for more on the query structure. If you use `query`, all other parameters are silently ignored. Use with care!
`...`	Additional parameters passed to the query builder; these would be the fields that `method = "returnPossibleFields"` returns, including fields to group the query by (e.g., `groups = "class"`). At the time of writing, these fields were: `lc_classes`, `lc_subclass`, `fiction_nonfiction`, `genres`, `languages`, `htsource`, `digitization_agent_code`, `mainauthor`, `publisher`, `format`, `is_gov_doc`, `page_count_bin`, `word_count_bin`, `publication_country`, `publication_state`, `publication_place`. These are not documented, and in some cases one must know the exact string to search for; for example, a search with `mainauthor = "Tocqueville"` won't find anything, but a search with `mainauthor = "Tocqueville, Alexis de 1805-1859."` may. These fields should be accessible via `options("hathiTools.bookworm.fields")`.
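The `query` argument above expects a JSON string. As a rough illustration of what such a string might look like, the following base R sketch assembles one by hand. The helpers `json_array()` and `build_query()` and the database name `"my_bookworm_db"` are hypothetical and exist only for this example; the key names (`database`, `method`, `format`, `search_limits`, `counttype`, `groups`) and the `$gte`/`$lte` range operators follow the Bookworm query structure documentation as best understood here, and this is not necessarily the query that `query_bookworm` builds internally.

```r
# Turn a character vector into a JSON array of strings.
# No escaping of special characters is attempted; this is only adequate
# for plain words and field names.
json_array <- function(x) {
  paste0("[", paste0('"', x, '"', collapse = ", "), "]")
}

# Hypothetical helper: assemble a Bookworm-style JSON query string.
# `database` is a placeholder, not the real HathiTrust database name.
build_query <- function(word,
                        groups = "date_year",
                        counttype = "WordsPerMillion",
                        lims = c(1920, 2000),
                        database = "my_bookworm_db") {
  paste0(
    '{"database": "', database, '", ',
    '"method": "data", "format": "json", ',
    '"search_limits": {"word": ', json_array(word), ', ',
    '"date_year": {"$gte": ', lims[1], ', "$lte": ', lims[2], '}}, ',
    '"counttype": ', json_array(counttype), ', ',
    '"groups": ', json_array(groups), '}'
  )
}

q <- build_query(c("democracy", "monarchy"), lims = c(1760, 2000))
cat(q, "\n")
```

A string built this way could in principle be supplied as `query_bookworm(query = q)` once the placeholder database name is replaced with a real one. Since no checking is done on hand-written queries, setting `verbose = TRUE` on an ordinary call and inspecting the JSON it prints is probably a safer starting point.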

Value

A tidy tibble whenever possible, with columns for each grouping parameter, the word (if any), and the counts and counttypes. For `method = "search_results"`, a workset that can be used in `browse_htids` and `get_workset_meta`.
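The `counttype` description notes that `TextPercent` can be derived locally from `TextCount` and `TotalTexts`. A minimal sketch of that arithmetic follows, using a small data frame of made-up numbers in place of a real result. The column names are an assumption: they are taken to match the requested counttypes, which may not be exactly how the returned tibble is laid out.

```r
# Illustrative, made-up counts; this is NOT real Bookworm output.
# Column names are assumed to mirror the requested counttypes.
res <- data.frame(
  date_year  = c(1940, 1941, 1942),
  TextCount  = c(120, 150, 90),
  TotalTexts = c(4000, 5000, 4500)
)

# TextRatio is defined above as TextCount / TotalTexts;
# TextPercent is the same quantity expressed as a percentage.
res$TextRatio   <- res$TextCount / res$TotalTexts
res$TextPercent <- 100 * res$TextRatio

print(res)
```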

Author(s)

Ben Schmidt

Examples


query_bookworm(word = c("democracy", "monarchy"), lims = c(1760, 2000),
  counttype = c("WordsPerMillion", "WordCount"))

query_bookworm(word = "democracy", groups = c("date_year", "lc_classes"),
  lims = c(1900, 2000))

query_bookworm(word = "democracy", groups = "date_year", date_year = "1941",
  lc_classes = "Education", method = "search_results")
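
The `method` argument quotes the API documentation's approximation for a weighted random ordering, `LOG(1-RAND())/sum(main.count)`. The base R sketch below illustrates the general idea behind such a key on made-up hit counts. It runs locally and does not contact the server. The function `weighted_order()` and the counts are hypothetical, and the sort direction (largest key first) is an assumption chosen so that texts with more hits tend to come first; the server-side implementation may differ.

```r
set.seed(42)  # for reproducibility of this illustration only

# Hypothetical per-text hit counts, playing the role of sum(main.count).
hits <- c(text_a = 1, text_b = 10, text_c = 100)

# Hypothetical helper: one weighted random ordering of the texts.
weighted_order <- function(counts) {
  # log(1 - U) is negative; dividing by a larger count pulls the key
  # closer to zero, so heavier texts tend to get larger keys.
  key <- log(1 - runif(length(counts))) / counts
  # Assumption: the largest key (closest to zero) is ranked first.
  names(counts)[order(key, decreasing = TRUE)]
}

# How often does each text come first across many random orderings?
first <- replicate(2000, weighted_order(hits)[1])
round(prop.table(table(first)), 2)
```

Under this scheme the chance that a text is ranked first is roughly proportional to its hit count, which is the sense in which a random word from the first text resembles a random usage from the overall sample.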

xmarquez/hathiTools documentation built on June 2, 2025, 5:12 a.m.
