query_bookworm: Queries the Hathi Trust Bookworm Server at...

View source: R/bookworm.R

query_bookworm    R Documentation

Queries the Hathi Trust Bookworm Server at https://bookworm.htrc.illinois.edu/develop/

Description

This function retrieves word frequency data from the Hathi Trust Bookworm Server at https://bookworm.htrc.illinois.edu/develop/, with options to group the results according to various forms of metadata and to limit according to that same metadata. It uses code authored by Ben Schmidt (from https://github.com/bmschmidt/edinburgh/).

Usage

query_bookworm(
  word,
  groups = "date_year",
  ignore_case = TRUE,
  counttype = "WordsPerMillion",
  method = c("data", "returnPossibleFields", "search_results"),
  format = c("json", "csv", "tsv", "feather"),
  lims = c(1920, 2000),
  compare_to,
  as_json = FALSE,
  verbose = FALSE,
  query,
  ...
)

Arguments

word

Term to get frequencies for. Can be a vector of strings. It can be left empty if one is interested primarily in statistics about the corpus as a whole.

groups

Category to group results by. The default is date_year, which groups results by year.

ignore_case

If TRUE (the default), the search ignores case.

counttype

The default is words per million, counttype = "WordsPerMillion". According to the API documentation, the following options are available:

WordCount: The number of words matching the terms in search_limits for each group. (If no words key is specified, the sum of all the words in the book).

TextCount: The number of texts matching the constraints on search_limits for each group.

WordsPerMillion: The number of words in the search_limits per million words in the broader set. (Words per million, rather than percent, gives a more legible number).

TextPercent: The percentage of texts in the broader group matching the search terms.

TotalTexts: The number of texts matching the constraints on compare_limits. (By selecting TextCount and TotalTexts, you can derive TextPercent locally, if you prefer).

TotalWords: The number of words in the larger set.

WordsRatio: equal to WordCount/TotalWords. Useful when method = "search_results".

SumWords: equal to TotalWords + WordCount.

TextRatio: equal to TextCount/TotalTexts.

SumTexts: equal to TextCount + TotalTexts.

It is possible to combine some of these, e.g., counttype = c("TextCount", "TextPercent"). However, it is not possible to combine Text- counts with Word- counts in this version of the API.
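The relationships among these count types can be checked locally. The following is a minimal sketch in base R, using a small made-up data frame (the numbers are illustrative, not real Bookworm output) and assuming that WordsPerMillion and TextPercent are the ratios scaled by one million and one hundred respectively, as the definitions above suggest. Because Text- and Word- counts cannot be combined in one request, the columns here stand in for the results of two separate queries.

```r
# Illustrative counts only; these are not real Bookworm results.
counts <- data.frame(
  date_year  = c(1920, 1921),
  WordCount  = c(150, 300),
  TotalWords = c(1e6, 2e6),
  TextCount  = c(5, 10),
  TotalTexts = c(50, 40)
)

# Derive the ratio-style measures from the raw counts, following the
# definitions listed for counttype.
derived <- transform(
  counts,
  WordsRatio      = WordCount / TotalWords,
  WordsPerMillion = 1e6 * WordCount / TotalWords,
  SumWords        = TotalWords + WordCount,
  TextRatio       = TextCount / TotalTexts,
  TextPercent     = 100 * TextCount / TotalTexts,
  SumTexts        = TextCount + TotalTexts
)

print(derived)
```

Requesting only the raw counts and deriving the rest locally keeps each request within a single family of count types.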

method

Type of results to return. One of data (the default), returnPossibleFields (the metadata fields available to use in groups), or search_results (a list of books and HathiTrust URLs matching a query). With data, the result is automatically converted to a tibble when possible; according to the API documentation, the underlying JSON is structured as "nested dicts for each grouping in groups pointing to an array consisting of the results for each count in counttype". Note that search_results currently returns at most 100 books. Notes:

  • When using returnPossibleFields all other fields are ignored.

  • When using search_results only the first 100 results are returned, sorted by the percentage of hits in the text. That biases the list towards either texts that use the words a lot or texts that use them rarely. It is possible to use counttype = "WordsRatio" to return a list sorted randomly, weighted by the number of times the word appears in each text. The API documentation notes that "this means that a random word from the first text should represent a random usage from the overall sample. The current MySQL-python implementation uses an approximation for this: LOG(1-RAND())/sum(main.count) that should mimic a weighted random ordering for most distributions, but in some cases it may not behave as intended."
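To make the quoted approximation more concrete, here is a small base R sketch of a weighted random ordering built on the same expression, log(1 - U) / count. It is not the server's code: the function name weighted_order, the counts, and the sort direction (largest key first) are assumptions made for illustration, since the source does not state how the server orders the keys.

```r
# Weighted random ordering: draw one random key per text and sort by it.
# `counts` plays the role of sum(main.count); the values are made up.
weighted_order <- function(counts) {
  keys <- log(1 - runif(length(counts))) / counts
  # Assumption: larger (less negative) keys come first, so texts with
  # more hits tend to appear earlier.
  order(keys, decreasing = TRUE)
}

set.seed(42)
counts <- c(a = 1, b = 1, c = 98)

# How often does each text land in first place over many draws?
first <- replicate(2000, names(counts)[weighted_order(counts)[1]])
first_share <- table(factor(first, levels = names(counts))) / length(first)
print(first_share)
```

Under this ordering, a text is placed first with probability proportional to its count, so the text holding 98 of the 100 hits should come first in the large majority of draws.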

format

Format of returned results. In theory the Bookworm DB should be able to return results as "json", "tsv", "csv", or even "feather"; currently only "json" works (and it's the only supported format here).

lims

Min and max year as a two-element numeric vector. Default is c(1920, 2000).

compare_to

A word to compare relative frequencies to. Currently this is most useful with counttype = "WordsRatio"; this compares the relative frequency of two words.

as_json

Whether to return the raw JSON. Useful for complex queries where the function does not know how to return a tibble, or when you want to use the raw JSON to produce a different data structure.

verbose

If TRUE, shows the JSON query once built.

query

A query string (in JSON) to pass directly to the server. This is useful for very complex queries, but the parameters are not checked for correctness, so you may encounter unexpected errors. See https://bookworm-project.github.io/Docs/query_structure.html for more on the query structure. If you use query, all other parameters are silently ignored. Use with care!
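As a rough illustration, a query string can be assembled by hand in base R. The key names below (database, search_limits, and the "$gte"/"$lte" range operators) are assumptions drawn from the general Bookworm query structure and should be checked against the linked documentation; "<database-name>" is a placeholder, not a real database name.

```r
# Hand-built JSON query string. Key names are assumed from the Bookworm
# query-structure documentation; "<database-name>" is a placeholder.
json_query <- paste0(
  '{"database": "<database-name>", ',
  '"method": "data", ',
  '"format": "json", ',
  '"groups": ["date_year"], ',
  '"counttype": ["WordsPerMillion"], ',
  '"search_limits": {"word": ["democracy"], ',
  '"date_year": {"$gte": 1920, "$lte": 2000}}}'
)
cat(json_query, "\n")

# Not run here (needs the package and network access):
# query_bookworm(query = json_query)
```

Setting verbose = TRUE on an ordinary call shows the JSON that the function builds, which is likely a safer starting point for a custom query than writing one from scratch.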

...

Additional parameters passed to the query builder; these would be the fields that method = "returnPossibleFields" returns, including fields to group the query by (e.g., groups = "class"). At the time of writing, these fields were: lc_classes, lc_subclass, fiction_nonfiction, genres, languages, htsource, digitization_agent_code, mainauthor, publisher, format, is_gov_doc, page_count_bin, word_count_bin, publication_country, publication_state, publication_place. These are not documented, and in some cases one must know the exact string to search for; for example, a search with mainauthor = "Tocqueville" won't find anything, but a search with mainauthor = "Tocqueville, Alexis de 1805-1859." may. These fields should be accessible via options("hathiTools.bookworm.fields").
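One way to find the exact string a field expects is to look through values the server has already returned, for instance the mainauthor column of a query grouped by that field. The base R sketch below assumes such a column is available and uses a hypothetical vector in its place; only the Tocqueville string comes from the text above, and the other entries are invented for illustration.

```r
# Hypothetical author strings, standing in for the values of a
# mainauthor column returned by a query grouped on that field.
authors <- c(
  "Tocqueville, Alexis de 1805-1859.",
  "Mill, John Stuart 1806-1873.",
  "Bryce, James Bryce, Viscount 1838-1922."
)

# An exact comparison against the short name finds nothing...
exact_hits <- authors[authors == "Tocqueville"]

# ...but a pattern match recovers the full string to pass as mainauthor.
pattern_hits <- grep("Tocqueville", authors, value = TRUE, fixed = TRUE)
print(pattern_hits)
```

The recovered string can then be supplied unchanged, e.g. mainauthor = pattern_hits, in a follow-up call.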

Value

A tidy tibble whenever possible, with columns for each grouping parameter, the word (if any), and the counts and counttypes. For method = "search_results", a workset that can be used in browse_htids and get_workset_meta.

Author(s)

Ben Schmidt

Examples


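# Yearly frequencies of two terms, with two word-based count types
query_bookworm(word = c("democracy", "monarchy"), lims = c(1760, 2000),
  counttype = c("WordsPerMillion", "WordCount"))

# Frequencies grouped by both year and the lc_classes field
query_bookworm(word = "democracy", groups = c("date_year", "lc_classes"),
  lims = c(1900, 2000))

# Books from 1941 with lc_classes "Education" that match the term
query_bookworm(word = "democracy", groups = "date_year", date_year = "1941",
  lc_classes = "Education", method = "search_results")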


xmarquez/hathiTools documentation built on June 2, 2025, 5:12 a.m.