cache_htids: Caches downloaded JSON Extracted Features files to another format

View source: R/cache_tools.R

cache_htids    R Documentation

Caches downloaded JSON Extracted Features files to another format

Description

This function takes a set of Hathi Trust IDs (usually already downloaded via rsync_from_hathi) and caches the JSON Extracted Features files to another format (e.g., csv, rds, or parquet) alongside them. A typical workflow with this package involves selecting an appropriate set of Hathi Trust IDs (via workset_builder), downloading their Extracted Features files to your local machine (via rsync_from_hathi), caching these slow-to-load JSON files to a faster-loading format using cache_htids, and then using read_cached_htids to read them into a single data frame or arrow Dataset for further work.
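This workflow can be sketched as follows (the search term and directory are illustrative, and the calls assume the hathiTools package is installed and rsync is available on your system):

```r
library(hathiTools)

# 1. Select a set of Hathi Trust IDs (the query is illustrative)
workset <- workset_builder("democracy")

# 2. Download their JSON Extracted Features files via rsync
rsync_from_hathi(workset, dir = "hathi-ef")

# 3. Cache the slow-to-load JSON files to a faster-loading format
cache_htids(workset, dir = "hathi-ef", cache_format = "csv.gz")

# 4. Read the cached files into a single data frame for analysis
efs <- read_cached_htids(workset, dir = "hathi-ef")
```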

Usage

cache_htids(
  htids,
  dir = getOption("hathiTools.ef.dir"),
  cache_type = c("ef", "meta", "pagemeta"),
  cache_format = getOption("hathiTools.cacheformat"),
  keep_json = TRUE,
  attempt_rsync = FALSE,
  attempt_parallel = FALSE
)

Arguments

htids

A character vector of Hathi Trust ids, a workset created with workset_builder, or a data frame with a column named "htid" containing the Hathi Trust ids that require caching. If the JSON Extracted Features files for these htids have not been downloaded via rsync_from_hathi or get_hathi_counts to dir, nothing will be cached (unless attempt_rsync is TRUE).

dir

The directory where the downloaded Extracted Features files are to be found. Defaults to getOption("hathiTools.ef.dir"), which is just "hathi-ef" on load.

cache_type

Type of information to cache. The default is c("ef", "meta", "pagemeta"), which refers to the extracted features, the volume metadata, and the page metadata. Omitting one of these caches only the rest (e.g., cache_type = "ef" caches only the EF files, not their associated volume metadata or page metadata).

cache_format

File format of the cache for Extracted Features files. Defaults to getOption("hathiTools.cacheformat"), which is "csv.gz" on load. Allowed cache formats are: compressed csv (the default); "none" (no local caching; only the downloaded JSON files are kept); "rds"; "feather" and "parquet" (suitable for use with arrow; require the arrow package to be installed); and "text2vec.csv" (a csv suitable for use with the text2vec package).
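To change the default cache format for the whole session rather than passing cache_format on each call, the option can be set directly (a minimal sketch; "parquet" requires the arrow package to be installed):

```r
# Use parquet as the session-wide default cache format
options(hathiTools.cacheformat = "parquet")

# Subsequent calls pick up the new default
cache_htids(htids)
```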

keep_json

Whether to keep the downloaded JSON files. Default is TRUE; if FALSE, only the local cached files (e.g., the csv files) are kept and the associated JSON files are deleted.

attempt_rsync

If TRUE, and some JSON EF files are not found in dir, the function will call rsync_from_hathi to attempt to download these first.

attempt_parallel

Default is FALSE. If TRUE, the function will attempt to use the furrr package to cache files in parallel. You will need to call future::plan() beforehand to set the parallel strategy to be used; plan(multisession) usually works fine.
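A minimal sketch of parallel caching (assumes the furrr and future packages are installed, and that the JSON files for htids have already been downloaded to the default directory):

```r
library(future)

# Set the parallel strategy before calling cache_htids
plan(multisession)

cache_htids(htids, attempt_parallel = TRUE)

# Return to sequential evaluation when done
plan(sequential)
```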

Value

A tibble with the paths of the cached files and an indicator of whether each htid was successfully cached.

Examples


htids <- c("mdp.39015008706338", "mdp.39015058109706")
dir <- tempdir()

# Caches nothing (nothing has been downloaded to `dir`):

cache_htids(htids, dir = dir, cache_type = "ef")

# Tries to rsync first, then caches

cache_htids(htids, dir = dir, cache_type = "ef", attempt_rsync = TRUE)
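
# Once cached, the files can be read back with read_cached_htids
# (a sketch; the cache_type argument here is an assumption mirroring
# cache_htids):

efs <- read_cached_htids(htids, dir = dir, cache_type = "ef")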



xmarquez/hathiTools documentation built on June 2, 2025, 5:12 a.m.