cache_htids: Caches downloaded JSON Extracted Features files to another format

View source: R/cache_tools.R

cache_htids    R Documentation

Caches downloaded JSON Extracted Features files to another format

Description

This function takes a set of Hathi Trust IDs (usually already downloaded via rsync_from_hathi) and caches the JSON Extracted Features files to another format (e.g., csv, rds, or parquet) alongside them. A typical workflow with this package involves selecting an appropriate set of Hathi Trust IDs (via workset_builder), downloading their Extracted Features files to your local machine (via rsync_from_hathi), caching these slow-to-load JSON files to a faster-loading format using cache_htids, and then using read_cached_htids to read them into a single data frame or arrow Dataset for further work.
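This workflow can be sketched as follows (the search term and directory are illustrative, and the calls assume the hathiTools package is installed and rsync is available on your system):

```r
library(hathiTools)

# 1. Select a set of Hathi Trust IDs (the query is illustrative)
workset <- workset_builder("democracy")

# 2. Download their JSON Extracted Features files via rsync
rsync_from_hathi(workset, dir = "hathi-ef")

# 3. Cache the slow-to-load JSON files to a faster-loading format
cache_htids(workset, dir = "hathi-ef", cache_format = "csv.gz")

# 4. Read the cached files into a single data frame for analysis
efs <- read_cached_htids(workset, dir = "hathi-ef")
```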

Usage

cache_htids(
  htids,
  dir = getOption("hathiTools.ef.dir"),
  cache_type = c("ef", "meta", "pagemeta"),
  cache_format = getOption("hathiTools.cacheformat"),
  keep_json = TRUE,
  attempt_rsync = FALSE,
  attempt_parallel = FALSE
)

Arguments

htids

A character vector of Hathi Trust ids, a workset created with workset_builder, or a data frame with a column named "htid" containing the Hathi Trust ids that require caching. If the JSON Extracted Features files for these htids have not been downloaded via rsync_from_hathi or get_hathi_counts to dir, nothing will be cached (unless attempt_rsync is TRUE).

dir

The directory where the downloaded Extracted Features files are to be found. Defaults to getOption("hathiTools.ef.dir"), which is just "hathi-ef" on load.

cache_type

Type of information to cache. The default is c("ef", "meta", "pagemeta"), which refers to the extracted features, the volume metadata, and the page metadata. Omitting one of these caches only the rest (e.g., cache_type = "ef" caches only the EF files, not their associated volume metadata or page metadata).

cache_format

File format of the cache for Extracted Features files. Defaults to getOption("hathiTools.cacheformat"), which is "csv.gz" on load. Allowed cache formats are: compressed csv (the default); "none" (no local caching; only the downloaded JSON files are kept); "rds"; "feather" and "parquet" (suitable for use with arrow; require the arrow package to be installed); and "text2vec.csv" (a csv suitable for use with the text2vec package).
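To change the default cache format for the whole session rather than passing cache_format on each call, the option can be set directly (a minimal sketch; "parquet" requires the arrow package to be installed):

```r
# Use parquet as the session-wide default cache format
options(hathiTools.cacheformat = "parquet")

# Subsequent calls pick up the new default
cache_htids(htids)
```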

keep_json

Whether to keep the downloaded JSON files. Default is TRUE; if FALSE, only the local cached files (e.g., the csv files) are kept and the associated JSON files are deleted.

attempt_rsync

If TRUE, and some JSON EF files are not found in dir, the function will call rsync_from_hathi to attempt to download these first.

attempt_parallel

Default is FALSE. If TRUE, the function will attempt to use the furrr package to cache files in parallel. You will need to call future::plan() beforehand to set the parallel strategy to be used; plan(multisession) usually works fine.
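A minimal sketch of parallel caching (assumes the furrr and future packages are installed, and that the JSON files for htids have already been downloaded to the default directory):

```r
library(future)

# Set the parallel strategy before calling cache_htids
plan(multisession)

cache_htids(htids, attempt_parallel = TRUE)

# Return to sequential evaluation when done
plan(sequential)
```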

Value

A tibble with the paths of the cached files and an indicator of whether each htid was successfully cached.

Examples


htids <- c("mdp.39015008706338", "mdp.39015058109706")
dir <- tempdir()

# Caches nothing (nothing has been downloaded to `dir`):

cache_htids(htids, dir = dir, cache_type = "ef")

# Tries to rsync first, then caches

cache_htids(htids, dir = dir, cache_type = "ef", attempt_rsync = TRUE)
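
# Once cached, the files can be read back with read_cached_htids
# (a sketch; the cache_type argument here is an assumption mirroring
# cache_htids):

efs <- read_cached_htids(htids, dir = dir, cache_type = "ef")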



xmarquez/hathiTools documentation built on June 2, 2025, 5:12 a.m.