R/fetch_glean_vd_chunked.R

Defines functions fetch_glean_vd_chunked

Documented in fetch_glean_vd_chunked

#' @title Fetch and parse multiple VecDyn datasets by ID in chunks
#' @description Retrieve and parse VecDyn datasets specified by their dataset IDs in batches.
#'
#' This is not usually necessary (generally you just need [fetch_vd()]), but it allows data that is no longer in use to be released from memory. If you would like more control over extraction or parsing, it is best to wrap [fetch_vd()] and [glean_vd()] in your own chunker instead.
#' @author Francis Windram
#'
#' @param ids a numeric vector of IDs (preferably in an `ohvbd.ids` object) indicating the particular datasets to download.
#' @param chunksize an integer defining the size of chunks to retrieve in one iteration.
#' @param cols a character vector of columns to extract from the dataset.
#' @param returnunique whether to return only the unique rows within each dataset according to the filtered columns.
#' @param rate maximum number of calls to the API per second.
#' @param connections number of simultaneous connections to the server. Maximum 8. **Do not enable unless you really need to** as this hits the server significantly harder than usual.
#' @param basereq an [httr2 request][httr2::request()] object, as generated by [vb_basereq()] (the default). If `NA`, uses the default request.
#'
#' @return An `ohvbd.data.frame` containing the requested data.
#'
#' @examplesIf interactive()
#' fetch_glean_vd_chunked(c(423, 424, 425), chunksize = 2, rate = 5)
#'
#' @concept vecdyn
#'
#' @export
#'

fetch_glean_vd_chunked <- function(
  ids,
  chunksize = 20,
  cols = NULL,
  returnunique = FALSE,
  rate = 5,
  connections = 2,
  basereq = vb_basereq()
) {
  check_provenance(ids, "vd", altfunc = "fetch_glean", altfunc_suffix = "chunked")

  # Fetch and glean VecDyn (vd) data by ID in chunks (to save memory)
  # Split the IDs into chunks of at most `chunksize` elements
  breakpoints <- seq(0, length(ids) + (chunksize - 1), by = chunksize)
  chunks <- cut(seq_along(ids), breaks = breakpoints, labels = FALSE)
  chunklets <- split(ids, chunks)

  # Apply the fetch/glean pipeline to each chunk in turn
  out_list <- chunklets |>
    lapply(\(idchunk) {
      fetch_vd(
        idchunk,
        rate = rate,
        connections = connections,
        basereq = basereq
      ) |>
        glean_vd(cols = cols, returnunique = returnunique)
    })

  # Bind the per-chunk results row-wise; fill = TRUE pads columns missing from a chunk with NA
  out_df <- suppressWarnings(data.table::rbindlist(out_list, fill = TRUE))

  out_final <- as.data.frame(out_df)
  out_final <- new_ohvbd.data.frame(df = out_final, db = "vd")

  return(out_final)
}
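
The chunk-splitting step in the function above can be tried on its own, without touching the API. The following is a minimal sketch in base R; `split_into_chunks` is a hypothetical helper name introduced here for illustration and is not part of ohvbd.

```r
# Hypothetical helper (not part of ohvbd) mirroring the splitting logic
# used inside fetch_glean_vd_chunked().
split_into_chunks <- function(ids, chunksize = 20) {
  # Breakpoints at 0, chunksize, 2 * chunksize, ... reaching at least length(ids)
  breakpoints <- seq(0, length(ids) + (chunksize - 1), by = chunksize)
  # Assign each position to the interval (lower, upper] it falls into
  chunks <- cut(seq_along(ids), breaks = breakpoints, labels = FALSE)
  # Group the IDs by chunk index; the list names are "1", "2", ...
  split(ids, chunks)
}

# Three IDs with chunksize = 2: chunk "1" holds 423 and 424, chunk "2" holds 425
chunklets <- split_into_chunks(c(423, 424, 425), chunksize = 2)
lengths(chunklets)
```

Each element of `chunklets` is what the function passes to `fetch_vd()` and then `glean_vd()`. Note that this sketch, like the original logic, assumes at least one ID is supplied.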

ohvbd documentation built on March 10, 2026, 1:07 a.m.