source("R/setup.R")$value
tmp_cache_dir <- file.path(tempdir(), "STATcubeR/open_data")
dir.create(tmp_cache_dir, recursive = TRUE)
od_cache_dir(tmp_cache_dir)

This article explains how and where r STATcubeR caches resources from r ogd_portal in the local file system. Understanding this behavior will allow you to enable persistent caches and directly use the cached resources.

Overview

By default, r STATcubeR caches all accessed resources from r ogd_portal in the temporary directory of the current R session.

od_cache_dir()

Let's examine for example what happens when the data from the structure of earnings survey (SES) is requested.

earnings <- od_table("OGD_veste309_Veste309_1")

First STATcubeR will grab a json with metadata about this dataset from https://data.statistik.gv.at/ogd/json?dataset=OGD_veste309_Veste309_1 and check which resources belong to it. For any resource, the attributes name and last_modified are extracted from the json. They are also included in the od_table object under $resources.

earnings$resources

last_modified tells us when the resource was changed on the fileserver. If a resource does not exist in the cache or if the last modified entry in the json is newer than the cached file, it will be downloaded from the server. Otherwise, the cached version is reused.

Access and Updates

Cached files can be accessed with od_cache_file(). If the specified file exists in the cache, a path to the file will be returned. Otherwise, the file is downloaded to the cache and then the path is returned. The files use the same naming conventions as the open data fileserver.

od_cache_file("OGD_veste309_Veste309_1")
od_cache_file("OGD_veste309_Veste309_1", "C-A11-0")

To read files from the cache as data.frames, use od_resource() with same parameters as in od_cache_file(). This will apply a special parser to the dataset which drops unneeded columns and normalizes column names.

od_resource("OGD_veste309_Veste309_1", "C-A11-0")

The parser behaves differently for header files, data files and fields. Json files can be accessed with od_json().

json <- od_json("OGD_veste309_Veste309_1")
unlist(json$tags)

Clearing and Changing

od_cache_clear(id) can be used to clear the cache from all files belonging to the passed dataset id. We saw that earnings$resources contains 7 rows, therefore 7 files will be deleted during cleanup.

od_cache_clear("OGD_veste309_Veste309_1")

If you want to use a persistent directory like ~/.cache/STATcubeR/open_data/ for caching, the directory can be changed with od_cache_dir(new).

od_cache_dir("~/.cache/STATcubeR/open_data/")

The resources field

Let's go back to the $resources field of earnings.

earnings$resources

We already looked at name and last_modified. The remaining columns can be interpreted as follows

What's in the cache?

od_cache_summary() will give an overview about all files that are available in the cache directory. The returned table contains one row for every dataset.

We can get a clear picture on how much disk space is used for each dataset.

od_cache_summary()

Note that od_cache_summary() only gathers information from the local file system based on filenames, file.mtime() and file.size().

od_cache_dir(tmp_cache_dir)

Download history

To get a history of all files that have been downloaded from the server, use od_downloads(). For each file, a timestamp for the download is recorded as well as the download time in milliseconds.

od_downloads()
unlink(tmp_cache_dir)


statistikat/STATcubeR documentation built on Dec. 3, 2024, 8:04 p.m.