SCDB: Basic principles"

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)
library(SCDB)
suppressPackageStartupMessages(library(SCDB))

# Setup conn to be used for examples
conn <- get_connection()

# Use a wrapper for update_snapshot which uses LoggerNull to suppress all logging
if (!"update_snapshot" %in% ls(envir = globalenv())) {
  update_snapshot <- function(...) return(SCDB::update_snapshot(logger = LoggerNull$new(), ...))
}

# Setup example_data table in conn
example_data <-
  dplyr::copy_to(conn,
                 dplyr::transmute(datasets::mtcars, car = rownames(mtcars), hp),
                 name = "example_data",
                 overwrite = TRUE)

The basic principle of the SCDB package is to enable the user to easily implement and maintain a database of time-versioned data.

In practice, this is done by labeling each record in the data with three additional fields:

This strategy of time versioning is often called "type 2" history [@Kimball2013].

Note that identical records may be removed and introduced more than once; for example, in a table of names and addresses, a person may change their address (or name) back to a previous value.

The SCDB package provides the function update_snapshot to handle the insertion and deactivation of records using this strategy. It further includes several functions to improve the Quality of life for working with database data.

A simple example of usage is shown below.
For this example, we use a temporary, on-disk SQLite database. Note that get_connection tries to establish connection using DBI::dbConnect with as few additional arguments as possible. Different drivers may require authentication which can be read from a configuration file.^[ In the context of the SCDB package, this is most notably RPostgres::Postgres(), which may read from a .pgpass file. See also the PostgreSQL documentation. ]

Our example data is datasets::mtcars reduced to only two columns: row names converted to a column car, and hp

if (!exists("conn")) conn <- get_connection()

example_data <- dplyr::tbl(conn, DBI::Id(table = "example_data"))
example_data

Imagine on Day 1, in this case January 1st, 2020, our currently available data is the first three records of the example_data. We then store this data in a table mtcars:

data <- head(example_data, 3)

update_snapshot(.data = data,
                conn = conn,
                db_table = "mtcars", # the name of the DB table to store the data in
                timestamp = as.POSIXct("2020-01-01 11:00:00"))

We can then access out data using the get_table function, and include information on data validity period using include_slice_info = TRUE:

get_table(conn, "mtcars")

get_table(conn, "mtcars", include_slice_info = TRUE)

Note that where e.g. dplyr::tbl requires a more exact specification of the table identity (tbl(conn, DBI::Id(table = "mtcars"))), get_table will parse any character to a DBI::Id object input using SCDB::id.

The following day, the current data is now the first five rows of our example data. We then store this data in the database using update_snapshot:

# Let's say that the next day, our data set is now the first 5 of our example data
data <- head(example_data, 5)

update_snapshot(.data = data,
                conn = conn,
                db_table = "mtcars", # the name of the DB table to store the data in
                timestamp = as.POSIXct("2020-01-02 12:00:00"))

We can again use the get_table function to see the latest available data, including time-keeping with include_slice_info = TRUE:

get_table(conn, "mtcars")

get_table(conn, "mtcars", include_slice_info = TRUE)

Since our data is time-versioned, we can recover the data from the day before

get_table(conn, "mtcars", slice_ts = "2020-01-01 11:00:00")

On day 3, we imagine that we have the same 5 records, but one of them is altered

data <- head(example_data, 5) |>
  dplyr::mutate(hp = ifelse(car == "Mazda RX4", hp / 2, hp))

update_snapshot(.data = data,
                conn = conn,
                db_table = "mtcars", # the name of the DB table to store the data in
                timestamp = as.POSIXct("2020-01-03 10:00:00"))

We can again access our data using the get_table function and see that the currently available data (with the changed hp value for Mazda RX4)

get_table(conn, "mtcars")

Finally, using slice_ts = NULL, the full history (and time-keeping information) is returned:

get_table(conn, "mtcars", slice_ts = NULL)

References



Try the SCDB package in your browser

Any scripts or data that you put into this service are public.

SCDB documentation built on Oct. 4, 2024, 1:09 a.m.