knitr::opts_chunk$set( collapse = TRUE, comment = "#>" )
library(SCDB)
suppressPackageStartupMessages(library(SCDB)) # Setup conn to be used for examples conn <- get_connection() # Use a wrapper for update_snapshot which uses LoggerNull to suppress all logging if (!"update_snapshot" %in% ls(envir = globalenv())) { update_snapshot <- function(...) return(SCDB::update_snapshot(logger = LoggerNull$new(), ...)) } # Setup example_data table in conn example_data <- dplyr::copy_to(conn, dplyr::transmute(datasets::mtcars, car = rownames(mtcars), hp), name = "example_data", overwrite = TRUE)
The basic principle of the SCDB package is to enable the user to easily implement and maintain a database of time-versioned data.
In practice, this is done by labeling each record in the data with three additional fields:
This strategy of time versioning is often called "type 2" history [@Kimball2013].
Note that identical records may be removed and introduced more than once; for example, in a table of names and addresses, a person may change their address (or name) back to a previous value.
The SCDB package provides the function update_snapshot
to handle the insertion and deactivation of records using this strategy.
It further includes several functions to improve the Quality of life for working with database data.
A simple example of usage is shown below.
For this example, we use a temporary, on-disk SQLite database.
Note that get_connection
tries to establish connection using DBI::dbConnect
with as few additional arguments as possible.
Different drivers may require authentication which can be read from a configuration file.^[
In the context of the SCDB package, this is most notably RPostgres::Postgres()
, which may read from a .pgpass
file.
See also the PostgreSQL documentation.
]
Our example data is datasets::mtcars
reduced to only two columns: row names converted to a column car
, and hp
if (!exists("conn")) conn <- get_connection() example_data <- dplyr::tbl(conn, DBI::Id(table = "example_data")) example_data
Imagine on Day 1, in this case January 1st, 2020, our currently available data is the first three records of the example_data.
We then store this data in a table mtcars
:
data <- head(example_data, 3) update_snapshot(.data = data, conn = conn, db_table = "mtcars", # the name of the DB table to store the data in timestamp = as.POSIXct("2020-01-01 11:00:00"))
We can then access out data using the get_table
function, and include information on data validity period using include_slice_info = TRUE
:
get_table(conn, "mtcars") get_table(conn, "mtcars", include_slice_info = TRUE)
Note that where e.g. dplyr::tbl
requires a more exact specification of the table identity (tbl(conn, DBI::Id(table = "mtcars"))
), get_table
will parse any character to a DBI::Id
object input using SCDB::id
.
The following day, the current data is now the first five rows of our example data. We then store this data in the database using update_snapshot:
# Let's say that the next day, our data set is now the first 5 of our example data data <- head(example_data, 5) update_snapshot(.data = data, conn = conn, db_table = "mtcars", # the name of the DB table to store the data in timestamp = as.POSIXct("2020-01-02 12:00:00"))
We can again use the get_table
function to see the latest available data, including time-keeping with include_slice_info = TRUE
:
get_table(conn, "mtcars") get_table(conn, "mtcars", include_slice_info = TRUE)
Since our data is time-versioned, we can recover the data from the day before
get_table(conn, "mtcars", slice_ts = "2020-01-01 11:00:00")
On day 3, we imagine that we have the same 5 records, but one of them is altered
data <- head(example_data, 5) |> dplyr::mutate(hp = ifelse(car == "Mazda RX4", hp / 2, hp)) update_snapshot(.data = data, conn = conn, db_table = "mtcars", # the name of the DB table to store the data in timestamp = as.POSIXct("2020-01-03 10:00:00"))
We can again access our data using the get_table
function and see that the currently
available data (with the changed hp value for Mazda RX4)
get_table(conn, "mtcars")
Finally, using slice_ts = NULL
, the full history (and time-keeping information) is returned:
get_table(conn, "mtcars", slice_ts = NULL)
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.