Data science projects in commercial companies often experience a challenge arising from evolving data sources[^glossary-data-source]. As the project progresses, new signals and information sources are added incrementally. In practice, when the data source changes, it creates a need to change the application source code.
With no design up front, some analytic modules[^glossary-analytic-module], such as a dashboard or a machine learning model, unwittingly become dependent on the data source. In this case, accommodating the evolving data source is not simply a matter of changing the code related to the data source. Rather, preserving the rest of the existing application in a working condition involves further code changes in distant elements of the application.
An alternative way of dealing with evolving data sources is to introduce a small design up front. Such a design lets data scientists manage the source code dependencies throughout the project life cycle.
This post suggests a design that (1) separates data sources from analytic applications and (2) restricts analytic modules from knowing about the data sources.
While the challenge of evolving data sources is programming-language agnostic, this post demonstrates an implementation of the suggested design in R.
This post features a confabulated story that covers real challenges.
We work in an Agile environment, and we cover iteration zero, going from nothing to something.
Our client is Eual Cheatam, the Mercedes dealership manager.
knitr::include_graphics('https://i.imgur.com/IcwSkWa.png')
Eual has been working at the Mercedes dealership as a car salesman for the last 7 years. Today, Eual serves as the dealership manager. His main duty is to ensure a profitable, compliant, and effective dealership.
knitr::include_graphics('https://i.imgur.com/CMbZ8h7.png')
For the car salespeople who work at the Mercedes dealership, Factcedes™ is a weekly email service that provides a succinct and readable fact sheet with popular Q&A about the dealership's vehicles. Unlike water-cooler talk, our product has been carefully formulated, evaluated by veteran salespeople, and is distributed regularly.
With the customer's blessing and a firm handshake, we initiate product development.
knitr::include_graphics('https://i.imgur.com/TAcOHZO.png')
At this stage, the data scientists have a data-driven product in mind, but they are facing two challenges:
The first challenge occurs because the data scientists don't know what information they need. Instead, they have assumptions about what information could be relevant. As the project progresses, exploration will lead them to new findings and new data sources.
The second challenge occurs because of external factors, such as personnel or policy issues regarding the data. The people responsible for the database could be unavailable. Another impediment could be a data security policy or something of that sort.
In any case, datasets are hard to obtain.
To mitigate what the real world imposes on us, we need a system design that can handle changes.
In iteration zero we are going to move from no product to a product. Once we have something, we can show it to the client and receive feedback at an early stage of development.
In many cases, including the case at hand, the analytic module can be satisfied with a tidy data table as its input.
A tidy data table [@wickham2016r, ch. 12] follows a consistent tabular format where each variable is a column and each observation is a row.
set.seed(1238)
n <- 6
tibble::tibble(
  uid = head(1:26, n),
  x1 = head(letters, n),
  x2 = head(LETTERS, n),
  y = rpois(n, lambda = 5)
)
The data scientist wants the tidy data table to support all of the necessary variables required by the analytic module, but does not know what all those variables are. However, the data scientist does know the basic intent of the analytic app.
Start by making an assumption triplet about what information the user needs to see.
An assumption triplet comprises hypotheses about three variable types that should be in the tidy data table: the observation unique identifier (UID), the target variable, and the salient features.
In our example, the UID is the car model, the target variable is the price, and the salient features are the gear count and fuel economy (mpg).
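To make the triplet concrete, here is a minimal sketch of the tidy data table we are aiming for. The column names follow the assumption triplet; the values are illustrative placeholders, not real data.

# A sketch of the tidy table the analytic module expects
tibble::tibble(
  car_model = c("Mazda RX4", "Datsun 710"), # UID
  price     = c(52, 47),                    # target variable
  gear      = c(4, 4),                      # salient feature
  mpg       = c(21, 23)                     # salient feature
)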
Having rudimentary assumptions about what the analytic module is likely to use, we write down assertions in a new module called data-tests.
# data-tests.R

# 1. Check that the dataset exists
stopifnot(exists("cars_data"), is.data.frame(cars_data))

# 2. Check that the necessary columns exist
expected_cols <- c("car_model", "price", "gear", "mpg")
stopifnot(all(expected_cols %in% colnames(cars_data)))

# 3. Check that the records are unique
is.distinct <- function(x) dplyr::n_distinct(x) == length(x)
stopifnot(is.distinct(cars_data$car_model))
To satisfy these assertions, we build a data source around `datasets::mtcars`, augmented with a synthetic[^synthetic-data] price column.
datasets::mtcars %>%
  head() %>%
  knitr::kable(
    caption = "The first 6 car models from `datasets::mtcars`",
    row.names = TRUE,
    digits = 0
  ) %>%
  kableExtra::kable_styling(bootstrap_options = "striped", full_width = TRUE)
# data-access.R
get_cars_data <- function() {
  # 1. Generate records
  data(mtcars, package = "datasets")
  cars_data <- mtcars %>%
    tibble::rownames_to_column("car_model")

  # 2. Generate price
  set.seed(2020)
  price <- runif(n = nrow(cars_data), min = 41, max = 75)
  cars_data <- cars_data %>%
    tibble::add_column(price = price)

  # Run data-tests in this function's environment
  source("data-tests.R", local = TRUE)

  return(cars_data)
}
Notice that without conforming to data-tests assertions, the data source does not come into existence. That means the data source is dependent on data-tests. Recall that the assertions in data-tests are dictated by the intent of the analytic module. It is these assertions that know about the analytic module.
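To see this in action, consider a hypothetical data source that breaks the contract; `get_cars_data_broken` is an illustrative name, not part of the design, and the error message is abridged.

# Hypothetical: a data source that violates the data-tests contract
get_cars_data_broken <- function() {
  cars_data <- datasets::mtcars %>%
    tibble::rownames_to_column("car_model")
  # No `price` column was added, so the column assertion aborts here:
  source("data-tests.R", local = TRUE)
  cars_data
}
# get_cars_data_broken()
#> Error: all(expected_cols %in% colnames(cars_data)) is not TRUE

The broken implementation never returns a dataset; the failure surfaces at the boundary rather than deep inside the analytic module.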
# app.R

# 1. Get the data
source("data-access.R")
cars_data <- get_cars_data()

# 2. Render booklet
print(cars_data %>% dplyr::select(car_model, mpg, gear, price))
# Model the target variable with the salient features
lm(price ~ mpg + gear, cars_data) %>% summary()
What is the cost of letting the analytic module know about the data source? In short, change amplification and cognitive load.
Any analytic app has two essential parts: data source and analytic module. Importantly, in many applications the data source is independent of the analytic module, but the analytic module is dependent on the data source.
This type of relationship puts the data source at the center of the system. The analytic modules revolve around the data source. They are plugins of the data source.
The problem is that both the data source and the analytic module change throughout the project life cycle.
In a data-centric configuration, changes in the data source propagate into the analytic module.
It is not obvious what data in the database is important for the app to run, and if one is asked to replace the data source with another implementation, it is not obvious what the replacement must provide.
The database, its query language, and even its schema are technical details that have nothing to do with the analytic module. They will change at rates, and for reasons, that are independent of other aspects of the system. Consequently, the data-tests separate the data source from the rest of the system so that they can be independently changed.
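For example, the synthetic data source could later be swapped for a flat file without touching data-tests or the analytic module, as long as the new implementation passes the same assertions. A minimal sketch, assuming a hypothetical cars.csv file that holds the required columns:

# data-access.R, alternative implementation
get_cars_data <- function() {
  # 1. Read records from a flat file instead of generating them
  cars_data <- readr::read_csv("cars.csv", show_col_types = FALSE)

  # Run data-tests: the contract is unchanged
  source("data-tests.R", local = TRUE)

  return(cars_data)
}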
The suggested system design:
1. Divides and encapsulates those parts into modules;
2. Separates both parts by introducing an intermediate module dubbed data-tests; and
3. Dictates that both parts must depend on the data-tests module.
The primary advantage of using data-tests is that the data source module and the analytic module know nothing about each other. This allows those modules to evolve frequently and independently.
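The resulting dependency structure can be summarised as follows, where an arrow reads "depends on":

# app.R ------------> data-tests.R <------------ data-access.R
#
# Both the analytic module (app.R) and the data source (data-access.R)
# depend on the contract in data-tests.R; neither depends on the other.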
First, we identified the variables which the analytic module may need.
Second, we wrote down our assumptions in data-tests.
Third, we implemented a data source that conforms to the data-tests.
Finally, we developed our application module using only the available information specified in data-tests.
[^glossary-data-source]: A data source includes data and access (or connection) to the data. The data itself could be obtained from databases, APIs, flat files, etc.

[^glossary-analytic-module]: An analytic module contains the programming logic that processes data extracted from a data source. Prominent examples are dashboards, reports, and predictive models.

[^synthetic-data]: A synthetic dataset contains data that is artificially created rather than generated by actual events.