In WinVector/rquery: Relational Query Generator for Data Manipulation at Scale

`rquery` Join Controller

This note (an update of our 2017 article) and describes a useful tool we call a "join controller" (we discuss this further in our dependency sorting note).

When working on real world predictive modeling tasks in production, the ability to join data and document how you join data is paramount. There are very strong reasons to organize production data in something resembling one of the Codd normal forms. However, for machine learning we need a fully denormalized form (all columns populated into a single to ready to go row, no matter what their provenance, keying, or stride).

This is not an essential difficulty as in relational data systems moving between these forms can be done by joining, and data stores such as PostgreSQL or Apache Spark are designed to provide powerful join capabilities.

However there are some inessential (in that they can be avoided) but substantial difficulties in managing and documenting long join plans. It is not uncommon to have to join 7 or more tables to get an analysis ready. This at first seems trivial, but once you add in the important tasks of narrowing tables (eliminating columns not used later) and disambiguating column names (ensuring unique names after the join) the documentation and work can become substantial. Specifying the join process directly in R code leads to hard to manage, hard to inspect, and hard to share spaghetti code (even when using a high-level data abstraction such as dplyr).

If you have done non-trivial work with production data you have seen this pain point.

The fix is to apply the following principles:

Anything long, repetitive, and tedious should not be done directly.
Moving specification out of code and into data is of huge benefit.
A common special case can be treated separately, as that documents intent.

To supply such a solution the development version of rquery now supplies a item called a "join controller" under the method rquery::executeLeftJoinPlan().

This is easiest to explain through a concrete example, which is what we will do here.

First let's load the needed packages.

execute_vignette <- requireNamespace("RSQLite", quietly = TRUE) &&
  requireNamespace("DiagrammeR", quietly = TRUE)

# load packages
library("rquery")

Now let's load some notional example data. For our example we have:

One primary table of measurements (called "meas1") keyed by id and date.
A fact table that maps ids to patient names (called "names", and keyed by id).
A second table of additional measurements (called "meas2") That we consider "nice to have." That is: rows missing from this table should not censor-out meas1 rows, and additional rows found here should not be included in the analysis.

The data is given below:

# load notional example data
raw_connection <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
RSQLite::initExtension(raw_connection)
my_db <- rquery_db_info(
  connection = raw_connection,
  is_dbi = TRUE,
  connection_options = rq_connection_tests(raw_connection))

# example data
DBI::dbWriteTable(raw_connection,
                  'meas1_train',
                  data.frame(id= c(1,1,2,2),
                             date= c(1,2,1,2),
                             weight= c(200, 180, 98, 120),
                             height= c(60, 54, 12, 14)))
DBI::dbWriteTable(raw_connection,
                  'names_facts',
                  data.frame(id= seq_len(length(letters)),
                             name= letters,
                             stringsAsFactors=FALSE))

DBI::dbWriteTable(raw_connection,
                  'meas2_train',
                  data.frame(pid= c(2,3),
                             date= c(2,2),
                             weight= c(105, 110),
                             width= 1))

An important (and very neglected) step in data science tasks is documenting roles of tables, especially their key-structure (which we also call "stride" in the sense it describes how you move from row to row). rquery::describe_tables() is a function that builds an initial description of the tables.

# map from abstract names to realized names
trainTables <- c('meas1_train', 
                 'names_facts', 
                 'meas2_train')

tDesc <- describe_tables(my_db, trainTables,
                         keyInspector = key_inspector_sqlite)

tDesc is essentially a slightly enriched version of the data handle concordance described in "Managing Spark data handles in R." We can take a quick look at the stored simplified summaries:

knitr::kable(tDesc)

tableDescription() produces tables that hold the following:

tableName: the abstract name we wish to use for this table.
isEmpty: an advisory column indicating if any rows were present when we looked.
indicatorColumn: a name to indicate if the table contributed a row in later joins.
columns: the list of columns in the table.
keys: a named list mapping abstract key names to table column names. The set of keys together is supposed to uniquely identify rows.
colClasses: a vector of column classes of the underlying table.

The intent is to build a detailed join plan (describing order, column selection, and column re-naming) from the tDesc table. We can try this with the supplied function buildJoinPlan(), which in this case tells us our table descriptions are not quite ready to specify a join plan:

tryCatch(
  build_join_plan(tDesc),
  error = function(e) {e}
)

In the above the keys column is wrong in that it claims every column of each table is a table key. The join plan builder noticed this is unsupportable in that when it comes time to join the "names" table not all of the columns that are claimed to be "names" keys are already known from previous tables. That is: the "names$name" column is present in the earlier tables, and so can not be joined on. We can't check everything, but the join controller tries to "front load" or encounter as many configuration inconsistencies early- before any expensive steps have been started.

The intent is: the user should edit the "tDesc" keys column and share it with partners for criticism. In our case we declare the primary of the measurement tables to be PatientID and MeasurementDate, and the primary key of the names table to be PatientID. Notice we do this by specifying named lists or vectors mapping desired key names to names actually used in the tables.

# declare keys (and give them consistent names)
tDesc$keys[[1]] <- c(PatientID= 'id', MeasurementDate= 'date')
tDesc$keys[[2]] <- c(PatientID= 'id')
tDesc$keys[[3]] <- c(PatientID= 'pid', MeasurementDate= 'date')

print(tDesc$keys)

The above key mapping could then be circulated to partners for comments and help. Notice since this is not R code we can easily share it with non-R users for comment and corrections.

Once we are satisfied with our description of tables we can build a join plan. The join plan is an ordered sequence of left-joins.

In practice, when preparing data for predictive analytics or machine learning there is often a primary table that has exactly the set of rows you want to work over (especially when encountering production star-schemas). By starting joins from this table we can perform most of our transformations using only left-joins. To keep things simple we have only supplied a join controller for this case. This is obviously not the only join pattern needed; but it is the common one.

A join plan can now be built from our table descriptions:

# build the column join plan
columnJoinPlan <- build_join_plan(tDesc)
knitr::kable(columnJoinPlan)

Essentially the join plan is an unnest of the columns from the table descriptions. This was anticipated in our article "Managing Spark Data Handles".

We then alter the join plan to meet our needs (either through R commands or by exporting the plan to a spreadsheet and editing it there).

Only columns named in the join plan with a value of TRUE in the want column are kept in the join (columns marked isKey must also have want set to TRUE). This is very useful as systems of record often have very wide tables (with hundreds of columns) of which we only want a few columns for analysis.

For example we could decide to exclude the width column by either dropping the row or setting the row's want column to FALSE.

Since we have edited the join plan it is a good idea to both look at it and also run it through the inspectDescrAndJoinPlan() to look for potential inconsistencies.

# decide we don't want the width column
columnJoinPlan$want[columnJoinPlan$resultColumn=='width'] <- FALSE
# double check our plan
if(!is.null(inspect_join_plan(tDesc, columnJoinPlan))) {
  stop("bad join plan")
}
knitr::kable(columnJoinPlan)

The join plan is the neglected (and often missing) piece of documentation key to non-trivial data science projects. We strongly suggest putting it under source control, and circulating it to project partners for comment.

As a diagram the key structure of the join plan looks like the following:

columnJoinPlan %.>%
  graph_join_plan(.) %.>%
  DiagrammeR::grViz(.)

Once you have a good join plan executing it easy.

optree <- actualize_join_plan(columnJoinPlan,
                              add_ind_cols = TRUE)

cat(format(optree))

optree %.>%
  op_diagram(., merge_tables = TRUE) %.>%
  DiagrammeR::grViz(.)

res <- execute(my_db, optree)
str(res)

A good workflow is:

Build simple procedures to build up tDesc.
Work hard to get a good columnJoinPlan.
Save columnJoinPlan in source control and re-load it (not re-build it) when you need it.
Re-build new tDesc compatible with the saved columnJoinPlan later when you need to work with tables (note only the columns tableName and handle are used during join execution, so you only need to create those).

Having to "join a bunch of tables to get the data into simple rows" is a common step in data science. Therefore you do not want this to be a difficult and undocumented task. By using a join controller you essentially make the documentation the executable specification for the task.

# cleanup
DBI::dbDisconnect(raw_connection)

Related work:

WinVector/rquery documentation built on Aug. 24, 2023, 11:12 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

WinVector/rquery
Relational Query Generator for Data Manipulation at Scale

In WinVector/rquery: Relational Query Generator for Data Manipulation at Scale

`rquery` Join Controller

R Package Documentation

Browse R Packages

We want your feedback!

WinVector/rquery Relational Query Generator for Data Manipulation at Scale

In WinVector/rquery: Relational Query Generator for Data Manipulation at Scale

rquery Join Controller

R Package Documentation

Browse R Packages

We want your feedback!

WinVector/rquery
Relational Query Generator for Data Manipulation at Scale

`rquery` Join Controller