Example of data transforms as categorical arrows (this is the R version; a Python version is also available).

(For ideas on applying category theory to science and data, please see David I. Spivak, Category Theory for the Sciences, MIT Press, 2014.)

The R rquery package supplies a number of operators for working with tabular data. The operators are chosen with reference to Codd's relational algebra, though (as with SQL) we do not insist on table rows being unique. Many of the operations are simple: selecting rows, selecting columns, joining tables. Two of the operations stand out: projecting or aggregating rows, and extending tables with new derived columns.

An interesting point: while the rquery operators are fairly generic, the operator pipelines that map a single table to a single table form the arrows of a category over a nice set of objects.

The objects of this category can be either of:

* a set of column names, or
* a set of column names with associated column types.

I will take a liberty and call these objects (with or without types) "single table schemas."
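
For example, rquery's mk_td() builds such an object directly: a table description recording a table name plus a set of column names (a minimal sketch; the table name 'd' and the columns here are just illustrative).

library(rquery)

# an object of the category: a single table schema,
# recorded as a table name plus a set of column names
obj <- mk_td('d', c('g', 'x', 'v', 'i'))
print(obj)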

Our setup is easiest to explain with a worked example in R.

First we import our packages and instantiate an example data frame.

library(wrapr)
library(rquery)
library(rqdatatable)

d <- data.frame(
    'g' = c('a', 'b', 'b', 'c', 'c', 'c'),
    'x' = c(1, 4, 5, 7, 8, 9),
    'v' = c(10, 40, 50, 70, 80, 90),
    'i' = c(TRUE, TRUE, FALSE, FALSE, FALSE, FALSE)
)

knitr::kable(d)

rquery operator pipelines are designed to transform data. For example, we can define the following operator pipeline, which is designed to count how many different values there are for g and to assign a unique integer id to each group.

table_description <- local_td(d)

id_ops_a <- table_description %.>%
    project(., groupby = 'g') %.>%    # one row per distinct value of g
    extend(.,
        ngroup := row_number())       # assign a unique integer id to each group

The pipeline is saved in the variable id_ops_a, which can then be applied to our data as follows.

d %.>% 
  id_ops_a %.>% 
  knitr::kable(.)

The pipelines are designed for composition in addition to application to data. For example we can use the id_ops_a pipeline as part of a larger pipeline as follows.

id_ops_b <- table_description %.>%
    natural_join(., b = id_ops_a, by = 'g', jointype = 'LEFT')

This pipeline joins the integer group ids back into the original table, as we can see by applying it to the data.

d %.>% 
  id_ops_b %.>% 
  knitr::kable(.)

Notice the ngroup column is a function of the g column in this result.

I am now ready to state my big point. These pipelines have documented pre- and post-conditions: what set of columns (and optionally types) they expect on their input, and what set of columns (and optionally types) they produce.

# columns the pipeline needs on its input (pre-condition)
columns_used(id_ops_b)
# columns the pipeline produces (post-condition)
column_names(id_ops_b)

This is where we have a nice opportunity to use category theory to manage our pre- and post-conditions. Let's wrap this pipeline into a convenience class to make the categorical connection easier to see.

a1 <- arrow(id_ops_b)
print(a1)

a1 is a category theory arrow: it has the usual domain (arrow base, or incoming object) and co-domain (arrow head, or outgoing object) in a category of single-table schemas.

# dom or domain
a1$incoming_columns
# cod or co-domain
a1$outgoing_columns

These are the values shown in the succinct presentation of the arrow.

print(a1)

The arrow has a more detailed presentation, which is the realization of the operator pipeline as code.

print(a1, verbose = TRUE)

We can think of our arrows (or obvious mappings of them) as being able to be applied to:

* more arrows of the same type (composition),
* data (action, or application),
* single table schemas (managing pre- and post-conditions).

Arrows can be composed or applied by using the notation d %.>% a1. Note: we are not thinking of %.>% itself as an arrow, but as a symbol for composition of arrows.

d %.>%
  a1 %.>%
  knitr::kable(.)

Up until now we have been showing how we work to obey the category theory axioms. From here on we look at what category theory does for us: it checks correct composition and ensures full associativity of operations.

As is typical in category theory, there can be more than one arrow from a given object to a given object. For example, the following is a different arrow with the same start and end.

a1b <- arrow(
  table_description %.>%
    extend(.,
           ngroup := 0)
)

print(a1b)

However, the a1b arrow represents a different operation than a1:

d %.>%
  a1b %.>%
  knitr::kable(.)

Two arrows can be composed exactly when the post-conditions of the first match the pre-conditions of the second.

Here are two examples of violating the pre- and post-conditions. The point is that the categorical conditions enforce the checking for us: we can't compose arrows whose co-domain and domain don't match. Up until now we have been setting things up to make the categorical machinery work; now this machinery will work for us and make the job of managing complex data transformations easier.

# use data.table's shift() to compute lagged values
shift <- data.table::shift

# too small: the declared input is missing the column i, which a1 produces
ordered_ops <- mk_td('d2', c("g", "x", "v", "ngroup")) %.>%
    extend(.,
        row_number := row_number(),
        v_shift := shift(v),
        orderby = 'x',
        partitionby = 'g')

a2 <- arrow(ordered_ops)
print(a2)
a1 %.>% a2
# too big: the declared input includes a column q, which a1 does not produce
ordered_ops <- mk_td('d2', c("g", "x", "v", "i", "ngroup", "q")) %.>%
    extend(.,
        row_number := row_number(),
        v_shift := shift(v),
        orderby = 'x',
        partitionby = 'g')

a2 <- arrow(ordered_ops)
print(a2)
a1 %.>% a2

The point is: we will never see the above exceptions when we compose arrows that match on pre- and post-conditions (which in category theory are the only arrows you are allowed to compose).
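
The compatibility requirement is just a comparison of column sets, and we can inspect a mismatch directly using the incoming_columns and outgoing_columns fields shown earlier (a quick sketch, run against the failing "too big" arrow above):

# columns a2 declares that a1 does not supply (the stray column q)
setdiff(a2$incoming_columns, a1$outgoing_columns)
# columns a1 supplies that a2 does not declare
setdiff(a1$outgoing_columns, a2$incoming_columns)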

When the pre- and post-conditions are met, the arrows compose in a fully associative manner.

ordered_ops <- mk_td('d2', colnames(id_ops_b)) %.>%
    extend(.,
        row_number := row_number(),
        v_shift := shift(v),
        orderby = 'x',
        partitionby = 'g')

a2 <- arrow(ordered_ops)
print(a2)
a1 %.>% a2
print(a1 %.>% a2)

We can add yet another set of operations to our pipeline: computing a per-group variable mean.

unordered_ops <- mk_td('d3', colnames(ordered_ops)) %.>%
    extend(.,
        mean_v := mean(v),
        partitionby = 'g')

a3 <- arrow(unordered_ops)
print(a3)
print(a3)

The three arrows can form a composite pipeline that computes a number of interesting per-group statistics all at once.

a1 %.>% a2 %.>% a3

And the compositions are fully associative: the arrows can be grouped in any way, as long as the original order is preserved.

ops1 <- (a1 %.>% a2) %.>% a3
ops1

(Note: we are using the .() notation to signal that the expression a2 %.>% a3 is to be evaluated before being treated as a step in the wrapr pipeline. This is required because this R pipe delays evaluation of its right arguments.)

ops2 <- a1 %.>% .(a2 %.>% a3)

All the compositions are in fact the same arrow, as we can see by applying them to data.

d %.>%
  ops1 %.>% 
  knitr::kable(.)
d %.>%
  ops2 %.>% 
  knitr::kable(.)

The combination operator %.>% is fully associative over the combination of data and arrows.
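
As a quick check (a sketch reusing the arrows defined above), applying the arrows to the data one at a time and applying a pre-composed combination of them should produce the same table:

# apply the arrows to the data one at a time
r1 <- d %.>% a1 %.>% a2 %.>% a3
# pre-compose the arrows, then apply the combined arrow to the data
r2 <- d %.>% .(a1 %.>% a2 %.>% a3)
all.equal(r1, r2)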

The underlying rquery steps already compute and check very similar pre- and post-conditions; the arrow class just makes this look more explicitly like arrows moving between objects in a category.

The data arrows operate over three different value domains:

* other arrows of the same type (composition),
* single table schemas (pre- and post-condition management),
* data (application).

An example of treating a 2-argument data operation (such as a join) as an arrow can be found here.

(Examples of advanced inter-operation between the R rquery package and the Python data_algebra package and SQL can be found here.)


