In WinVector/cdata: Fluid Data Transformations

John Mount and Nina Zumel

Introduction

This is an example of an interesting data-wrangling/data-reshaping problem and how to solve it in R using rqdatatable/cdata. The Python version of this example can be found here.

The problem

In an RStudio Community note, user hklovs asked how to re-organize some data. The ask was essentially to transform data from this format:

d <- wrapr::build_frame(
  "ID"  , "OP", "DATE"                |
    1   , "A" , "2001-01-02 00:00:00" |
    1   , "B" , "2015-04-25 00:00:00" |
    2   , "A" , "2000-04-01 00:00:00" |
    3   , "D" , "2014-04-07 00:00:00" |
    4   , "C" , "2012-12-01 00:00:00" |
    4   , "A" , "2005-06-16 00:00:00" |
    4   , "D" , "2009-01-20 00:00:00" |
    4   , "B" , "2009-01-20 00:00:00" |
    5   , "A" , "2010-10-10 00:00:00" |
    5   , "B" , "2003-11-09 00:00:00" |
    6   , "B" , "2004-01-09 00:00:00" )

knitr::kable(d)

Into this format:

library(wrapr)
library(rquery)
library(rqdatatable)
library(cdata)

# draw a picture of the record format
diagram <- wrapr::qchar_frame(
  "rank", "DATE", "OP" |
    "1",   DATE1, OP1  |
    "2",   DATE2, OP2  |
    "3",   DATE3, OP3  )

transform <- blocks_to_rowrecs_spec(
  controlTable = diagram,
  recordKeys = 'ID')

# a function to paste a vector of strings together
concat_values = function(v) {
  paste(sort(unique(v)), collapse=", ")
}

ops <- local_td(d) %.>%
  project(.,
          OP := concat_values(OP),    
          groupby = c("ID", "DATE")) %.>%
  extend(.,
         rank %:=% row_number(),
         partitionby = "ID",
         orderby = "DATE") %.>%
  transform %.>%
  orderby(.,
          'ID')

res <- d %.>% ops
knitr::kable(res)

That is: for each ID pick the first three operations ordered by date, merging operations with the same timestamp. Then write these results into a single row for each ID.

The solution

A good way to solve any data-wrangling problem is to:

Get a good definition of what is wanted
Re-process the data so any advisory column you wished you had is actually there
And finish the problem.

Let's apply this process to our example problem.

Adding an advisory rank column

The first step isn't to worry about the data format, as it is an inessential or solvable difficulty. Instead make any extra descriptions or controls you need explicit. In this case we need to date-rank and to merge the operations (per ID). So let's do that first.

# bring in all of our packages
library(wrapr)
library(rquery)
library(rqdatatable)
library(cdata)

# some example data
d <- wrapr::build_frame(
  "ID"  , "OP", "DATE"                |
    1   , "A" , "2001-01-02 00:00:00" |
    1   , "B" , "2015-04-25 00:00:00" |
    2   , "A" , "2000-04-01 00:00:00" |
    3   , "D" , "2014-04-07 00:00:00" |
    4   , "C" , "2012-12-01 00:00:00" |
    4   , "A" , "2005-06-16 00:00:00" |
    4   , "D" , "2009-01-20 00:00:00" |  # this and the next row
    4   , "B" , "2009-01-20 00:00:00" |  # are on the same date
    5   , "A" , "2010-10-10 00:00:00" |
    5   , "B" , "2003-11-09 00:00:00" |
    6   , "B" , "2004-01-09 00:00:00" )


# a function to paste a vector of strings together
concat_values = function(v) {
  paste(sort(unique(v)), collapse=", ")
}

# merge the operations to get one row per ID and DATE
# then rank the rows for each ID by DATE
ops <- local_td(d) %.>%
  project(.,  # fuse all the ops on same date/id into one string
          OP := concat_values(OP),  
          groupby = c("ID", "DATE")) %.>%
  extend(.,   # rank each ID group in order of date
         rank %:=% row_number(),
         partitionby = "ID",
         orderby = "DATE")

# apply the steps to the data
d2 <- d %.>% ops

# show the intermediate results.
knitr::kable(d2)

In the above code we used the project() operator to merge rows with duplicate ID and DATE into a single string listing all the operations that occurred, for example "B, D". Then we added a rank column. This gives us all the information we need for a complete solution to the original problem. Now all we have to do is re-arrange the data.

Reshaping the data

To reshape the data, we strongly encourage drawing out what one wants it terms of one input record and one output record. With cdata doing so essentially solves the problem.

So let's look at what happens only to the rows with ID == 1. In this case (after some column re-ordering, and row-padding) we expect an input block record of rows that look like this:

r_example <- as.data.frame(d2[d2$ID==1, , drop=FALSE])
r_example <- r_example[qc(ID, rank, DATE, OP)]
r_example <- rbind(r_example,
                   data.frame(ID = 1, rank = 3, DATE = NA, OP = NA))
knitr::kable(r_example)

And we want this record transformed into this:

knitr::kable(res[res$ID==1, , drop=FALSE])

We call the above record form a row record, because all the data for a given ID is in a single row. When the data for a given ID is not in a single row, we say it is in a block. In addition to having a per-record key (ID in our example), each row of a block is uniquely identified by an in-record structure key (in this case, rank).

cdata moves records from row shaped to block shaped, and vice-versa (It can also move data from one block shape to another, by going through a row).

To use cdata, draw a picture of any block record in its full generality. In our case the interesting record is the input shape, which looks like the following (with the record ID columns suppressed for conciseness).

# draw a picture of the record format
diagram <- wrapr::qchar_frame(
  "rank", "DATE", "OP" |
    "1",   DATE1, OP1  |
    "2",   DATE2, OP2  |
    "3",   DATE3, OP3  )

knitr::kable(diagram)

The column names rank, DATE, and OP are all column names of the table we are starting with. The values 1, 2, and 3 are all values we expect to see in the rank column of the working data frame. And the symbols DATE1, DATE2, DATE3, OP1, OP2, and OP3 are all stand-in names for values we see in our data. These symbols will be the column names of our new row-records.

By default, the first column of a diagram is the in-record key (that is why we put rank first). However, any set of columns can be specified as the in-record keys through the package interfaces.

We have tutorials on how to build these diagrams here and here. Essentially we draw one record of the input and output and match column names to stand-in interior values of the other. The output record is a single row, so we don't have to explicitly pass it in. However it looks like the following.

row_record <- wrapr::qchar_frame(
  "DATE1", "OP1", "DATE2", "OP2", "DATE3", "OP3" |
   DATE1 ,  OP1 ,  DATE2 ,  OP2 ,  DATE3 ,  OP3  )

knitr::kable(row_record)

Notice the interior-data portions (the parts we wrote in the inputs as unquoted) of each table input are the cells that are matched from one record to the other. These are in fact just the earlier sample inputs and outputs with the values replaced with the placeholders DATE1, DATE2, DATE3, OP1, OP2, and OP3.

We have a two minute video of how to work out the placeholder labels and build the diagram here.

With the diagram in hand we can specify the data reshaping step. Since we are moving the data from blocks to row records, we use the function blocks_to_rowrecs_spec to create the reshaping transform.

transform <- blocks_to_rowrecs_spec(
  controlTable = diagram, # data frame describing the block
  recordKeys = 'ID')

The transform specifies that records are found in the format shown in diagram, and are to be converted to rows. We can confirm the intent by printing the transform.

print(transform)

If we apply this transform to the intermediate table d2, we have the data in the format we need (except possibly for the order of ID).

d2 %.>% # this MUST be the wrapr dot-pipe
  transform %.>%
  knitr::kable(.)

# if you prefer not to use a pipe:
# layout_by(transform, d2)

The full transformation

We are now ready to put all of our operations together into one composite pipeline, starting from a specification of the original data d.

# specify the operations 
ops <- local_td(d) %.>%
  project(.,  # fuse all the ops on same date/id into one string
          OP := concat_values(OP),
          groupby = c("ID", "DATE")) %.>%
  extend(.,  # rank each ID group in order of date
         rank %:=% row_number(),
         partitionby = "ID",
         orderby = "DATE") %.>%
  transform %.>%  # transform the record shape
  orderby(.,  # ensure presentation is ordered by ID
          'ID')

# apply the operations to data
res <- d %.>% ops

# present the results
knitr::kable(res)

We could also try this with an alternate concatenation that builds lists (instead of concatenating strings).

# a function to collect values in a sorted list
concat_values2 = function(v) {
  list(sort(unique(v)))
}


# specify the operations 
ops2 <- local_td(d) %.>%
  project(.,  # fuse all the ops on same date/id into one string
          OP := concat_values2(OP),
          groupby = c("ID", "DATE")) %.>%
  extend(.,  # rank each ID group in order of date
         rank %:=% row_number(),
         partitionby = "ID",
         orderby = "DATE") %.>%
  transform %.>%  # transform the record shape
  orderby(.,  # ensure presentation is ordered by ID
          'ID')

# apply the operations to data
res2 <- d %.>% ops2

# present the results
knitr::kable(res2)

And we are done.

A variation

If we had not wanted to merge ties, the solution would look like this:

# specify the operations 
ops <- local_td(d) %.>%
  extend(., # now we have to order by Date AND op
         rank %:=% row_number(),
         partitionby = "ID",
         orderby = c("DATE", "OP")) %.>%
  transform %.>%  # transform the record shape
  orderby(.,  # ensure presentation is ore-red by ID
          'ID')

# apply the operations to data
res <- d %.>% ops

# present the results
knitr::kable(res)

Currently in rquery/R all of the steps except the concat_values step are easy to translate into SQL. So this variation (which does not include the problematic aggregation step) is easy to translate into SQL for use in databases.

# get an example database connection
raw_connection <- DBI::dbConnect(RSQLite::SQLite(), ":memory:")
RSQLite::initExtension(raw_connection)
db <- rquery_db_info(
  connection = raw_connection,
  is_dbi = TRUE,
  connection_options = rq_connection_tests(raw_connection))

# copy data into database
rquery::rq_copy_to(db, table_name = 'd', d)

# run the query and land the result in the database
# WITHOUT round-tripping the data through R
rquery::materialize(db, ops, table_name = 'res')

# read the result so we can see it
res_db <- DBI::dbReadTable(raw_connection, 'res')

# close the DB connection
DBI::dbDisconnect(raw_connection)

# present the results
knitr::kable(res_db)

Note: column order is not considered essential in rquery pipelines (though it is easy to fix once you are in R).

res_db <- res_db[qc(ID, DATE1, OP1, DATE2, OP2, DATE3, OP3)]

knitr::kable(res_db)

Also, the entire query can be run in a database with the correct user-specified aggregation function. We have a demonstration of this in our Python version of this example.

WinVector/cdata documentation built on Sept. 29, 2024, 5:04 a.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

Tweet to @rdrrHQ

GitHub issue tracker

ian@mutexlabs.com