Let's try some "ugly corner cases" for data manipulation in R
. Corner cases are examples where the user might be running to the edge of where the package developer intended their package to work, and thus often where things can go wrong.
Let's see what happens when we try to stick a fork in the power-outlet.
For our example let's set up some data.frame
s.
# create some exmaple tables d1 <- data.frame(v1 = c("a", "b"), stringsAsFactors = FALSE) d2 <- data.frame(v2 = c("x", "y"), stringsAsFactors = FALSE) d3 <- data.frame(x = 1, y = 2, z = 3)
And we will also copy this data into a database.
# copy the example tables to a database db <- DBI::dbConnect(RSQLite::SQLite(), ":memory:") DBI::dbWriteTable(db, "d1d", d1) DBI::dbWriteTable(db, "d2d", d2) DBI::dbWriteTable(db, "d3d", d3)
Now let's try a couple of basic tasks using dplyr
.
# try to use dplyr to work with the original tables # in memory suppressPackageStartupMessages(library("dplyr")) packageVersion("dplyr")
First we try a cross-join.
# try to do a cross-join left_join(d1, d2, by = character(0)) # join is rejected
Obviously that is something deliberately prohibited by the package authors.
Let's try dividing all columns by a value.
# try to divide all columns by y mutate_all(d3, ~(./y)) # z is divided by 1, not by the original # y-value
Something went wrong, notice z
was not altered. We will return to this later.
Now let's try the same operations using dplyr
on a database.
# try the same calculations in database # using dplyr packageVersion("dbplyr") # get references to the database tables d1d <- dplyr::tbl(db, "d1d") d2d <- dplyr::tbl(db, "d2d") d3d <- dplyr::tbl(db, "d3d") # try to do a cross-join left_join(d1d, d2d, by = character(0)) # this time it works
In this case the cross join works (as databases expect to support this operation).
Okay, let's try the divide by columns example again.
# try to divide all columns by y mutate_all(d3d, ~(./y)) # this time it errors-out
Boom (ouch).
Now let's try that again with the rquery
package.
# try the join task using rquery library("rquery") # get references to the database tables d1r <- rquery::db_td(db, "d1d") d2r <- rquery::db_td(db, "d2d") d3r <- rquery::db_td(db, "d3d") # try the cross-join ops <- natural_join(d1r, d2r, by = character(0), jointype = "LEFT") execute(db, ops)
The cross-join worked in the database.
Now we start on the column division. We have to specify
the set of operations explicitly, but that really isn't too
big of burden (as rquery
supplies tools for this).
# try to get column alterations vars <- colnames(d3) expressions <- paste(vars, "%:=%", vars, "/", "y") print(expressions) ops2 <- extend_se(d3r, expressions) cat(format(ops2)) # Oh! This isn't what I want, see # how y gets updated before it is used on column z.
We took the extra step of examining the operations before
we tried them. Notice the rquery::extend()
operation
got factored into explicit steps. This introduces inspectable guarantees
as to what value each column has at each step. This is enough
to show us that the y
column will be all ones before it is used
to try and adjust the z
column. This is not what we wanted, though
this query is guaranteed to have similar semantics both on-database and
in-memory.
Now let's build, examine and execute a better version of the query.
# try to get column alterations again vars <- setdiff(colnames(d3), "y") expressions <- paste(vars, "%:=%", vars, "/", "y") print(expressions) ops2b <- d3r %.>% extend_se(., expressions) %.>% extend(., y = 1) cat(format(ops2b)) execute(db, ops2b)
With the rqdatatable
package we can use data.table
to process
in-R data.
# try the join task using rqdatatable library("rqdatatable") # immediate notation natural_join(d1, d2, by = character(0)) # join task in operators notation ex_data_table(ops, tables = list(d1d = d1, d2d = d2)) # column alterations in operators notation ex_data_table(ops2b, tables = list(d3d = d3))
And there we are. Frankly we didn’t really have to stretch things very far to break things (including building up non-signalling data mistakes; inspect your intermediate results!!).
We have been trying to keep the number of unexpected behaviors in rquery down by keeping the rquery implementation very simple (and very thin) and then relying on either the database or data.table for the actual implementation and semantics. There are, of course, going to be cases where rquery needs a fix- but we have been able to find and apply such fixes quite quickly. We have also found fixing rquery is much faster than coding around bugs.
# selecting rows using set membership d1 %.>% set_indicator(., "want", "v1", c('a')) d1 %.>% set_indicator(., "want", "v1", c()) d1r %.>% set_indicator(., "want", "v1", c('a')) %.>% execute(db, .) d1r %.>% set_indicator(., "want", "v1", c()) %.>% execute(db, .) d1 %>% filter(v1 %in% c('a')) d1 %>% filter(v1 %in% c()) d1d %>% filter(v1 %in% c('a')) d1d %>% filter(v1 %in% c()) # https://github.com/tidyverse/dplyr/issues/3375 fixed
# clean up DBI::dbDisconnect(db)
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.