rquery Many Columns

A common data manipulation need is: applying the same operation to a number of columns.

In the rquery R package we strongly recommend using value-oriented (or standard evaluation) for this task.

Here is what this looks like.

For our data lets build a simple data set similar to mtcars:

library(wrapr)
library(rquery)
have_rqdatatable <- requireNamespace('rqdatatable', quietly = TRUE)

example_data <- wrapr::build_frame(
   "mpg"  , "cyl", "disp",  "car"               |
     21   , 6    , 160   ,  "Mazda RX4"         |
     21   , 6    , 160   ,  "Mazda RX4 Wag"     |
     22.8 , 4    , 108   ,  "Datsun 710"        |
     21.4 , 6    , 258   ,  "Hornet 4 Drive"    |
     18.7 , 8    , 360   ,  "Hornet Sportabout" |
     18.1 , 6    , 225   ,  "Valiant"           )

knitr::kable(example_data)

Now suppose for a number of columns we wish to perform a calculation, such centering it with respect to the grand average.

This is easily handled by first specifying the set of variables we wish to work with.

vars <- setdiff(colnames(example_data), 'car')

print(vars)

Now we build up what we want as a name-vector of strings using the := named map builder.

expressions <- vars := paste0(vars, ' - mean(', vars, ')')

print(expressions)

The idea is: the := operator fits into R idiom by looking very much like a vectorized version of "names get assigned expressions".

These expressions can then be used in an rquery pipeline using the _se-variant of extend(): extend_se().

ops <- local_td(example_data) %.>%
  extend_se(., expressions)

cat(format(ops))

And this operator pipeline is ready to use (assuming we have rqdatatable available):

if(have_rqdatatable) {
  example_data %.>%
    ops %.>%
    knitr::kable(.)
}

The expression construction can also be done inside the extend_se() operator.

ops <- local_td(example_data) %.>%
  extend_se(., vars := paste0(vars, ' - mean(', vars, ')'))

cat(format(ops))

Note: the above is only a notional example to demonstrate the operations; for supervised machine learning we would probably use base::scale(), which saves the learned centering for later re-use on new data.

The point is: we use standard R tools to build up the lists of names and operations. We are not restricted to any single argument manipulation grammar.

For example we could build all interaction terms as follows.

combos <- t(combn(vars, 2))
interactions <- 
  paste0(combos[, 1], '_', combos[, 2]) := 
  paste0(combos[, 1], ' * ', combos[, 2])

print(interactions)
ops <- local_td(example_data) %.>%
  extend_se(., interactions)

cat(format(ops))

It is a critical advantage to work with sets of variables as standard values. In this case what we can do is limited only by the power of R itself.

Note: we also supply an alias for := as %:=% for those that don't want to confuse this assignment with how the symbol is used in data.table. Take care that := has the precedence-level of an assignment and %:=% has the precedence-level of a user defined operator.

As, as always, our queries can be used on data.

if(have_rqdatatable) {
  example_data %.>%
    ops %.>%
    knitr::kable(.)
}

Or even in a database.

have_db <- requireNamespace("DBI", quietly = TRUE) && 
  requireNamespace("RSQLite", quietly = TRUE)

if(have_db) {
  raw_connection <- DBI::dbConnect(RSQLite::SQLite(), 
                                   ":memory:")
  RSQLite::initExtension(raw_connection)
  db <- rquery_db_info(
    connection = raw_connection,
    is_dbi = TRUE,
    connection_options = rq_connection_tests(raw_connection))

  rq_copy_to(db, 'example_data',
             example_data,
             temporary = TRUE, 
             overwrite = TRUE)

  sql <- to_sql(ops, db)

  cat(format(sql))
}
if(have_db) {
  res_table <- materialize(db, ops)
  DBI::dbReadTable(raw_connection, res_table$table_name) %.>%
    knitr::kable(.)
}
if(have_db) {
  DBI::dbDisconnect(raw_connection)
}


Try the rquery package in your browser

Any scripts or data that you put into this service are public.

rquery documentation built on Aug. 20, 2023, 9:06 a.m.