extras/AdHocQueries.md

Ad Hoc Queries

John Mount, Win-Vector LLC 2018-07-06

Database Operator Pipelines

rquery's primary workflow is building re-usable database operator pipelines.

Let's try an example. First let's set up our example database and data.

library("rquery")
db = DBI::dbConnect(RSQLite::SQLite(), 
                    ":memory:")
RSQLite::initExtension(db)

DBI::dbWriteTable(db,
                  'd',
                  data.frame(AUC = 0.6, 
                             R2 = c(0.1, 0.2), 
                             D = NA, z = 2),
                  overwrite = TRUE,
                  temporary = TRUE)
d <- db_td(db, 'd')
print(d)
## [1] "table('`d`'; AUC, R2, D, z)"
DBI::dbGetQuery(db, to_sql(d, db))
##   AUC  R2  D z
## 1 0.6 0.1 NA 2
## 2 0.6 0.2 NA 2

Now we can define a query over this table.

q <- d %.>%
  select_rows_nse(., R2 > 0.14) %.>%
  extend_nse(., c = sqrt(R2)) %.>%
  select_columns(., c("AUC", "R2", "c"))

The idea is:

We can print the query/operator pipeline:

cat(format(q))
table('`d`'; 
  AUC,
  R2,
  D,
  z) %.>%
 select_rows(.,
   R2 > 0.14) %.>%
 extend(.,
  c := sqrt(R2)) %.>%
 select_columns(.,
   AUC, R2, c)

And we can ask questions of it:

column_names(q)
## [1] "AUC" "R2"  "c"
tables_used(q)
## [1] "d"
columns_used(q)
## $d
## [1] "AUC" "R2"

And we can convert the operator pipeline to SQL which can then be applied to an actual database table.

sql <- to_sql(q, db)
cat(sql)
SELECT
 `AUC`,
 `R2`,
 `c`
FROM (
 SELECT
  `AUC`,
  `R2`,
  sqrt ( `R2` )  AS `c`
 FROM (
  SELECT * FROM (
   SELECT
    `AUC`,
    `R2`
   FROM
    `d`
  ) tsql_15389734199373316826_0000000000
  WHERE `R2` > 0.14
  ) tsql_15389734199373316826_0000000001
) tsql_15389734199373316826_0000000002
DBI::dbGetQuery(db, sql) %.>%
  knitr::kable(.)

| AUC| R2| c| |----:|----:|----------:| | 0.6| 0.2| 0.4472136|

Ad Hoc mode

rquery also has an "Ad Hoc" mode for interactive analysis. In this mode things are sped up in that the use can work with in-memory tables and also skip the table modeling step.

Let's first set the global option rquery.rquery_db_executor to our database handle so the ad hoc mode knows which database to use to implement the analyses.

old_o <- options(list("rquery.rquery_db_executor" = list(db = db)))

We can now run operators directly on in-memory data.frames.

dL <- data.frame(AUC = 0.6, 
                 R2 = c(0.1, 0.2), 
                 D = NA, z = 2)

# use data frame to define the pipeline, captures only column structure
ops <- dL %.>%
  select_rows_nse(., R2 > 0.14)

# apply pipeline to any data frame with similar column structure
dL %.>% 
  ops %.>% 
  knitr::kable(.)

| AUC| R2| D| z| |----:|----:|----:|----:| | 0.6| 0.2| NA| 2|

ops <- dL %.>%
  select_rows_nse(., R2 > 0.14) %.>%
  extend_nse(., c = sqrt(R2))  %.>%
  select_columns(., c("AUC", "R2", "c")) 

dL %.>% 
  ops %.>% 
  knitr::kable(.)

| AUC| R2| c| |----:|----:|----------:| | 0.6| 0.2| 0.4472136|

dL %.>% 
  select_rows_nse(., R2 > 0.14) %.>%
  extend_nse(., c = sqrt(R2))  %.>%
  select_columns(., c("AUC", "R2", "c")) %.>% 
  knitr::kable(.)

| AUC| R2| c| |----:|----:|----------:| | 0.6| 0.2| 0.4472136|

Cleanup

options(old_o)
DBI::dbDisconnect(db)


WinVector/rquery documentation built on Aug. 24, 2023, 11:12 a.m.