db: Simple database interface

View source: R/db.R

Description

A common db interface for DBI, RJDBC, RODBC and other custom-defined off-memory storage.

Usage

db(x, ..., key, .db.preprocess = getOption("dwtools.db.preprocess"),
  .db.postprocess = getOption("dwtools.db.postprocess"),
  .db.conns = getOption("dwtools.db.conns"),
  .db.dict = getOption("dwtools.db.dict"),
  .db.batch.action = getOption("dwtools.db.batch.action"),
  timing = getOption("dwtools.timing"),
  verbose = getOption("dwtools.verbose"))

Arguments

x

data.table (to save in db), or a character vector of table names, or a character vector of SQL commands.

key

character or integer; a character vector of column names, or an integer vector of column positions, to be passed to setkey on the result.

.db.preprocess

logical.

.db.postprocess

logical.

.db.conns

list of connections uniquely named.

.db.dict

data.table db interface dictionary.

.db.batch.action

character action name; use only when no argument recycling, no action detection and no timing are required.

timing

logical, measure timing; for vectorized input a timing is made for each query, see timing.

verbose

integer, if greater than 0 then print debugging messages.

...

if x is a data.table then ... expects character table names and character connection names; otherwise ... expects only character connection names.

Details

The function is designed to be slim and chainable inside data.table's `[` operator. It accepts vectorized input for all combinations of character arguments, see the Multiple tables/connections section.
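
For example, a read, an aggregation and a write can be chained in a single expression. A minimal sketch, assuming a "sales" table with geog_code and value columns already exists in the default connection (the "sales_agg" target name is illustrative); see the Examples section for a full walkthrough:

db("sales")[, list(value=sum(value)), keyby="geog_code" # read, then aggregate
            ][, db(.SD, "sales_agg")]                   # write the aggregate back to db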

Value

For write / read / get actions, the data.table object (possibly with some extra attributes). For the send action, the results of the sent query.

Multiple tables/connections

Table names, SQL commands and connection names can be character vectors, which allows processing against multiple connections and tables at once. A list of results will be returned, except for the write action, where a single data.table is always returned (for chaining). The list is named by the connection names, so if a connection name was recycled (e.g. db(c("my_tab1","my_tab2"))) there will be duplicated names in the resulting list.
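
For instance, a minimal sketch, assuming my_tab1 and my_tab2 exist in the default connection:

l = db(c("my_tab1","my_tab2")) # default connection name recycled over two table names
names(l)                       # duplicated names: the default connection name, twice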

Limitations

Table names must not contain spaces (even though some db vendors accept them).
SQL statements must contain spaces (this is how they are distinguished from table names), e.g. the sqlite .tables shortcut needs to be written as db("SELECT * FROM sqlite_master WHERE type='table'").
Additional table name limitations may apply per driver.
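
A short sketch of the distinction implied above, assuming my_tab1 exists in the default connection:

db("my_tab1")               # no space: treated as a table name, read action
db("SELECT * FROM my_tab1") # contains spaces: treated as SQL, get action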

Auto table name

If writing to db and the table name is missing or NULL then auto.table.name will be used; the auto-generated table name can be caught for later use with attr(DT,"tablename",TRUE), see the Note section.
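
For example, assuming DT is a data.table to be written (mirroring the Examples section below):

r = db(DT)                     # table name missing: auto-generated
tbl = attr(r,"tablename",TRUE) # catch the generated name for later use
db(tbl)                        # read the auto-named table back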

DB interface dictionary

If you read/write to a non-default schema you should use "my_schema1.my_tab1" table names; these will be translated to the format expected by the target db, e.g. for postgres: c("my_schema1","my_tab1").
SQL statements are of course not unified, but most of the syntax is already common across different dbs.
Preprocessing and postprocessing functions are available per defined db driver. These can be used for seamless integration in case writing to or reading from the db loses the classes of the data. This gives R the ability to act as a data hub and gain value as an ETL tool.
You can add new db interfaces by extending db_dict. Pull requests are welcome.
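
A sketch of schema-qualified names, assuming a configured PostgreSQL connection named "psql1" as in the commented setup in the Examples section:

db(DT, "my_schema1.my_tab1", "psql1") # translated for postgres to c("my_schema1","my_tab1")
db("my_schema1.my_tab1", "psql1")     # read back from the same schema-qualified name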

Note

For get and send actions any trailing semicolon ; will be removed from the query. The write action also sets the tablename attribute by reference on the data.table that was written to the database; because this happens by reference it alters the input data.table as well, overwriting any previous tablename attribute. You can always use newDT <- db(copy(DT),NULL) to keep both.
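
For example, a sketch assuming DT was written to db before and so already carries a tablename attribute:

newDT = db(copy(DT), NULL)   # write a copy under a new auto-generated name
attr(DT,"tablename",TRUE)    # unchanged on the original
attr(newDT,"tablename",TRUE) # the newly generated table name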

See Also

dbCopy, timing

Examples

suppressPackageStartupMessages(library(dwtools))

# Setup db connections --------------------------------------------------------------------

##### define your connections
# csv and SQLite works out of the box without configuration outside of R.
# examples are on three instances of sqlite and a csv.

library(RSQLite) # install.packages("RSQLite")
sqlite1 = list(drvName="SQLite",dbname="sqlite1.db")
sqlite1$conn = dbConnect(SQLite(), dbname=sqlite1$dbname)
sqlite2 = list(drvName="SQLite",dbname="sqlite2.db")
sqlite2$conn = dbConnect(SQLite(), dbname=sqlite2$dbname)
sqlite3 = list(drvName="SQLite",dbname="sqlite3.db")
sqlite3$conn = dbConnect(SQLite(), dbname=sqlite3$dbname)
csv1 = list(drvName = "csv")

# configure connections
options("dwtools.db.conns"=list(sqlite1=sqlite1,sqlite2=sqlite2,sqlite3=sqlite3,csv1=csv1))

## external dependencies required
# library(RPostgreSQL) # install.packages("RPostgreSQL")
# psql1 <- list(drvName="PostgreSQL", host="localhost", port="5432", dbname="dwtools", user="dwtools")
# psql1$conn <- dbConnect(PostgreSQL(), host=psql1$host, port=psql1$port, dbname=psql1$dbname, user=psql1$user, password="dwtools_pass")
# library(RMySQL) # install.packages("RMySQL")
# mysql1 = list(drvName="MySQL", host="localhost", port="3306", dbname="dwtools", user="dwtools")
# mysql1$conn <-dbConnect(MySQL(), host=mysql1$host, port=mysql1$port, dbname=mysql1$dbname, user=mysql1$user, password="dwtools_pass")
# library(RODBC) # install.packages("RODBC")
# odbc1 <- list(drvName="ODBC", user="dwtools", dbname="dwtools", dsn="mydsn")
# odbc1$conn <- odbcConnect(dsn=odbc1$dsn, uid=odbc1$user, pwd="dwtools_pass")

# Basic usage --------------------------------------------------------------------

(DT = dw.populate(1e5,scenario="fact")) # fact table

### write, aka INSERT + CREATE TABLE

db(DT,"my_tab1") # write to db, using default db connection (first in list)
db(DT,"my_tab2","sqlite2") # WRITE to my_tab_alt to sqlite2 connection
db(DT,"my_tab1","csv1") # WRITE to my_tab1.csv
r1 = db(DT) # write to auto named table in default db connection (first in list)
attr(r1,'tablename',TRUE) # auto generated table name # ?auto.table.name
r2 = db(DT,NULL,"sqlite2") # same as above but another connection; overwrites the attribute set for r1! read ?db Note
attr(r2,'tablename',TRUE)
l = db(DT,c("my_tab11","my_tab22"),c("sqlite1","sqlite2")) # save into different connections and different tables
attr(l,'tablename',TRUE)

### read, aka: SELECT * FROM 

db("my_tab1")
db("my_tab2","sqlite2")
db("my_tab1","csv1") # READ from my_tab1.csv
r1 = db("my_tab1","sqlite1",key=c("prod_code","cust_code","geog_code","time_code")) # set key on result, useful on chaining, see 'Chaining data.table' examples below
key(r1)
db(DT, "my_tab2") # CREATE TABLE just for below line example
l = db("my_tab2", c("sqlite1","sqlite2")) # read my_tab2 table from two connections, return list
str(l)
l = db(c("my_tab11","my_tab22"), c("sqlite1","sqlite2")) # read my_tab1 and my_tab2 table from two connections, return list
str(l)

### get, aka: SELECT ... FROM ... JOIN ...

db("SELECT * FROM my_tab1")
r = db("SELECT * FROM my_tab2","sqlite2",key=c("prod_code","cust_code","geog_code","time_code"))
key(r)
l = db(c("SELECT * FROM my_tab1","SELECT * FROM my_tab2"),c("sqlite1","sqlite2"))
str(l)

### send, aka: UPDATE, INDEX, DROP, etc.

db(c("CREATE INDEX idx_my_tab1a ON my_tab1 (prod_code, geog_code)","CREATE INDEX idx_my_tab1b ON my_tab1 (cust_code, time_code)")) # create two indices
db(c("DROP INDEX idx_my_tab1a","DROP INDEX idx_my_tab1b")) # drop two indices
db("DROP TABLE my_tab2") # drop the table which we created in above example #CREATE TABLE
db(c("DROP TABLE my_tab1","DROP TABLE my_tab2"),c("sqlite1","sqlite2")) # multiple statements into multiple connections

# Advanced usage ------------------------------------------------------

options("dwtools.verbose"=1L)
db.conns.names = c("sqlite1","sqlite2","sqlite3")

### easy sql scripting: DROP ALL TABLES IN ALL DBs
(DT = dw.populate(1e5,scenario="fact")) # fact table

# populate 2 tables in sqlite3 while chaining, auto table names
DT[,db(.SD,NULL,c("sqlite3","sqlite3"))]

# populate 2 tables in each connection, then 1 table in each connection, 9 tables created
DT[,db(.SD,NULL,rep(db.conns.names,2))][,db(.SD,NULL,db.conns.names)]

# query all tables on all connections
(tbls = db("SELECT name FROM sqlite_master WHERE type='table'",db.conns.names))

# drop all tables on all connections
ll = lapply(seq_along(tbls), function(i, tbls){
  if(nrow(tbls[[i]]) > 0) data.table(conn_name = names(tbls[i]), tbls[[i]])
  else data.table(conn_name = character(), tbls[[i]])
}, tbls)
r = rbindlist(ll)[,list(sql=paste0("DROP TABLE ",name), conn_name=conn_name, name=name) # build statement
                  ][,list(conn_name=conn_name, name=name, res=db(sql,conn_name)) # exec DROP TABLE ...
                    ]
# verify tables dropped
db("SELECT name FROM sqlite_master WHERE type='table'",db.conns.names)

### Chaining data.table: DT[...][...]

# populate star schema to db
X = dw.populate(1e5,scenario="star") # list of 5 tables, 1 fact table and 4 dimensions
db(X$TIME,"time") # save time to db
db(X$GEOGRAPHY,"geography") # save geography to db
db(X$SALES,"sales") # save sales FACT to db

# data.table join in R directly on external SQL database
db("geography",key="geog_code")[db("sales",key="geog_code")] # geography[sales]

options("dwtools.timing"=TRUE) # turn on db auditing
## Chaining including multiple reads and multiple writes directly on a SQL database
# 0. predefine aggregate function for later use
# 1. query sales fact table from db
# 2. aggregate to 2 dimensions
# 3. save current state of data to db
# 4. query geography dimension table from db
# 5. sales left join geography dimension
# 6. aggregate to higher geography entity
# 7. save current state of data to db
# 8. query time dimension table from db
# 9. sales left join time dimension
# 10. aggregate to higher time entity
# 11. save current state of data to db
jj_aggr = quote(list(amount=sum(amount), value=sum(value)))
r <- db("sales",key="geog_code" # read fact table from db
        )[,eval(jj_aggr),keyby=c("geog_code","time_code") # aggr by geog_code and time_code
          ][,db(.SD) # write to db, auto.table.name
            ][,db("geography",key="geog_code" # read lookup geography dim from db
                  )[.SD # left join geography
                    ][,eval(jj_aggr), keyby=c("time_code","geog_region_name")] # aggr
              ][,db(.SD) # write to db, auto.table.name
                ][,db("time",key="time_code" # read lookup time dim from db
                      )[.SD # left join time
                        ][, eval(jj_aggr), keyby=c("geog_region_name","time_month_code","time_month_name")] # aggr
                  ][,db(.SD) # write to db, auto.table.name
                    ]
db("SELECT name FROM sqlite_master WHERE type='table'")
get.timing()

## Interesting to consider:
# how much effort would such a 'query' require if developed in (leading commercial) ETL tools?
# can classic ETL tools even compete with data.table transformation performance and DBI loading/writing performance?
# free 'express' editions of ETL tools have processing row limits, so they cannot be fairly benchmarked.

### Copy tables

# dbCopy multiple tables from source to target # ?dbCopy
dbCopy(
  c("sales","geography","time"),"sqlite1", # source
  c("sales","geography","time"),"sqlite2"  # target
)
(tbls = db("SELECT name FROM sqlite_master WHERE type='table'","sqlite2")) # sqlite2 check
get.timing()
options("dwtools.timing"=FALSE)
purge.timing()

# Disconnecting and cleaning workspace ------------------------------------------------------

db.conns.names = c("sqlite1","sqlite2","sqlite3")
sapply(getOption("dwtools.db.conns")[names(getOption("dwtools.db.conns")) %in% db.conns.names],
       function(x) dbDisconnect(x[["conn"]])) # close SQLite connections
sapply(getOption("dwtools.db.conns")[names(getOption("dwtools.db.conns")) %in% db.conns.names],
       function(x) file.remove(x[["dbname"]])) # remove SQLite db files
options("dwtools.db.conns"=NULL) # reset dwtools.db.conns option
sapply(paste(c("my_tab1"),"csv",sep="."), function(x) file.remove(x)) # remove csv tables
