Lazy methods to send queries to a database

The first method is to send an SQL query directly to the database. The second is to use dplyr's functions to generate the SQL query. Both methods allow R to manipulate data on the database without actually pulling the data into R and filling up R's memory.

First, let's make a connection to the database.

library(tidyverse)
library(connectR)

# connect to the database through the "PostgreSQL35W" DSN
src_connectR("PostgreSQL35W")->post
db_send_query(post,"create table kaggle_test as
                        (Select
                        a.name,
                        a.style,
                        a.abv,
                        b.brewery_name,
                        b.city
                        from kaggle_beers a
                        inner join kaggle_breweries b
                        on a.brewery_id = b.brewery_id) with no data;")
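Before inserting anything, we can sanity-check that the new table really is empty; a quick illustration using dplyr's tally:

tbl(post,"kaggle_test") %>% tally    # should report 0 rows before the insert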

db_send_query(post,"insert into kaggle_test
                        Select
                        a.name,
                        a.style,
                        a.abv,
                        b.brewery_name,
                        b.city
                        from kaggle_beers a
                        inner join kaggle_breweries b
                        on a.brewery_id = b.brewery_id
                        where b.city = 'San Francisco';")

Let's view the data.

tbl(post,"kaggle_test") 
db_send_query(post,"drop table kaggle_test")
tbl(post,"kaggle_beers")->beers
tbl(post,"kaggle_breweries")->breweries
inner_join(beers %>% 
             select(name,style,abv,brewery_id),
           breweries %>% select(brewery_id,brewery_name,city) %>%
             filter(city=="San Francisco"),
           by=c("brewery_id","brewery_id")) %>% 
  compute(name = "kaggle_test")
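compute materialises the query as a new table on the database. For contrast, here is a minimal sketch of the alternative: swapping compute for dplyr's collect would instead pull the joined rows into a local tibble, which is exactly what we are trying to avoid for large tables.

# For contrast only: collect brings the result into R's memory
inner_join(beers %>% 
             select(name,style,abv,brewery_id),
           breweries %>% select(brewery_id,brewery_name,city) %>%
             filter(city=="San Francisco"),
           by="brewery_id") %>% 
  collect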

Let's view the data.

tbl(post,"kaggle_test") 
db_drop_table(post,"kaggle_test")

Deeper look into dplyr

We know the SQL query that was sent to the database in method 1, but what SQL query did dplyr generate? To answer this, we can use the show_query function.

inner_join(beers %>% 
             select(name,style,abv,brewery_id),
           breweries %>% select(brewery_id,brewery_name,city) %>%
             filter(city=="San Francisco"),
           by="brewery_id") %>% show_query

Both methods send SQL queries to the database. As suspected, the two queries are very similar: dplyr subsets each table before doing the join. However, this comes down to how the R dplyr code has been constructed. If we alter the code so that the select and filter functions sit outside the inner_join, dplyr will generate different SQL.

inner_join(beers,breweries,
           by="brewery_id") %>%
  select(name,style,abv,brewery_name,city) %>% 
  filter(city=="San Francisco") %>% show_query

As long as you write working dplyr code, dplyr is robust enough to generate an equivalent SQL variant. This makes dplyr a really useful toolkit within the R family.
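As an aside, dbplyr (the translation layer that ships with dplyr) also exposes translate_sql, which shows how individual R expressions map to SQL. A small illustration follows; the exact quoting varies by database back end, so treat the output as indicative.

library(dbplyr)
# how single R expressions translate to SQL (generic DBI translation)
translate_sql(city == "San Francisco", con = simulate_dbi())
translate_sql(abv > 0.05 & city == "San Francisco", con = simulate_dbi())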

How to create a table with primary keys/indexes: 'copy_to'

We have used a lot of features within the connectR package; however, connectR also has some hidden features within dplyr functions. For example, connectR can create primary keys/indexes for new tables in databases. To do this, connectR utilises a function called db_create_primary, which creates an empty table (containing no data) in the database based on a data frame within R. Let's see how this works within dplyr's copy_to function.
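As a rough sketch, db_create_primary can also be called directly. The argument order below (connection, table name, template data frame, primary-key columns) is inferred from the copy_loop_to example later in this document; treat it as an assumption rather than a definitive signature.

# Sketch only: an empty table shaped like mtcars with "cyl" as its primary key
# (argument order inferred from the db_create_primary call further down)
db_create_primary(post, "mtcars_empty", mtcars, "cyl")
db_drop_table(post, "mtcars_empty")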

Let's start with a complicated dataset.

read_csv("D:/Program_Files/R_Code/Projects/Kaggle/1. Data/Pokemon.csv")->Pokemon
names(Pokemon)

This data.frame contains " " and "." within the variable names. This can cause issues when sending the data frame to a database. However, connectR has built-in features to handle this. Let's see how connectR handles this situation.
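For illustration only (this is not connectR's internal code), the clean-up amounts to something like lower-casing the names and swapping "." and " " for underscores:

# Illustration of the idea, not connectR's actual implementation:
# "Type 1" becomes "type_1", "Sp. Atk" becomes "sp_atk", and so on
# (columns such as "#" need extra handling to end up as "x")
tolower(gsub("[. ]+", "_", names(Pokemon)))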

# split the data.frame into test and predict sets
test<-sample_frac(Pokemon,0.8)
predict<-setdiff(Pokemon,test)

# copy test to the database, with a composite index on type_1/type_2
# and a composite primary key on x/name
copy_to(post,test,"testtable",indexes=list(c("type_1","type_2")),primary=c("x","name"))

It is as easy as that: connectR does the hard work of converting R data types into the relevant database types. connectR has created the primary keys "x" and "name", and the index on "type_1" and "type_2". It has also converted the "." and " " characters so that the user doesn't have to worry about them. What about inserting new data into an existing table? connectR also takes care of inserting into the database even when the variable names haven't been converted to match the database versions, for example:

names(predict)
tbl(post,"testtable") %>% collect %>% names
# append = TRUE inserts into the existing table instead of recreating it
copy_to(post,predict,"testtable",append=TRUE)
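A quick sanity check that the append worked: the row count on the database should now match the original data frame.

tbl(post,"testtable") %>% tally    # expect the same count as nrow(Pokemon)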

Finally, how do we know that connectR has created the primary key and index on the table testtable?

tbl(post,sql("select 
                t.relname as table_name,
                i.relname as index_name,
                a.attname as column_name,
                a.attnum,
                ix.indkey,
                t.relkind,
                case when substr(i.relname,strpos(i.relname,'_')+1,4) = 'pkey' then 'pkey' else 'index' end as type
              from 
                pg_class t
              inner join 
                pg_index ix
                on t.oid = ix.indrelid
              inner join
                pg_attribute a
                on a.attrelid = t.oid
              inner join 
                pg_class i
                on ix.indexrelid = i.oid")) %>% 
  filter(relkind == 'r',
         table_name == 'testtable') %>% 
  select(table_name, column_name, type)

Here we use a combination of SQL and R code to query the database. We can see that the primary keys "x" and "name" have been created, and that "type_1" and "type_2" have been created as indexes. From this we can see that connectR offers a lot of functionality for connecting R to databases.

Let's clear up the database.

db_drop_table(post,"testtable")

How to transfer tables from one database to another: copy_loop_to

Finally, connectR allows tables to be transferred from one database to another. This is helpful when two databases do not communicate with one another. copy_loop_to transfers tables in batches so that it doesn't overload R's memory. copy_loop_to will only send data to an existing table on the target database, so we first use the db_create_primary function to create an empty table with primary keys or indexes, and then use copy_loop_to for the transfer.

tbl(post,sql("Select top 1
                Account_Year_Week, 
                Season, 
                Max(Cal_Date) as Week_Ending 
              from cis.CMI_Calendar 
              group by 1,2")) %>% collect->test
SQL<-"Select    
        Account_Year_Week, 
        Season, 
        Max(Cal_Date) as Week_Ending 
      from cis.CMI_Calendar 
      group by 1,2"
db_create_primary(post, "testtable", test, c("Account_Year_Week","Week_Ending"))
copy_loop_to(conn1=post, conn2=post, name="testtable", n=100, statement=SQL)

NOTE: If conn2 is not provided then copy_loop_to will send the table's data to the conn1 connection.
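So an equivalent single-connection call, based on that note, would look like this:

# Same transfer, relying on conn2 defaulting to the conn1 connection
copy_loop_to(conn1=post, name="testtable", n=100, statement=SQL)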

Clearing down the database.

db_drop_table(post,"testtable")

