The first method is to send a SQL query directly to the database. The second is to use dplyr's functions to generate the SQL query for us. Both methods let R manipulate data on the database without actually pulling the data into R and filling up R's memory.
First, let's make a connection to the database.
library(tidyverse)
library(connectR)
src_connectR("PostgreSQL35W") -> post
db_send_query(post, "create table kaggle_test as (select a.name, a.style, a.abv, b.brewery_name, b.city from kaggle_beers a inner join kaggle_breweries b on a.brewery_id = b.brewery_id) with no data;")
db_send_query(post, "insert into kaggle_test select a.name, a.style, a.abv, b.brewery_name, b.city from kaggle_beers a inner join kaggle_breweries b on a.brewery_id = b.brewery_id where b.city = 'San Francisco';")
Let's view the data.
tbl(post,"kaggle_test")
db_send_query(post,"drop table kaggle_test")
tbl(post, "kaggle_beers") -> beers
tbl(post, "kaggle_breweries") -> breweries
inner_join(beers %>% select(name, style, abv, brewery_id),
           breweries %>% select(brewery_id, brewery_name, city) %>% filter(city == "San Francisco"),
           by = "brewery_id") %>%
  compute(name = "kaggle_test")
Let's view the data.
tbl(post,"kaggle_test")
db_drop_table(post,"kaggle_test")
We know the SQL query that was sent to the database in method 1, but what SQL query did dplyr generate? To answer this we can use the show_query function.
inner_join(beers %>% select(name, style, abv, brewery_id),
           breweries %>% select(brewery_id, brewery_name, city) %>% filter(city == "San Francisco"),
           by = "brewery_id") %>%
  show_query()
Both methods send SQL queries to the database. As suspected, the SQL queries are very similar: dplyr subsets each table before doing the join. However, this comes down to how the dplyr code has been constructed. If we alter the code so that the select and filter functions sit outside the inner_join, dplyr will generate different SQL.
inner_join(beers, breweries, by = "brewery_id") %>%
  select(name, style, abv, brewery_name, city) %>%
  filter(city == "San Francisco") %>%
  show_query()
As long as you write working dplyr code, dplyr is robust enough to generate an equivalent SQL variant. This makes dplyr a really useful toolkit within the R family.
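As a quick sanity check (assuming the `beers` and `breweries` lazy tables created above), we can collect both pipeline variants and confirm they return the same rows, even though the generated SQL differs:

```r
# Variant 1: subset each table before the join.
subset_first <- inner_join(
  beers %>% select(name, style, abv, brewery_id),
  breweries %>% select(brewery_id, brewery_name, city) %>% filter(city == "San Francisco"),
  by = "brewery_id"
) %>% collect()

# Variant 2: join the full tables, then select and filter.
filter_last <- inner_join(beers, breweries, by = "brewery_id") %>%
  select(name, style, abv, brewery_id, brewery_name, city) %>%
  filter(city == "San Francisco") %>%
  collect()

# Same rows regardless of column/row order.
setequal(subset_first, filter_last)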
We have used a lot of features within the connectR package; however, connectR also has some features hidden within dplyr functions. For example, connectR can create primary indexes/keys for new tables in databases. To do this, connectR utilises a function called db_create_primary, which creates an empty table (no data) based on a data frame within R. Let's see how this works within the dplyr function copy_to.
Let's start with a complicated dataset.
read_csv("D:/Program_Files/R_Code/Projects/Kaggle/1. Data/Pokemon.csv") -> Pokemon
names(Pokemon)
This data frame contains " " and "." within the variable names. These can cause issues when sending the data frame to a database; however, connectR has some built-in features to handle this. Let's see how connectR handles this situation.
# Split the data.frame into test and predict.
test <- sample_frac(Pokemon, 0.8)
predict <- setdiff(Pokemon, test)
copy_to(post, test, "testtable", indexes = list(c("type_1", "type_2")), primary = c("x", "name"))
It is as easy as that: connectR does the hard work of converting R data types into the relevant database types. connectR has created the primary keys "x" and "name", and an index on "type_1" and "type_2". It has also converted the "." and " " characters in the column names, so the user doesn't have to worry. What about inserting new data into an existing table? connectR also takes care of inserting into the database even when the column names haven't been converted to match the database version, for example:
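The exact conversion rules live inside connectR, but the idea can be sketched in base R (`db_safe_names` below is a hypothetical helper for illustration, not connectR's actual code):

```r
# Hypothetical sketch of making column names database-safe:
# replace runs of spaces and dots with underscores, then lower-case.
db_safe_names <- function(x) {
  tolower(gsub("[ .]+", "_", x))
}

db_safe_names(c("Sp. Atk", "Type 1"))  # "sp_atk" "type_1"
```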
names(predict)
tbl(post, "testtable") %>% collect %>% names
copy_to(post,predict,"testtable",append=T)
Finally, how do we know whether connectR has created a primary key and index on the table testtable?
tbl(post, sql("select t.relname as table_name, i.relname as index_name, a.attname as column_name, a.attnum, ix.indkey, t.relkind,
               case when substr(i.relname, strpos(i.relname, '_') + 1, 4) = 'pkey' then 'pkey' else 'index' end as type
               from pg_class t
               inner join pg_index ix on t.oid = ix.indrelid
               inner join pg_attribute a on a.attrelid = t.oid
               inner join pg_class i on ix.indexrelid = i.oid")) %>%
  filter(relkind == 'r', table_name == 'testtable') %>%
  select(table_name, column_name, type)
Here we use a combination of SQL and R code to query the database. We can see that the primary keys "x" and "name" have been created, and that "type_1" and "type_2" have been created as indexes. From this we can see connectR offers a lot of functionality for connecting R to databases.
Let's clear up the database.
db_drop_table(post,"testtable")
copy_loop_to
Finally, connectR allows tables to be transferred from one database to another. This is helpful when the two databases cannot communicate with each other. copy_loop_to transfers tables in batches so that it doesn't overload R's memory. copy_loop_to will only send data to an existing table on the target database, so we first use the db_create_primary function to create an empty table with primary keys or indexes, then use the copy_loop_to function for the transfer.
tbl(post, sql("select Account_Year_Week, Season, max(Cal_Date) as Week_Ending from cis.CMI_Calendar group by 1,2 limit 1")) %>% collect -> test
SQL<-"Select Account_Year_Week, Season, Max(Cal_Date) as Week_Ending from cis.CMI_Calendar group by 1,2"
db_create_primary(post, "testtable", test, c("Account_Year_Week","Week_Ending"))
copy_loop_to(conn1=post, conn2=post, name="testtable", n=100, statement=SQL)
NOTE: if conn2 is not provided, then copy_loop_to will send the table's data to the conn1 connection.
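The batching idea itself is straightforward. Here is a rough sketch of the pattern using plain DBI (an illustration only, not connectR's actual implementation; `copy_in_batches` is a hypothetical name):

```r
library(DBI)

# Hypothetical sketch of a batched transfer: fetch n rows at a time
# from the source and append them to the target table, so the full
# table is never held in R's memory at once.
copy_in_batches <- function(src, dest, name, statement, n = 100) {
  res <- dbSendQuery(src, statement)
  on.exit(dbClearResult(res))
  while (!dbHasCompleted(res)) {
    chunk <- dbFetch(res, n = n)
    if (nrow(chunk) > 0) {
      dbWriteTable(dest, name, chunk, append = TRUE)
    }
  }
}
```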
Clearing down the database.
db_drop_table(post,"testtable")