This package and the accompanying scripts test load times into MS SQL Server 17 (SS) from R using three methods:

1. RODBC
2. odbc
3. sp_execute_external_script (in-database R)

This package is really for number 3. The rest can be done from a script, although the package is a useful place to keep some functions.
The code for this post runs on my Windows 10 laptop, where I have a local SQL Server 17 instance running with a database called `ml`.
We're going to push a big data frame into SQL Server using each of the three methods. The data set is a simple, randomly generated data frame.
```r
library(dbloadss)

random_data_set(n_rows = 5, n_cols = 5)
```
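For readers without the package installed, here is a minimal sketch of what a generator like `random_data_set()` might look like. This is an assumption, not the `dbloadss` source; the `COL_n` naming matches the result set used later in this README.

```r
# Hypothetical sketch of a random-data generator; not the dbloadss source.
random_data_set <- function(n_rows, n_cols) {
  data <- as.data.frame(matrix(runif(n_rows * n_cols), ncol = n_cols))
  names(data) <- paste0("COL_", seq_len(n_cols))  # COL_1, COL_2, ...
  data
}
```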
R is pretty quick at this sort of thing, so we don't really need to worry about how long it takes to make a big data frame.
```r
n_rows <- 3000000

system.time({
  random_data <- random_data_set(n_rows = n_rows, n_cols = 5)
})
```
But what we're interested in is how fast we can push this to SQL Server.
## RODBC

`RODBC` was, for a long time, the standard way to connect to SQL Server from R. It's a great package that makes it easy to send queries and collect results, and it handles type conversions pretty well. However, it is a bit slow for pushing data in. The fastest I could manage was using the `sqlSave` function with the safeties off. I'd be very interested to hear if there's a better method. For 3 million rows it's a no-go, so scaling back to 30,000 rows we get:
```r
library(RODBC)

db <- odbcDriverConnect('driver={SQL Server};server=localhost\\SQL17ML;database=ml;trusted_connection=true')

n_rodbc <- 30000
odbcQuery(db, "drop table randData;")

time30k <- system.time({
  RODBC::sqlSave(
    db,
    dat       = random_data[1:n_rodbc, ],
    tablename = "randData",
    rownames  = FALSE,
    fast      = TRUE,
    safer     = FALSE
  )
})

odbcClose(db)
time30k
```
Over 2 minutes! The scaling has been roughly linear for me, so the total write time for 3 million rows would be a few hours.
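As a back-of-the-envelope check (illustrative only; timings will vary by machine), scaling the 30,000-row timing linearly:

```r
# Linear extrapolation from the 30k-row timing above (illustrative only).
secs_30k <- 120                     # "over 2 minutes", so a lower bound
secs_3m  <- secs_30k * (3e6 / 3e4)  # 100x the rows
secs_3m / 3600                      # at least ~3.3 hours
```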
## odbc

`odbc` is a relatively new package from RStudio which provides a DBI-compliant ODBC interface. It is considerably faster for writes. Here we'll push the full 3 million rows.
```r
library(DBI)

dbi <- dbConnect(
  odbc::odbc(),
  driver   = "SQL Server",
  server   = "localhost\\SQL17ML",
  database = "ml"
)

time3modbc <- system.time({
  dbWriteTable(dbi, name = "randData", value = random_data, overwrite = TRUE)
})

dbDisconnect(dbi)
time3modbc
```
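If you want to confirm the rows arrived, a quick sanity check is possible. This is a sketch; it reconnects because the timing block above closes the connection.

```r
# Hypothetical sanity check; reconnect since the connection above is closed.
dbi <- dbConnect(odbc::odbc(), driver = "SQL Server",
                 server = "localhost\\SQL17ML", database = "ml")
dbGetQuery(dbi, "SELECT COUNT(*) AS n FROM randData")
dbDisconnect(dbi)
```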
## sp_execute_external_script

An alternative approach is to use the new features in SQL Server 17 (and 16) for calling out to R scripts from SQL. This is done via the `sp_execute_external_script` command, which we will wrap in a stored procedure. This is what that looks like for me:
```sql
use [ml];

DROP PROC IF EXISTS generate_random_data;
GO

CREATE PROC generate_random_data(@nrow int)
AS
BEGIN
  EXEC sp_execute_external_script
      @language = N'R'
    , @script = N'
        library(dbloadss)
        random_data <- random_data_set(n_rows = nrow_r, n_cols = 5)
      '
    , @output_data_1_name = N'random_data'
    , @params = N'@nrow_r int'
    , @nrow_r = @nrow
  WITH RESULT SETS ((
    "COL_1" float not null,
    "COL_2" float not null,
    "COL_3" float not null,
    "COL_4" float not null,
    "COL_5" float not null
  ));
END;
GO
```
We then call the stored procedure with another query (skipping a step that clears the table between tests).
```sql
INSERT INTO randData
EXEC [dbo].[generate_random_data] @nrow = 3000000
```
This runs in 34 seconds. My best guess for the performance increase is that the data is serialised more efficiently between SQL Server and the R runtime. More to investigate.
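The same call can also be issued from R over DBI rather than from SSMS; a minimal sketch, reusing the connection settings from earlier in this README:

```r
library(DBI)

# Reconnect with the same settings as the odbc example above.
dbi <- dbConnect(odbc::odbc(), driver = "SQL Server",
                 server = "localhost\\SQL17ML", database = "ml")

# Time the in-database generation and insert, as before.
system.time({
  dbExecute(dbi, "INSERT INTO randData EXEC [dbo].[generate_random_data] @nrow = 3000000")
})

dbDisconnect(dbi)
```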
MIT © Mango Solutions