As part of our consulting practice, Win-Vector LLC has been helping a few clients stand up advanced analytics and machine learning stacks using R and substantial data stores (relational databases such as PostgreSQL, or big data systems such as Spark). Often we come to a point where we or a partner realize: "the design would be a whole lot easier if we could phrase it in terms of higher-order data operators."
The R package DBI gives us direct access to SQL, and the package dplyr gives us a transform grammar that can either be executed in memory or translated into SQL. But, as we point out in the replyr README: moving from in-memory R to large data systems is always a bit of a shock, as you lose a lot of your higher-order data operators or transformations.
To repeat: if you are an R user used to one of dplyr::bind_rows(), base::split(), tidyr::spread(), or tidyr::gather(), you will find these functions do not work on remote data sources, but they have replacement implementations in the replyr package.
For example:
```r
library("RPostgreSQL")
suppressPackageStartupMessages(library("dplyr"))

isSpark <- FALSE

# Can work with PostgreSQL
my_db <- DBI::dbConnect(dbDriver("PostgreSQL"),
                        host = 'localhost',
                        port = 5432,
                        user = 'postgres',
                        password = 'pg')

# # Can work with Sparklyr
# my_db <- sparklyr::spark_connect(version='2.2.0',
#                                  master = "local")
# isSpark <- TRUE

d <- dplyr::copy_to(my_db,
                    data.frame(x = c(1, 5),
                               group = c('g1', 'g2'),
                               stringsAsFactors = FALSE),
                    'd')
print(d)

# show dplyr::bind_rows() fails.
dplyr::bind_rows(list(d, d))
```
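For contrast, the same row-binding succeeds without trouble on local, in-memory data. A minimal sketch using only base R (`d_local` is a hypothetical local copy of the data above, not part of the original example):

```r
# Hypothetical local copy of the example data
d_local <- data.frame(x = c(1, 5),
                      group = c('g1', 'g2'),
                      stringsAsFactors = FALSE)

# In memory, stacking rows is easy: base::rbind (or dplyr::bind_rows)
dB_local <- rbind(d_local, d_local)
print(dB_local)  # 4 rows: the two original rows, repeated
```

It is only when `d` is a remote tbl (a handle to a database or Spark table) that the in-memory verbs stop applying.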
The replyr package supplies R-accessible implementations of these missing operators for large data systems such as PostgreSQL and Spark.
For example:
```r
# using the development version of replyr https://github.com/WinVector/replyr
library("replyr")
packageVersion("replyr")

# binding rows
dB <- replyr_bind_rows(list(d, d))
print(dB)

# splitting frames
replyr_split(dB, 'group')

# pivoting
pivotControl <- buildPivotControlTableN('d',
                                        columnToTakeKeysFrom = 'group',
                                        columnToTakeValuesFrom = 'x',
                                        sep = '_',
                                        my_db = my_db)
dWname <- moveValuesToColumnsN(keyColumns = NULL,
                               controlTable = pivotControl,
                               tallTable = 'd',
                               my_db = my_db,
                               strict = FALSE)
dW <- dplyr::tbl(my_db, dWname)
print(dW)

# un-pivoting
unpivotControl <- buildUnPivotControlTable(nameForNewKeyColumn = 'group',
                                           nameForNewValueColumn = 'x',
                                           columnsToTakeFrom = colnames(dW))
dXname <- moveValuesToRowsN(controlTable = unpivotControl,
                            wideTable = dWname,
                            my_db = my_db)
dX <- dplyr::tbl(my_db, dXname)
print(dX)
```
The point is: using the replyr package you can design in terms of higher-order data transforms, even when working with big data in R. Designs in terms of these operators tend to be succinct, powerful, performant, and maintainable.
To master the terms moveValuesToRows and moveValuesToColumns, I suggest trying the following two articles:
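As a mental model, the in-memory tidyr verbs are a useful analogy (a sketch assuming tidyr is installed; this is not replyr's remote implementation, which works on database and Spark tables): moveValuesToColumns plays the role of tidyr::spread() (tall to wide), and moveValuesToRows the role of tidyr::gather() (wide back to tall).

```r
library("tidyr")

# Hypothetical local copy of the example data
d_local <- data.frame(x = c(1, 5),
                      group = c('g1', 'g2'),
                      stringsAsFactors = FALSE)

# moveValuesToColumns ~ spread(): move values into new columns (tall -> wide)
dW_local <- spread(d_local, key = group, value = x)
print(dW_local)

# moveValuesToRows ~ gather(): move column values back into rows (wide -> tall)
dX_local <- gather(dW_local, key = group, value = x, g1, g2)
print(dX_local)
```

Note the column naming differs slightly from the replyr example above (which used `sep = '_'` to build names like `group_g1`); the tall/wide round trip is the idea to take away.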
```r
if(isSpark) {
  status <- sparklyr::spark_disconnect(my_db)
} else {
  status <- DBI::dbDisconnect(my_db)
}
my_db <- NULL
```