Despite the name, dplyrXdf can work with any RevoScaleR data source, not just Xdf files; its verbs can accept non-Xdf data sources as inputs. dplyrXdf handles such data sources in a number of ways:
1. File data sources, including delimited text (RxTextData), SAS datasets (RxSasData) and SPSS datasets (RxSpssData), are generally handled inline, i.e. they are read and processed much as an Xdf file would be (see the sketch after this list).
2. ODBC data sources, including RxOdbcData, RxSqlServerData and RxTeradata, usually represent tables in a SQL database. These are converted into a dplyr tbl, which is then processed in-database by dplyr (not dplyrXdf).
3. A Hive table (RxHiveData) in HDFS is turned into a sparklyr tbl and processed by sparklyr.
4. Other data sources are converted to Xdf format and then processed. The main difference from case 1 above is that the data is written to an Xdf file first, before being transformed; this is less efficient due to the extra I/O involved.
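As an illustration of case 1, here is a minimal sketch of a pipeline on a delimited text file. The file name and columns (`flights.csv`, with `dest`, `carrier` and `arr_delay`) are hypothetical:

```r
library(dplyrXdf)

# create a text data source pointing at the (hypothetical) CSV file
flightsCsv <- RxTextData("flights.csv")

# the file is read and processed inline, much like an Xdf file;
# the output of the pipeline is an Xdf tbl
delayCsv <- flightsCsv %>%
    filter(dest == "SEA") %>%
    group_by(carrier) %>%
    summarise(meanDelay = mean(arr_delay))
```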
Running a pipeline in-database requires that a suitable dplyr backend for the DBMS in question be available, and that you install a few additional packages, namely odbc and dbplyr (and their dependencies). There are backends for many popular commercial and open-source DBMSes, including SQL Server, PostgreSQL and Apache Hive; a Teradata backend is not yet available, but is in development at the time of writing (September 2017). For more information on how dplyr executes pipelines against database sources, see the databases vignette on the Tidyverse website.
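For example, an in-database pipeline on a SQL Server table might look like the following. This is only a sketch: the server, database and table names are hypothetical, and you will need a working ODBC connection.

```r
library(dplyrXdf)

# a hypothetical table of flights in a SQL Server database
flightsSql <- RxSqlServerData(
    table = "flights",
    connectionString = "driver=SQL Server;server=myserver;database=mydb;trusted_connection=yes"
)

# dplyrXdf converts this to a dplyr tbl; dplyr then translates the
# pipeline to SQL and executes it inside the database
delaySql <- flightsSql %>%
    group_by(carrier) %>%
    summarise(meanDelay = mean(arr_delay, na.rm = TRUE))
```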
Similarly, running a pipeline on a Hive data source with sparklyr requires that package to be installed. You must also be running on the edge node of a Spark cluster (not on a remote client, and not on a Hadoop cluster). For best results, it's recommended that you use rxSparkConnect(interop = "sparklyr") to set the compute context; otherwise, dplyrXdf will open a separate sparklyr connection via spark_connect(master = "yarn-client"), which may or may not be appropriate for your cluster. More information about sparklyr is available on the RStudio sparklyr website.
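Here is a sketch of a pipeline on a Hive table, to be run on the edge node of a Spark cluster; the table name is hypothetical:

```r
library(dplyrXdf)

# set the compute context, sharing the Spark session with sparklyr
cc <- rxSparkConnect(interop = "sparklyr")

# a hypothetical Hive table of flights
flightsHive <- RxHiveData(table = "flights")

# dplyrXdf converts this to a sparklyr tbl, and the pipeline is
# executed by sparklyr in Spark
delayHive <- flightsHive %>%
    group_by(carrier) %>%
    summarise(meanDelay = mean(arr_delay, na.rm = TRUE))
```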
While running a pipeline in-database or in-Spark can often be much more efficient than running the code locally, there are a few points to be aware of.
- For in-database pipelines, each pipeline opens a separate connection to the database, which remains open as long as any tbl objects related to the pipeline still exist. This is unlikely to cause problems in interactive use, but may do so if the code is reused in batch jobs, e.g. as part of a predictive web service.
- The set of supported verbs varies by backend. For example, factorise and do_xdf are designed for Xdf files, and will most likely fail inside a database.
- The Xdf-specific arguments .outFile and .rxArgs are not available in-database or in sparklyr. In particular, this means you cannot use a transformFunc to carry out arbitrary transformations on the data (as sketched below).
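To make the last point concrete, here is a sketch of the kind of Xdf-only pipeline that will not work against a database or Hive source; the file and variable names are hypothetical:

```r
library(dplyrXdf)

flightsXdf <- RxXdfData("flights.xdf")

# works for an Xdf file: .rxArgs passes a transformFunc that computes
# an arbitrary transformation on each block of rows
flightsXdf %>%
    mutate(.rxArgs = list(
        transformFunc = function(varlst) {
            varlst$speed <- varlst$distance / (varlst$air_time / 60)
            varlst
        }
    ))

# the same .rxArgs argument is not accepted when the input is a SQL
# or Hive table, since those pipelines are run by dplyr/sparklyr
```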