Despite the name, dplyrXdf can work with any RevoScaleR data source, not just Xdf files; its verbs can accept non-Xdf data sources as inputs. dplyrXdf handles such data sources in a number of ways:
1. File data sources, including delimited text (RxTextData), SAS datasets (RxSasData) and SPSS datasets (RxSpssData), are generally handled inline, i.e. they are read and processed much as an Xdf file would be (see the sketch after this list).
2. ODBC data sources, including RxOdbcData, RxSqlServerData and RxTeradata, usually represent tables in a SQL database. These are converted into a dplyr tbl, which is then processed in-database by dplyr (not dplyrXdf).
3. A Hive table (RxHiveData) in HDFS is turned into a sparklyr tbl and processed by sparklyr.
4. Other data sources are converted to Xdf format and then processed. The main difference from case 1 above is that the data is written to an Xdf file first, before being transformed; this is less efficient due to the extra I/O involved.
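As an illustration of case 1, here is a minimal sketch of a pipeline on a delimited text file. The file name and columns (`flights.csv`, with `dest`, `carrier` and `arr_delay`) are hypothetical:

```r
library(dplyrXdf)

# create a text data source pointing at the (hypothetical) CSV file
flightsCsv <- RxTextData("flights.csv")

# the file is read and processed inline, much like an Xdf file;
# the output of the pipeline is an Xdf tbl
delayCsv <- flightsCsv %>%
    filter(dest == "SEA") %>%
    group_by(carrier) %>%
    summarise(meanDelay = mean(arr_delay))
```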
Running a pipeline in-database requires that a suitable dplyr backend for the DBMS in question be available, and that you install a few additional packages, namely odbc and dbplyr (and their dependencies). There are backends for many popular commercial and open-source DBMSes, including SQL Server, PostgreSQL and Apache Hive; a Teradata backend is not yet available, but is in development at the time of writing (September 2017). For more information on how dplyr executes pipelines against database sources, see the databases vignette on the Tidyverse website.
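For example, an in-database pipeline on a SQL Server table might look like the following. This is only a sketch: the server, database and table names are hypothetical, and you will need a working ODBC connection.

```r
library(dplyrXdf)

# a hypothetical table of flights in a SQL Server database
flightsSql <- RxSqlServerData(
    table = "flights",
    connectionString = "driver=SQL Server;server=myserver;database=mydb;trusted_connection=yes"
)

# dplyrXdf converts this to a dplyr tbl; dplyr then translates the
# pipeline to SQL and executes it inside the database
delaySql <- flightsSql %>%
    group_by(carrier) %>%
    summarise(meanDelay = mean(arr_delay, na.rm = TRUE))
```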
Similarly, running a pipeline on a Hive data source with sparklyr requires that package to be installed. You must also be running on the edge node of a Spark cluster (not on a remote client, and not on a Hadoop cluster). For best results, it's recommended that you use rxSparkConnect(interop = "sparklyr") to set the compute context; otherwise, dplyrXdf will open a separate sparklyr connection via spark_connect(master = "yarn-client"), which may or may not be appropriate for your cluster. More information about sparklyr is available on the RStudio sparklyr website.
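Here is a sketch of a pipeline on a Hive table, to be run on the edge node of a Spark cluster; the table name is hypothetical:

```r
library(dplyrXdf)

# set the compute context, sharing the Spark session with sparklyr
cc <- rxSparkConnect(interop = "sparklyr")

# a hypothetical Hive table of flights
flightsHive <- RxHiveData(table = "flights")

# dplyrXdf converts this to a sparklyr tbl, and the pipeline is
# executed by sparklyr in Spark
delayHive <- flightsHive %>%
    group_by(carrier) %>%
    summarise(meanDelay = mean(arr_delay, na.rm = TRUE))
```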
While running a pipeline in-database or in-Spark can often be much more efficient than running the code locally, there are a few points to be aware of.
- For in-database pipelines, each pipeline opens a separate connection to the database, which remains open as long as any tbl objects related to the pipeline still exist. This is unlikely to cause problems in interactive use, but may do so if the code is reused in batch jobs, e.g. as part of a predictive web service.
- The set of supported verbs varies by backend. For example, factorise and do_xdf are designed for Xdf files, and will most likely fail inside a database.
- The Xdf-specific arguments .outFile and .rxArgs are not available in-database or in sparklyr. In particular, this means you cannot use a transformFunc to carry out arbitrary transformations on the data (as sketched below).
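To make the last point concrete, here is a sketch of the kind of Xdf-only pipeline that will not work against a database or Hive source; the file and variable names are hypothetical:

```r
library(dplyrXdf)

flightsXdf <- RxXdfData("flights.xdf")

# works for an Xdf file: .rxArgs passes a transformFunc that computes
# an arbitrary transformation on each block of rows
flightsXdf %>%
    mutate(.rxArgs = list(
        transformFunc = function(varlst) {
            varlst$speed <- varlst$distance / (varlst$air_time / 60)
            varlst
        }
    ))

# the same .rxArgs argument is not accepted when the input is a SQL
# or Hive table, since those pipelines are run by dplyr/sparklyr
```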