compute: Download a dataset to the local machine


Description

Download a dataset to the local machine

Usage

## S3 method for class 'RxXdfData'
collect(x, as_data_frame = TRUE, ...)

## S3 method for class 'RxXdfData'
compute(x, as_data_frame = !in_hdfs(x), ...)

## S3 method for class 'RxDataSource'
compute(x, name = NULL, ...)

## S3 method for class 'RxDataSource'
collect(x, ...)

Arguments

x

An Xdf data source object.

as_data_frame

For the RxXdfData methods: should the downloaded data be converted to a data frame, or left as an Xdf file?

...

If the output is to be a data frame, further arguments to the as.data.frame method.

name

For the RxDataSource methods: the name of the Xdf file to create. Defaults to a temporary filename in the dplyrXdf working directory.

Details

RevoScaleR does not have an exact analogue of the dplyr concept of a src, and because of this, the dplyrXdf implementations of collect and compute are somewhat different. In dplyrXdf, these functions serve two related, overlapping purposes:

The code handles both the case where you are logged into the edge node of a Hadoop/Spark cluster and the case where you are a remote client. In the latter case, downloading is a two-stage process: the data is first transferred from HDFS to the native filesystem of the edge node, and then downloaded from the edge node to the client.

If you want to look at the first few rows of a small Xdf file in HDFS, it may be faster to use compute to copy the entire file to the native filesystem and then run head on the copy, rather than running head on the original. This is due to RevoScaleR overhead in Spark and Hadoop.
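The tip above can be sketched as follows. This is only an illustration, assuming a session with access to a Hadoop/Spark cluster; it reuses copy_to_hdfs from the Examples, and the object names mtc and mtc_local are placeholders.

```r
## Not run: 
library(dplyrXdf)

# assumes an HDFS filesystem is reachable from this session
mtc <- copy_to_hdfs(mtcars)

# direct approach: head runs against the Xdf file in HDFS
head(mtc)

# alternative: copy the whole file to the native filesystem first;
# for input in HDFS, as_data_frame defaults to FALSE (see Usage),
# so the result is left as an Xdf file rather than a data frame
mtc_local <- compute(mtc)
head(mtc_local)

## End(Not run)
```

Whether the second approach is actually faster depends on the file size and the cluster, so it is worth timing both on your own data.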

Value

For the RxDataSource methods, collect returns a data frame, and compute returns a tbl_xdf data source. For the RxXdfData methods, the result is either a data frame or a tbl_xdf, depending on the as_data_frame argument.
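As a hedged illustration of how as_data_frame selects the return type for the RxXdfData methods, the sketch below uses an Xdf file in the native filesystem. The names df and xdf are placeholders, and the behaviour noted in the comments follows the Usage and Value sections rather than a tested run.

```r
library(dplyrXdf)

mtx <- as_xdf(mtcars, overwrite=TRUE)

# for input in the native filesystem, as_data_frame defaults to TRUE
df <- compute(mtx)

# ask for the result to be left as an Xdf file instead
xdf <- compute(mtx, as_data_frame=FALSE)

class(df)
class(xdf)
```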

See Also

as_xdf, as_data_frame, copy_to, compute in package dplyr

Examples

mtx <- as_xdf(mtcars, overwrite=TRUE)

# all of these return a data frame (or a tbl_df) for input in the native filesystem
as.data.frame(mtx)
as_data_frame(mtx)  # returns a tbl_df
collect(mtx)
compute(mtx)

# collect and compute are meant for downloading data from remote backends
## Not run: 
# downloading from a database
connStr <- "SERVER=hostname;DATABASE=RevoTestDB;TRUSTED_CONNECTION=yes"
mtdb <- RxSqlServerData("mtcars", connectionString=connStr)
copy_to(mtdb, mtcars)
as.data.frame(mtdb)
collect(mtdb)  # returns a data frame
compute(mtdb)  # returns a tbl_xdf

# downloading from HDFS
mtc <- copy_to_hdfs(mtcars)
as.data.frame(mtc)
collect(mtc)  # returns a data frame
compute(mtc)  # returns a tbl_xdf

## End(Not run)

RevolutionAnalytics/dplyrXdf documentation built on June 3, 2019, 9:08 p.m.