copy_to: Upload a dataset to a remote backend

Description

Upload a dataset to a remote backend

Usage

## S3 method for class 'RxDataSource'
copy_to(dest, df, ...)

## S3 method for class 'RxHdfsFileSystem'
copy_to(dest, df, name = NULL, ...)

copy_to_hdfs(..., host = hdfs_host(), port = rxGetOption("hdfsPort"))

Arguments

dest

The destination source: either a RevoScaleR data source object, or a filesystem object of class RxHdfsFileSystem.

df

A dataset. For the RxDataSource method, this can be any RevoScaleR data source object, presumably of a different class to the destination. For the RxHdfsFileSystem method, this can be the filename of an Xdf file, a RevoScaleR data source, or anything that can be coerced to a data frame.

...

Further arguments to lower-level functions; see below.

name

The filename, optionally including the path, for the uploaded Xdf file. The default upload location is the user's home directory (/user/<username>) in the filesystem pointed to by dest. Not used for the RxDataSource method.

host, port

The HDFS hostname and port number to connect to. You should only need to set these if you have an attached Azure Data Lake Store that you are accessing via HDFS.

Details

RevoScaleR does not have an exact analogue of the dplyr concept of a src, and because of this, the dplyrXdf implementation of copy_to is somewhat different. In dplyrXdf, the function serves two related, overlapping purposes:

- Copying a dataset into a RevoScaleR data source, such as a table in a SQL database (the RxDataSource method)
- Uploading a dataset to HDFS (the RxHdfsFileSystem method)

The copy_to_hdfs function is a simple wrapper to the HDFS upload method that avoids having to create an explicit filesystem object. Its arguments other than host and port are simply passed as-is to copy_to.RxHdfsFileSystem.
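
For instance, these two calls do the same thing:

# copy_to_hdfs creates the RxHdfsFileSystem object for you
hd <- RxHdfsFileSystem()
copy_to(hd, mtcars)

copy_to_hdfs(mtcars)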

The method for uploading to HDFS can handle both the case where you are logged into the edge node of a Hadoop/Spark cluster, and the case where you are a remote client. In the latter case, the upload is a two-stage process: the data is first transferred to the native filesystem of the edge node, and then copied from the edge node into HDFS. Similarly, it can handle uploading both to the host HDFS filesystem, and to an attached Azure Data Lake Store. If dest points to an ADLS host, the file will be uploaded there. You can override this by supplying an explicit URI for the uploaded file, in the form adl://azure.host.name/path. The name for the host HDFS filesystem is adl://host/.
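
As a hedged sketch of the override (assuming the URI is supplied via the name argument; the hostname and path here are hypothetical):

# upload to a specific ADLS store, overriding the default destination
copy_to_hdfs(mtcars, name="adl://azure.host.name/user/me/mtcars")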

For the HDFS upload method, any arguments in ... are passed to hdfs_upload, and ultimately to the Hadoop fs -copyFromLocal command. For the data source copy method, arguments in ... are passed to rxDataStep.
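
For example, a sketch assuming rxDataStep's overwrite argument (any other rxDataStep argument could be passed the same way):

# overwrite is not a copy_to argument; it is passed via ... to rxDataStep
copy_to(mtdb, mtcars, overwrite=TRUE)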

copy_to is meant for copying datasets to different backends. If you are simply copying a file to HDFS, consider using hdfs_upload; or if you are copying an Xdf file to a different location in the same filesystem, use copy_xdf or file.copy.
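
Minimal sketches of those alternatives (the filenames here are hypothetical):

# plain file upload, no conversion to Xdf
hdfs_upload("local_file.csv", "/user/me")

# copy an Xdf data source to another location in the same filesystem
copy_xdf(mtx, "mtcars_copy")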

Value

An Xdf data source object pointing to the uploaded data.

Note on composite Xdf

There are actually two kinds of Xdf files: standard and composite. A composite Xdf file is a directory containing multiple data and metadata files, which the RevoScaleR functions treat as a single dataset. Xdf files in HDFS must be composite in order to work properly; copy_to will convert an existing Xdf file into composite, if it's not already in that format. Non-Xdf datasets (data frames and other RevoScaleR data sources, such as text files) will similarly be uploaded as composite.
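
For instance, a minimal sketch that does the conversion explicitly before uploading (assuming as_composite_xdf accepts a data frame, as as_xdf does in the examples below):

# create a composite Xdf: a directory of data and metadata files
mtc <- as_composite_xdf(mtcars)
copy_to(RxHdfsFileSystem(), mtc)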

See Also

rxHadoopCopyFromClient, rxHadoopCopyFromLocal, collect and compute for downloading data from HDFS, as_xdf, as_composite_xdf

Examples

## Not run: 
# copy a data frame to SQL Server
connStr <- "SERVER=hostname;DATABASE=RevoTestDB;TRUSTED_CONNECTION=yes"
mtdb <- RxSqlServerData("mtcars", connectionString=connStr)
copy_to(mtdb, mtcars)

# copy an Xdf file to SQL Server: will overwrite any existing table with the same name
mtx <- as_xdf(mtcars, overwrite=TRUE)
copy_to(mtdb, mtx)

# copy a data frame to HDFS
hd <- RxHdfsFileSystem()
mth <- copy_to(hd, mtcars)
# assign a new filename on copy
mth2 <- copy_to(hd, mtcars, "mtcars_2")

# copy an Xdf file to HDFS
mth3 <- copy_to(hd, mtx, "mtcars_3")

# same as copy_to(hd, ...)
delete_xdf(mth)
copy_to_hdfs(mtcars)

# copying to attached ADLS storage
copy_to_hdfs(mtcars, host="adl://adls.host.name")

## End(Not run)
