hdfs: Utilities for HDFS


Description

Functions for working with files in HDFS: directory listing; file copy, move and delete; directory create and delete; test for file/directory existence; check if in HDFS; expunge Trash.

Usage

hdfs_dir(path = ".", ..., full_path = FALSE, include_dirs = FALSE,
  recursive = FALSE, dirs_only = FALSE, pattern = NULL,
  host = hdfs_host())

## S3 method for class 'dplyrXdf_hdfs_dir'
print(x, ...)

hdfs_host(object = NULL)

hdfs_dir_exists(path, host = hdfs_host())

hdfs_file_exists(path, host = hdfs_host())

hdfs_dir_create(path, ..., host = hdfs_host())

hdfs_dir_remove(path, ..., host = hdfs_host())

hdfs_file_copy(src, dest, ..., host = hdfs_host())

hdfs_file_move(src, dest, ..., host = hdfs_host())

hdfs_file_remove(path, ..., host = hdfs_host())

hdfs_expunge()

in_hdfs(object)

Arguments

path

An HDFS pathname.

...

For hdfs_dir, further switches, prefixed by "-", to pass to the Hadoop fs -ls command. For other functions, further arguments to pass to rxHadoopCommand.

full_path

For hdfs_dir, whether to prepend the directory path to filenames to give the full path. If FALSE, only the file names are returned.

include_dirs

For hdfs_dir, whether subdirectory names should be included in the listing. Always TRUE for non-recursive listings.

recursive

For hdfs_dir, whether the listing should recurse into subdirectories.

dirs_only

For hdfs_dir, whether only subdirectory names should be included.

pattern

For hdfs_dir, an optional regular expression. Only filenames that match the expression will be returned.

host

The HDFS hostname as a string, in the form adl://host.name. You should only need to set this if you have an attached Azure Data Lake Store that you are accessing via HDFS. It can also be an RxHdfsFileSystem object, in which case the hostname is taken from the object.

object

For in_hdfs and hdfs_host, an R object, typically a RevoScaleR data source object.

src, dest

For hdfs_file_copy and hdfs_file_move, the source and destination paths.

Details

These are utility functions to simplify working with files and directories in HDFS. For the most part, they wrap lower-level functions provided by RevoScaleR, which in turn wrap various Hadoop file system commands. They work with any file that is stored in HDFS, not just Xdf files.

The hdfs_dir function is analogous to dir for the native filesystem. Like that function, and unlike rxHadoopListFiles, it returns a vector of filenames (rxHadoopListFiles returns the printed output of the hadoop fs -ls command, which is not quite the same thing). Also unlike rxHadoopListFiles, it does not print anything by default; the print method takes care of that.
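As a sketch of how these options combine (this assumes a working Hadoop cluster; the paths shown are hypothetical):

```r
library(dplyrXdf)

# list only Xdf files, with the directory path prepended to each name
hdfs_dir("/user/RevoShare", pattern="\\.xdf$", full_path=TRUE)

# recurse into subdirectories, returning directory names only
hdfs_dir("/user/RevoShare", recursive=TRUE, dirs_only=TRUE)
```

Because the return value is a plain character vector, it can be passed directly to other functions such as hdfs_file_remove, in the same way that the output of dir can be passed to file.remove.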

hdfs_dir_exists and hdfs_file_exists test for the existence of a given directory and file, respectively. They are analogous to dir.exists and file.exists for the native filesystem.

hdfs_dir_create and hdfs_dir_remove create and remove directories. They are analogous to dir.create and unlink(recursive=TRUE) for the native filesystem.

hdfs_file_copy and hdfs_file_move copy and move files. They are analogous to file.copy and file.rename for the native filesystem. Unlike rxHadoopCopy and rxHadoopMove, they are vectorised in both src and dest.
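The vectorisation means several files can be copied or moved in one call, much like file.copy. A minimal sketch, assuming a working Hadoop cluster (the file and directory names are hypothetical):

```r
library(dplyrXdf)

# copy two files into one destination directory in a single call;
# rxHadoopCopy would require one call per file
src <- c("data/file1.csv", "data/file2.csv")
hdfs_file_copy(src, "backup")

# move (rename) a single file; src and dest are recycled as needed
hdfs_file_move("old/results.xdf", "new/results.xdf")
```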

Currently, RevoScaleR has only limited support for accessing multiple HDFS filesystems simultaneously. In particular, src and dest should both be on the same HDFS filesystem, whether host or ADLS.

hdfs_file_remove deletes files. It is analogous to file.remove and unlink for the native filesystem.

hdfs_expunge empties the HDFS trash.

Value

hdfs_dir returns a vector of filenames, optionally with the full path attached.

hdfs_host returns the hostname of the HDFS filesystem for the given object. If no object is specified, or if the object is not in HDFS, it returns the hostname of the currently active HDFS filesystem. This is generally "default" unless you are in the RxHadoopMR or RxSpark compute context and using an Azure Data Lake Store, in which case it returns the ADLS name node.

hdfs_dir_exists and hdfs_file_exists return TRUE or FALSE depending on whether the directory or file exists.

The other hdfs_* functions return TRUE or FALSE depending on whether the operation succeeded.

in_hdfs returns TRUE or FALSE depending on whether the given object is stored in HDFS. This will be TRUE for an Xdf data source or file data source in HDFS, or for a Spark data source. Classes for the latter include RxHiveData, RxParquetData and RxOrcData.
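For example, a sketch of how in_hdfs distinguishes local from HDFS-backed data sources (this assumes RevoScaleR is available; the file paths are hypothetical):

```r
library(dplyrXdf)

# an Xdf data source on the native filesystem
local_xdf <- RxXdfData("mtcars.xdf")
in_hdfs(local_xdf)    # FALSE

# the same data source type, but pointing into HDFS
hdfs_xdf <- RxXdfData("/user/me/mtcars", fileSystem=RxHdfsFileSystem())
in_hdfs(hdfs_xdf)     # TRUE
```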

See Also

dir, dir.exists, file.exists, dir.create, file.copy, file.rename, file.remove, unlink, rxHadoopListFiles, rxHadoopFileExists, rxHadoopMakeDir, rxHadoopRemoveDir, rxHadoopCopy, rxHadoopMove, rxHadoopRemove

Examples

## Not run: 
hdfs_host()

mtx <- as_xdf(mtcars, overwrite=TRUE)
mth <- copy_to_hdfs(mtx)
in_hdfs(mtx)
in_hdfs(mth)
hdfs_host(mth)

# always TRUE
hdfs_dir_exists("/")
# should always be TRUE if Microsoft R is installed on the cluster
hdfs_dir_exists("/user/RevoShare")

# listing of home directory: /user/<username>
hdfs_dir()

# upload an arbitrary file
desc <- system.file("DESCRIPTION", package="dplyrXdf")
hdfs_upload(desc, "dplyrXdf_description")
hdfs_file_exists("dplyrXdf_description")

# creates /user/<username>/foo
hdfs_dir_create("foo")
hdfs_file_copy("dplyrXdf_description", "foo")
hdfs_file_exists("foo/dplyrXdf_description")

hdfs_file_remove("dplyrXdf_description")
hdfs_dir_remove("foo")

## End(Not run)

RevolutionAnalytics/dplyrXdf documentation built on June 3, 2019, 9:08 p.m.