DFS: Hadoop Distributed File System
In hive: Hadoop InteractiVE


Functions providing high-level access to the Hadoop Distributed File System (HDFS).

DFS_cat( file, con = stdout(), henv = hive() )
DFS_delete( file, recursive = FALSE, henv = hive() )
DFS_dir_create( path, henv = hive() )
DFS_dir_exists( path, henv = hive() )
DFS_dir_remove( path, recursive = TRUE, henv = hive() )
DFS_file_exists( file, henv = hive() )
DFS_get_object( file, henv = hive() )
DFS_read_lines( file, n = -1L, henv = hive() )
DFS_rename( from, to, henv = hive() )
DFS_list( path = ".", henv = hive() )
DFS_tail( file, n = 6L, size = 1024L, henv = hive() )
DFS_put( files, path = ".", henv = hive() )
DFS_put_object( obj, file, henv = hive() )
DFS_write_lines( text, file, henv = hive() )

`henv`	An object containing the local Hadoop configuration.
`file`	a character string representing a file on the DFS.
`files`	a character string representing files located on the local file system to be copied to the DFS.
`n`	an integer specifying the number of lines to read.
`obj`	an R object to be serialized to/from the DFS.
`path`	a character string representing a full path name in the DFS (without the leading `hdfs://`); for many functions the default corresponds to the user's home directory in the DFS.
`recursive`	logical. Should elements of the path other than the last be deleted recursively?
`size`	an integer specifying the number of bytes to be read. Must be sufficiently large otherwise `n` does not have the desired effect.
`text`	a (vector of) character string(s) to be written to the DFS.
`con`	a connection used for printing the output provided by `cat`. Defaults to the standard output connection; the argument currently has no other effect.
`from`	a character string representing a file or directory on the DFS to be renamed.
`to`	a character string representing the new filename on the DFS.

The Hadoop Distributed File System (HDFS) is typically part of a Hadoop cluster or can be used as a stand-alone general purpose distributed file system (DFS). Several high-level functions provide easy access to distributed storage.

DFS_cat is useful for producing output in user-defined functions. It reads from files on the DFS and typically prints the output to the standard output. Its behaviour is similar to the base function cat.

DFS_dir_create creates directories with the given path names if they do not already exist. Its behaviour is similar to the base function dir.create.

DFS_dir_exists and DFS_file_exists return a logical vector indicating whether the directory or file, respectively, named by their argument exists. See also the base function file.exists.

DFS_dir_remove attempts to remove the directory named in its argument and, if recursive is set to TRUE, also attempts to remove its subdirectories recursively.

DFS_list produces a character vector of the names of files in the directory named by its argument.

DFS_read_lines is a reader for (plain text) files stored on the DFS. It returns a vector of character strings representing lines in the (text) file. If n is given as an argument, it reads that many lines from the given file. Its behaviour is similar to the base function readLines.
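Since DFS_read_lines is described as behaving like readLines, the effect of n can be tried out on a local file without a Hadoop cluster. The sketch below uses only base R; it is a local analogue rather than a call into hive, and it assumes that n = -1L means "read to the end of the file" for DFS_read_lines just as it does for readLines.

```r
# Local-file analogue only: readLines() stands in for DFS_read_lines().
tmp <- tempfile(fileext = ".txt")
writeLines(c("Hello HDFS", "Bye Bye HDFS", "third line"), tmp)

first_two <- readLines(tmp, n = 2L)   # read at most the first two lines
all_lines <- readLines(tmp, n = -1L)  # a negative n reads to the end of the file

print(first_two)
print(length(all_lines))

unlink(tmp)  # remove the temporary file
```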

DFS_put copies files named by its argument to a given path in the DFS.

DFS_put_object serializes an R object to the DFS.
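The round trip performed by DFS_put_object and DFS_get_object can be pictured with base R's serialize and unserialize on a local file. This is only an illustration of the idea: the on-disk format that hive actually uses is not stated here, and the temporary file below is a stand-in for a path on the DFS.

```r
# Local analogue only: a temporary file stands in for a file on the DFS.
foo <- function() "You got me serialized."
sro <- tempfile(fileext = ".sro")

# Write the serialized object through a binary connection.
con <- file(sro, open = "wb")
serialize(foo, con)
close(con)

# Read it back and rebuild the object.
con <- file(sro, open = "rb")
restored <- unserialize(con)
close(con)

print(restored())

unlink(sro)  # remove the temporary file
```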

DFS_write_lines writes a given vector of character strings to a file stored on the DFS. Its behaviour is similar to the base function writeLines.

DFS_delete(), DFS_dir_create(), and DFS_dir_remove() return a logical value indicating if the operation succeeded for the given argument.

DFS_dir_exists() and DFS_file_exists() return TRUE if the named directories or files exist in the HDFS.

DFS_get_object() returns the deserialized object stored in a file on the HDFS.

DFS_list() returns a character vector representing the directory listing of the corresponding path on the HDFS.

DFS_read_lines() returns a character vector whose length equals the number of lines read.

DFS_tail() returns a character vector whose length equals the number of lines read up to the end of a file on the HDFS.
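The note that size must be sufficiently large for n to take effect can be illustrated locally. The helper tail_bytes below is hypothetical and not part of hive; it shows one plausible reading of the two arguments, namely that only the final size bytes of the file are inspected and the last n lines are taken from that window. How DFS_tail is implemented internally is an assumption here.

```r
# Hypothetical helper, not part of hive: return the last `n` lines found in
# the final `size` bytes of a local file.
tail_bytes <- function(file, n = 6L, size = 1024L) {
  total <- file.size(file)
  con <- file(file, open = "rb")
  on.exit(close(con))
  # Jump to the start of the trailing window of at most `size` bytes.
  seek(con, where = max(0, total - size), origin = "start")
  raw_tail <- readBin(con, what = "raw", n = size)
  # If the window starts mid-line, the first element is a partial line.
  lines <- strsplit(rawToChar(raw_tail), "\n", fixed = TRUE)[[1]]
  utils::tail(lines, n)
}

# Write 20 lines of 8 bytes each ("line 01\n") through a binary connection
# so that no platform-specific line-ending translation takes place.
tmp <- tempfile()
con <- file(tmp, open = "wb")
writeLines(sprintf("line %02d", 1:20), con)
close(con)

enough    <- tail_bytes(tmp, n = 6L, size = 1024L)  # window covers the whole file
too_small <- tail_bytes(tmp, n = 6L, size = 16L)    # window holds only two lines

print(enough)
print(too_small)

unlink(tmp)  # remove the temporary file
```

With the 16-byte window fewer than n lines are available, which is the situation the description of size warns about.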

Stefan Theussl

The Hadoop Distributed File System (https://hadoop.apache.org/docs/current/hadoop-project-dist/hadoop-hdfs/HdfsDesign.html).

## Do we have access to the root directory of the DFS?
## Not run: DFS_dir_exists("/")
## Some self-explanatory DFS interaction
## Not run: 
DFS_list( "/" )
DFS_dir_create( "/tmp/test" )
DFS_write_lines( c("Hello HDFS", "Bye Bye HDFS"), "/tmp/test/hdfs.txt" )
DFS_list( "/tmp/test" )
DFS_read_lines( "/tmp/test/hdfs.txt" )

## End(Not run)
## Serialize an R object to the HDFS
## Not run: 
foo <- function() "You got me serialized."
sro <- "/tmp/test/foo.sro"
DFS_put_object(foo, sro)
DFS_get_object( sro )()

## End(Not run)
## finally (recursively) remove the created directory
## Not run: DFS_dir_remove( "/tmp/test" )