Description

hmr runs a chunk-wise Hadoop job. hpath and hinput define an HDFS
file path and an input source, respectively.
Usage

hmr(input, output, map = identity, reduce = identity, job.name,
    aux, formatter, packages = loadedNamespaces(), reducers,
    wait = TRUE, hadoop.conf, hadoop.opt, R = "R",
    verbose = TRUE, persistent = FALSE, overwrite = FALSE,
    use.kinit = !is.null(getOption("hmr.kerberos.realm")))

hpath(path)

hinput(path, formatter = .default.formatter)
Arguments

input: input data - see details

output: output path (optional)

map: chunk compute function (map is a misnomer - it operates on whole
chunks, not on individual key/value pairs)

reduce: chunk combine function

job.name: name of the job to pass to Hadoop

aux: either a character vector of symbol names or a named list of
values to push to the compute nodes

formatter: formatter to use. It is optional in hmr - if omitted, the
formatter associated with the input (if any) or the default formatter
is used; see details

packages: character vector of package names to attach on the compute
nodes

reducers: optional integer specifying the number of parallel jobs in
the combine step. It is a hint in the sense that any number greater
than one implies independence of the chunks in the combine step. The
default is to not assume independence.

wait: logical, if TRUE the call blocks until the job has finished,
otherwise it returns as soon as the job has been submitted

hadoop.conf: optional string, path to the hadoop configuration
directory to use for submission

hadoop.opt: additional Java options to pass to the job - named
character vectors are passed as -D<name>=<value> options (see the
sketch after this list for an example)

R: command to call to run R on the Hadoop cluster

verbose: logical, indicating whether the output sent to standard
error and standard output by hadoop should be printed to the console

persistent: logical

overwrite: logical, if TRUE an existing output path is removed before
the job is run

use.kinit: logical, if TRUE then kinit is used to obtain a Kerberos
ticket before the job is submitted; the default enables this whenever
the hmr.kerberos.realm option is set

path: HDFS path
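For orientation, a hypothetical call exercising several of these
arguments; the input path, lookup table, queue property and job name
are all invented for this sketch:

lookup <- c(a = 1, b = 2, c = 3)
r <- hmr(hinput("/data/events"),               # HDFS input, default formatter
         output = hpath("/data/events-out"),   # optional explicit output
         map = function(x) lookup[x[, 1]],     # uses the pushed value
         aux = "lookup",                       # push 'lookup' to the nodes
         packages = "iotools",                 # attach on compute nodes
         reducers = 8L,                        # hint: combine chunks are independent
         hadoop.opt = c("mapreduce.job.queuename" = "default"),
         job.name = "hmr sketch")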
Details

hmr creates and runs a Hadoop job to perform chunk-wise compute +
combine. The input is read using chunk.reader, processed using the
formatter function and passed to the map function. The result is
converted using as.output before going back to Hadoop. The chunk-wise
results are combined using the reduce function - the flow is the same
as in the map case. The result is then returned as an HDFS path.
Either map or reduce can be identity (the default).
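Under the default formatter the map function thus receives a
character matrix with the keys as row names. A small standalone
illustration using mstrsplit from iotools (the sample lines are made
up):

library(iotools)
## two made-up input lines: key before '\t', fields separated by '|'
lines <- c("k1\ta|b", "k2\tc|d")
m <- mstrsplit(lines, sep = "|", nsep = "\t")
m   # 2 x 2 character matrix with row names "k1" and "k2"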
If the formatter is omitted, the format is taken from the input
object (if it has one), otherwise the default formatter (mstrsplit
with '\t' as the key separator and '|' as the column separator) is
used. If the formatter is a function, the same formatter is used for
both the map and reduce steps. If separate formatters are required,
the formatter can be a list with the entries map and/or reduce
specifying the corresponding formatter function.
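If, say, the map step consumed comma-separated input while the
combine step used the default format, the list form could look like
this minimal sketch (the separators and path are invented for
illustration):

library(iotools)
fmt <- list(map    = function(x) mstrsplit(x, sep = ",", nsep = "\t"),
            reduce = function(x) mstrsplit(x, sep = "|", nsep = "\t"))
r <- hmr(hinput("/data/sample.csv"), formatter = fmt,
         map = identity, reduce = identity)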
hpath tags a string as an HDFS path. Its sole purpose is to
distinguish local paths from HDFS paths. hinput creates a subclass of
HDFSpath which also contains the definition of the formatter for that
path. The default formatter honors the default Hadoop settings of
'\t' as the key/value separator and '|' as the field separator.
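For example (the paths are invented for illustration):

p <- hpath("/user/me/output")          # tagged as HDFS, not a local path
class(p)                               # "HDFSpath"
i <- hinput("/user/me/input.csv",      # HDFS input with a custom
            formatter = function(x)    # comma-separated formatter
              iotools::mstrsplit(x, sep = ","))
class(i)                               # c("hinput", "HDFSpath")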
Value

hmr returns the HDFS path to the result when finished.

hpath returns a character vector of class "HDFSpath".

hinput returns a subclass "hinput" of "HDFSpath" containing the
additional "formatter" attribute.
Note

Requires a properly installed Hadoop client. The installation must
either be in /usr/lib/hadoop, or one of the HADOOP_HOME or
HADOOP_PREFIX environment variables must be set accordingly.
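If the client is installed elsewhere, the environment variable can be
set from R before submitting a job; the path below is only an
example:

Sys.setenv(HADOOP_PREFIX = "/opt/hadoop")  # example location, adjust as needed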
Author(s)

Simon Urbanek
Examples

## Not run:
## map points to ZIP codes and count the number of points per ZIP
## uses Tiger/LINE 2010 census data shapefiles
## we can use ctapply because Hadoop guarantees contiguous input
## require(fastshp); require(tl2010)
r <- hmr(
  hinput("/data/points"),
  map = function(x)
    table(zcta2010.db()[
      inside(zcta2010.shp(), x[,4], x[,5]), 1]),
  reduce = function(x) ctapply(as.numeric(x), names(x), sum))
## End(Not run)