# clusterconf


## Purpose

The `clusterconf` package extends the `config` package to provide a proposed standard for obtaining Hadoop cluster configurations from R. This allows configuration details to be handled separately from functionality and alleviates the burden on end users to provide those details (assuming one person per cluster bites the bullet for the greater good) when loading a feature package.

## Approach

The gist of how this is done is to combine YAML configuration details (read by `config`) with R's built-in dependency management (i.e., `install.packages`). So long as child packages (e.g., `clusterconf.mycluster`) follow certain rules, feature packages can simply declare a dependency on `clusterconf` and then work with any cluster. The functional interface for requesting configuration values is exposed by the `clusterconf` package, while the actual configurations and Java dependencies are provided by the child, cluster-specific package.
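
As a rough illustration of that combination, the sketch below installs a hypothetical child package and reads its YAML with `config`. The exact retrieval logic lives inside `clusterconf`; the helper code here is illustrative, not its API.

```r
# Sketch only: assumes a child package named clusterconf.mycluster that ships
# its YAML in inst/configs (the convention described later in this README).
install.packages("clusterconf.mycluster")

# Locate the installed YAML and read it with the config package.
yaml_dir  <- system.file("configs", package = "clusterconf.mycluster")
yaml_file <- list.files(yaml_dir, pattern = "\\.yaml$", full.names = TRUE)[1]
cluster_cfg <- config::get(file = yaml_file)
```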

## The Child Package

It is unlikely that many (if any) implementing packages will be publicly available, since they contain information about particular Hadoop clusters. Therefore we sketch out here what components are necessary.

### R Code & Package Naming Convention

Only one R function is required. The child package must implement `get_cluster_name`, which returns a string. Once spaces are removed and all letters are converted to lower case, that return value must match the portion of the package name after the period. For example, if `get_cluster_name` returns "My Cluster", then the R package must be named `clusterconf.mycluster`. The reason for the strict package naming convention is so that `clusterconf::list_available_clusters` can search both installed packages and packages in any configured repositories (i.e., the return value of `getOption("repos")`) for child packages.
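
A minimal sketch of that single required function, for the hypothetical "My Cluster" / `clusterconf.mycluster` example above:

```r
# In the child package clusterconf.mycluster (hypothetical example).

#' Name of the cluster this package describes.
#'
#' Removing spaces and lower-casing the result must yield "mycluster",
#' i.e., the portion of the package name after the period.
#' @export
get_cluster_name <- function() {
  "My Cluster"
}
```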

The cluster-specific package essentially just contains a configuration YAML similar to the one shown below. During development this file should be placed in `./inst/configs`, and any Java dependencies should go in `./inst/java`.

### YAML Configuration

A YAML configuration file is the main component of the cluster-specific package. It should be placed in `<package root>/inst/configs` and may be named anything so long as it is a `*.yaml` file. Though it is unlikely that any one feature package will require all of the information in a configuration file, it is good practice to include as much as possible so that a variety of feature packages will work. The configurations are accessed lazily, so errors only arise when a feature package attempts to use a missing configuration.

Below is a sample configuration file with notes (instead of values) about what is expected in each field.

```yaml
default:
  cluster:
    home: path to hadoop install on edge node, e.g., "/usr/hdp/hadoop"
    ha_node: name of high availability node, e.g., "mycluster-ha"
    name_node: names of name nodes, e.g., ["mycluster-nn1", "mycluster-nn2"]
    edge_node: name of edge node, e.g., "mycluster-gw"
    edge_port: port to use to connect to edge node, e.g., 22
  yarn:
    conf_dir: path to yarn configurations on edge node, e.g., "/etc/hadoop/conf"
  spark:
    home: path to spark installation on edge node, e.g., "/usr/hdp/hadoop/spark"
    conf_dir: path to spark configurations on edge node, e.g., "/etc/spark/conf"
    packages: default spark packages to include on the path when starting a session, e.g., ["com.databricks:spark-avro_2.10:2.0.1", "com.databricks:spark-csv_2.10:1.5.0"]
  spark_sql:
    host: e.g., "mycluster-rm1"
    port: e.g., "10001"
    driver: driver class name, e.g., "org.apache.hive.jdbc.HiveDriver"
    classpath: driver classpath, e.g., ["hive-jdbc.jar", "hadoop-common-2.6.0.2.2.4.2-2.jar"]
  hive:
    pw_required: default authentication required setting, e.g., false
    host: e.g., "mycluster-rm2"
    port: e.g., "10010"
    driver: driver class name, e.g., "org.apache.hive.jdbc.HiveDriver"
    classpath: driver classpath, e.g., ["hive-jdbc.jar", "hadoop-common-2.6.0.2.2.4.2-2.jar"]
  drill:
    storage: default storage name (from drill UI), e.g., "dfs"
    hive_storage: name of hive storage plugin (for querying Hive metastore)
    url: connection URL (equivalent to host + port for hive but generally more complicated for drill)
    driver: driver class name, e.g., "org.apache.drill.jdbc.Driver"
    classpath: driver classpath, e.g., "drill-jdbc-all-1.6.0.jar"
  webhdfs:
    port: connection port, e.g., "50070"
    suffix: the part of the webhdfs URL after the port, e.g., "webhdfs/v1"
  tools:
    avro: path to avro-tools on edge node, e.g., "/lib/avro-tools-1.7.7.jar"
    parquet: path to parquet-tools on edge node, e.g., "/lib/parquet-tools-1.8.1.jar"
  resources:
    package: package containing java resources, e.g., "clusterconf.myothercluster"
    directory: a directory containing java resources
```
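
Given a configuration like the sample above, loaded into `cluster_cfg` as in the earlier sketch (again an assumption about usage, not `clusterconf`'s actual interface), individual values are just elements of a nested list:

```r
cluster_cfg$spark$home       # "/usr/hdp/hadoop/spark"
cluster_cfg$hive$port        # "10010"
cluster_cfg$spark$packages   # character vector of default Spark packages

# A section omitted from the YAML simply comes back NULL, so nothing fails
# until a feature package actually tries to use the missing value.
cluster_cfg$not_configured   # hypothetical missing section -> NULL
```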

A few additional notes follow.

### Java Dependencies

Java dependencies should be placed in `<package root>/inst/java`. Exact file-name matches are expected. Using the above YAML configuration as an example, if the `resources` package reference were self-referential, then `clusterconf` would expect to find `hive-jdbc.jar`, under exactly that name, in the aforementioned directory.
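
For instance, a jar named in one of the `classpath` fields might be resolved against that directory as sketched below (the package name is the hypothetical one used earlier; `system.file` returns `""` when no exact match exists):

```r
# Look up a jar shipped in the child package's inst/java directory.
jar <- system.file("java", "hive-jdbc.jar", package = "clusterconf.mycluster")
if (!nzchar(jar)) {
  stop("hive-jdbc.jar not found in clusterconf.mycluster's inst/java directory")
}
```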


