write_udaf_scripts: Writes User Defined Aggregation Function
In clarkfitzg/RHive: Run R Code Through Apache Hive


View source: R/udaf.R

Generates the R and SQL scripts that Hive needs to call an R function as a user defined aggregation function.

write_udaf_scripts(f, cluster_by, input_table, input_cols, input_classes,
  output_table, output_cols, output_classes, base_name = "udaf",
  include_script = NULL, overwrite_script = FALSE,
  overwrite_table = FALSE, rows_per_chunk = 1000000L, sep = "'\\t'",
  verbose = FALSE, try = FALSE, tmptable = "tmp")

`f`	function which accepts a grouped data frame and returns a data frame
`cluster_by`	character name of the column to `CLUSTER BY`, i.e. split the main table on this column and apply `f` to each group
`input_table`	character name of the table to be transformed, i.e. `SELECT input_cols FROM input_table`. Can also contain more SQL, such as `input_table WHERE col1 < 10`.
`input_cols`	input column names. See `col.names` in `read.table`.
`input_classes`	character vector of classes for columns. See `colClasses` in `read.table`.
`output_table`	character name of table to `INSERT INTO output_table`
`output_cols`	character vector of the columns that `f` will output
`base_name`	character base name of the scripts to write, e.g. `base_name = "foo"` writes foo.R and foo.sql
`include_script`	character name of an R script to include in the generated script. This may contain supporting functions, for example.
`overwrite_script`	logical. Write over any existing scripts named with `base_name`?
`overwrite_table`	logical. First call `DROP TABLE output_table`, and then `CREATE TABLE output_table` with appropriate column types?
`rows_per_chunk`	integer number of rows to process in each chunk. If this is too small, say 10, then the generated script will be slow. If this is too large, say 1 billion, then the R process may fail because it uses excessive memory.
`sep`	character field separator string
`verbose`	logical log messages to `stderr` so that they can be examined later via `$ yarn logs -applicationId <your app id> -log_files stderr`
`try`	logical. If `try = TRUE` then the script will attempt to call `f` on every group and ignore the groups that fail. If `try = FALSE` then a failure on any group will cause the whole Hive job to fail.
`tmptable`	character name of temporary table in SQL query
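
As a rough illustration of how `input_cols` and `input_classes` relate to `read.table`, the following base R sketch parses a few tab-separated rows. The data and the column names `station` and `flow` are hypothetical, and the separator is passed as a plain tab. This is not the code that the generated script runs, only the `read.table` mechanism the arguments refer to.

```r
# Hypothetical tab-separated rows, standing in for what Hive streams to R
raw <- "a\t1.5\nb\t2.5\na\t3.0\n"

# Assumed column names and classes (not from any real table)
input_cols <- c("station", "flow")
input_classes <- c("character", "numeric")

# col.names and colClasses play the roles of input_cols and input_classes
chunk <- read.table(text = raw, sep = "\t",
                    col.names = input_cols,
                    colClasses = input_classes,
                    stringsAsFactors = FALSE)

str(chunk)
```

Giving the classes up front means `read.table` does not have to guess column types, so every chunk is parsed the same way.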

This approach splits the data based on the value of the column `cluster_by`. Each group of split data must be small enough to fit in the memory of the R process that handles it.
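
The split and apply idea can be tried locally in base R before involving Hive. The sketch below is a simplified stand-in, not the generated script: the data, the function `f`, and the helper `apply_by_group` are all hypothetical, and the `try` argument only imitates the behavior described for `try` above.

```r
# Hypothetical data standing in for rows of the Hive input table
d <- data.frame(station = c("a", "b", "a", "b"),
                flow = c(1, 2, 3, 4),
                stringsAsFactors = FALSE)

# f accepts one group as a data frame and returns a data frame.
# It deliberately fails on station "b" to show the effect of try.
f <- function(grp) {
    if (grp$station[1] == "b") stop("cannot process station b")
    data.frame(station = grp$station[1], mean_flow = mean(grp$flow))
}

# Hypothetical helper: split d on one column and apply f to each group
apply_by_group <- function(d, cluster_by, f, try = FALSE) {
    groups <- split(d, d[[cluster_by]])
    results <- lapply(groups, function(grp) {
        if (try) {
            # A failing group yields NULL instead of stopping everything
            tryCatch(f(grp), error = function(e) NULL)
        } else {
            f(grp)
        }
    })
    # Drop the failed groups, then stack the remaining data frames
    do.call(rbind, Filter(Negate(is.null), results))
}

out <- apply_by_group(d, "station", f, try = TRUE)
print(out)
```

With `try = FALSE` the same call stops with an error at the first failing group, which mirrors a failure on any group failing the whole job.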

This function is relatively low level. It provides the foundation for something more advanced that knows and uses the schema of the database. Defaults were chosen to do the least destructive things possible, so they don't overwrite existing files and data.

Feedback:

Should the argument names be consistent with similar functions and packages? For example, the DBI package uses `statement` and `lapply` uses `FUN`.

Alternatively, capitals could denote SQL things, e.g. `CLUSTER_BY`.

`scripts`	character vector containing generated scripts