write_udaf_scripts: Writes User Defined Aggregation Function


View source: R/udaf.R

Description

Generates R and SQL scripts that can be called as user-defined aggregation functions (UDAFs) in Hive.

Usage

write_udaf_scripts(f, cluster_by, input_table, input_cols, input_classes,
  output_table, output_cols, output_classes, base_name = "udaf",
  include_script = NULL, overwrite_script = FALSE,
  overwrite_table = FALSE, rows_per_chunk = 1000000L, sep = "'\\t'",
  verbose = FALSE, try = FALSE, tmptable = "tmp")

Arguments

f

function which accepts a grouped data frame and returns a data frame

cluster_by

character name of the column to CLUSTER BY, i.e. the main table is split on this column and f is applied to each group

input_table

character name of the table to be transformed, i.e. SELECT input_cols FROM input_table. Can also contain more SQL, such as input_table WHERE col1 < 10.

input_cols

input column names. See col.names in read.table.

input_classes

character vector of classes for columns. See colClasses in read.table.

output_table

character name of the table that receives the results, i.e. INSERT INTO output_table

output_cols

character vector of columns that f will output

base_name

character base name of the scripts to write, e.g. a base name of foo gives foo.R and foo.sql

include_script

character name of an R script to include in the generated script. This may contain supporting functions, for example.

overwrite_script

logical; overwrite any existing scripts named after base_name?

overwrite_table

logical; first call DROP TABLE output_table, and then CREATE TABLE output_table with appropriate column types?

rows_per_chunk

integer number of rows to process in each chunk. If this is too small, say 10, then the generated script will be slow. If this is too large, say 1 billion, then the R process may fail because it uses excessive memory.

sep

character field separator string

verbose

logical; log messages to stderr so that they can be examined later, for example via $ yarn logs -applicationId <your app id> -log_files stderr

try

logical. If try = TRUE, the script attempts to call f on every group and ignores the groups that fail. If try = FALSE, a failure on any group causes the whole Hive job to fail.

tmptable

character name of temporary table in SQL query
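
As an illustration of how input_cols, input_classes, sep and rows_per_chunk relate to read.table, the following is a minimal base R sketch that reads tab-separated rows in chunks. It is not the script this function generates. The data, the column names and the tiny chunk size are assumptions made for the example, and it does not show how the generated script treats a group that spans two chunks.

```r
# Hypothetical tab-separated rows, standing in for what Hive streams to R.
lines <- c("a\t1", "a\t3", "b\t10", "b\t20", "b\t30")
con <- textConnection(lines)

input_cols     <- c("station", "flow")        # assumed column names
input_classes  <- c("character", "numeric")   # assumed column classes
rows_per_chunk <- 2L                          # tiny on purpose; the default is 1000000L

chunks <- list()
repeat {
  # Read at most rows_per_chunk lines; an empty result means end of input.
  raw <- readLines(con, n = rows_per_chunk)
  if (length(raw) == 0L) break
  chunks[[length(chunks) + 1L]] <- read.table(
    text = raw, sep = "\t", col.names = input_cols,
    colClasses = input_classes, stringsAsFactors = FALSE
  )
}
close(con)

# Combine the chunks into one data frame for inspection.
input <- do.call(rbind, chunks)
str(input)
```

A smaller rows_per_chunk means more calls to read.table for the same data, which is why very small values are slow.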

Details

This approach splits the data based on the value of the column cluster_by. Each group of split data must be small enough to fit in the memory of the R process that handles it.
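
The split-and-apply idea, together with the behaviour described for the try argument, can be sketched locally in base R. This is an assumption-laden analogue, not the generated script: the data frame, the function f and the helper apply_by_group are all hypothetical names introduced for illustration.

```r
# Hypothetical input; the group "c" contains a missing value.
input <- data.frame(
  station = c("a", "a", "b", "b", "c"),
  flow    = c(1, 3, 10, 20, NA),
  stringsAsFactors = FALSE
)

# Hypothetical f: takes the rows of one group and returns a one-row data frame.
# It deliberately fails on groups that contain NA.
f <- function(grp) {
  if (anyNA(grp$flow)) stop("missing flow in group ", grp$station[1])
  data.frame(station = grp$station[1], mean_flow = mean(grp$flow),
             stringsAsFactors = FALSE)
}

# Local analogue of CLUSTER BY followed by applying f to each group.
apply_by_group <- function(data, cluster_by, f, try = FALSE) {
  groups <- split(data, data[[cluster_by]])
  results <- lapply(groups, function(grp) {
    # With try = TRUE a failing group yields NULL instead of stopping everything.
    if (try) tryCatch(f(grp), error = function(e) NULL) else f(grp)
  })
  kept <- Filter(Negate(is.null), results)
  out <- do.call(rbind, unname(kept))
  rownames(out) <- NULL
  out
}

result <- apply_by_group(input, "station", f, try = TRUE)
print(result)
```

With try = TRUE the failing group is dropped from the result. Calling apply_by_group with try = FALSE on the same data stops with an error at the first failing group.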

This function is relatively low level. It provides the foundation for something more advanced that knows and uses the schema of the database. Defaults were chosen to do the least destructive things possible, so they don't overwrite existing files and data.
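
To make the relationship between the arguments and a Hive query more concrete, here is a rough sketch that assembles a query of the general Hive TRANSFORM shape from values like those passed to this function. The exact SQL that write_udaf_scripts emits is not shown on this page, so the query layout below is an assumption, and every table and column name is a hypothetical placeholder.

```r
# All values below are hypothetical placeholders.
cluster_by   <- "station"
input_table  <- "flow_raw WHERE flow < 100"   # extra SQL is allowed here
input_cols   <- c("station", "flow")
output_table <- "flow_summary"
output_cols  <- c("station", "mean_flow")
base_name    <- "udaf"
tmptable     <- "tmp"

# Assumed layout: stream clustered rows through an R script via TRANSFORM.
sql <- paste(
  sprintf("ADD FILE %s.R;", base_name),
  sprintf("INSERT INTO TABLE %s", output_table),
  sprintf("SELECT TRANSFORM (%s)", paste(input_cols, collapse = ", ")),
  sprintf("USING 'Rscript %s.R'", base_name),
  sprintf("AS (%s)", paste(output_cols, collapse = ", ")),
  "FROM (",
  sprintf("  SELECT %s FROM %s CLUSTER BY %s",
          paste(input_cols, collapse = ", "), input_table, cluster_by),
  sprintf(") %s;", tmptable),
  sep = "\n"
)
cat(sql, "\n")
```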

Feedback:

Do I attempt to have consistency with similar functions / packages? E.g. the DBI package uses statement, lapply uses FUN

Alternatively I could use caps to denote SQL things, e.g. CLUSTER_BY

Value

scripts character vector containing generated scripts

Examples

#write_udaf_scripts(...)
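
A fuller call might look like the sketch below. The table names, column names and function f are hypothetical, and output_classes is assumed to take R class names in the same way as input_classes, which this page does not confirm. The call is guarded so that it only runs when a package named RHive that exports write_udaf_scripts is installed. Because the function writes script files, run it in a directory where creating flow_udaf.R and flow_udaf.sql is acceptable.

```r
# Hypothetical f: one summary row per group of the clustered input.
f <- function(grp) {
  data.frame(station = grp$station[1], mean_flow = mean(grp$flow),
             stringsAsFactors = FALSE)
}

# Quick local check of f on a single hypothetical group.
print(f(data.frame(station = "a", flow = c(1, 3), stringsAsFactors = FALSE)))

# Only attempt the call when the package providing the function is available.
if (requireNamespace("RHive", quietly = TRUE) &&
    "write_udaf_scripts" %in% getNamespaceExports("RHive")) {
  scripts <- RHive::write_udaf_scripts(
    f,
    cluster_by     = "station",
    input_table    = "flow_raw",                  # hypothetical table
    input_cols     = c("station", "flow"),
    input_classes  = c("character", "numeric"),
    output_table   = "flow_summary",              # hypothetical table
    output_cols    = c("station", "mean_flow"),
    output_classes = c("character", "numeric"),   # assumed to be R classes
    base_name      = "flow_udaf"
  )
}
```

The defaults overwrite_script = FALSE and overwrite_table = FALSE are left in place, in line with the non-destructive defaults described in Details.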

clarkfitzg/RHive documentation built on May 29, 2019, 12:37 p.m.