collect: Pull a Hive table to a local data frame
In ZurichPA/honeycomb: dplyr backend for Hive


collect forces computation of a query and pulls the result down into a local data frame.

## S3 method for class 'tbl_Hive'
collect(x, n, batch = 1e+05, quiet = TRUE, ...)

`x`	A `tbl_Hive` object
`n`	The number of rows to pull down
`batch`	Size of batches in which to fetch the data (passed to `hive_query`)
`quiet`	Whether to suppress progress updates while the data is being fetched
`...`	Additional arguments passed to `hive_query`

By default, collect will pull down at most 100,000 rows of the result table. If exactly 100,000 rows come back, a warning message is printed, because the result may have been truncated.

To pull down all of the rows, specify n = Inf. When using collect, keep in mind how large the data set you are pulling down is, in terms of both the number of rows and the number of columns.

collect works by calling hive_query. If necessary, you can specify the batch argument as you would in hive_query to avoid out-of-memory errors, and the quiet argument to control whether progress messages are printed while the data is being pulled. Unlike in hive_query, quiet is TRUE by default in collect; set quiet = FALSE if you want messages to be printed.
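As a rough illustration of the arguments described above, the sketch below wraps collect in a small helper that requests every row, uses a smaller batch size, and turns progress messages on. The helper name fetch_all and the batch value of 10,000 are assumptions made for this example, not part of honeycomb. A built-in local data frame stands in for a tbl_Hive so the snippet runs without a Hive connection; dplyr's data frame method ignores the extra arguments.

```r
library(dplyr)

# Hypothetical helper (not part of honeycomb): fetch all rows in smaller
# batches and print progress messages while the data is pulled.
fetch_all <- function(x, batch = 1e4) {
  collect(x, n = Inf, batch = batch, quiet = FALSE)
}

# Local stand-in for a tbl_Hive: for a plain data frame, collect() returns
# its input unchanged and the n, batch, and quiet arguments have no effect.
local_copy <- fetch_all(mtcars)
```

With a real tbl_Hive, a smaller batch should lower peak memory use during the fetch, presumably at the cost of more round trips.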

To the extent that you can leave the data in Hive, it is best to do so. collect should only be called once you have a data set that has been filtered, aggregated, and narrowed down to the columns you need, such as a modeling data set.
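A minimal sketch of this pattern, assuming the usual dplyr verbs are supported on a tbl_Hive: filter, aggregate, and narrow the table first, and call collect only as the last step. The helper build_model_data, the claims table, and its columns (year, region, paid) are hypothetical names chosen for illustration. A small local tibble is used so the example runs without a Hive connection.

```r
library(dplyr)

# Hypothetical helper: reduce a table to a small modeling data set before
# pulling it into local memory. `claims` may be a tbl_Hive created elsewhere
# or, as in the demo below, a local data frame.
build_model_data <- function(claims) {
  claims %>%
    filter(year >= 2019) %>%                 # keep only the rows needed
    group_by(region) %>%                     # aggregate before fetching
    summarise(total_paid = sum(paid), n_claims = n()) %>%
    collect(n = Inf)                         # result is small, so fetch all rows
}

# Local stand-in data (made up for this example).
claims_local <- tibble(
  year   = c(2018, 2019, 2019, 2020),
  region = c("N", "N", "S", "S"),
  paid   = c(10, 20, 30, 40)
)

model_data <- build_model_data(claims_local)
```

Because the aggregation happens before collect, only one row per region is transferred, so n = Inf is safe here.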

A tibble containing the query result.

ZurichPA/honeycomb documentation built on Aug. 29, 2020, 6:56 p.m.