summarise: Summarise multiple values to a single value

Description

Summarise multiple values to a single value

Usage

## S3 method for class 'RxFileData'
summarise(.data, ..., .outFile = tbl_xdf(.data), .rxArgs,
  .method = NULL)

## S3 method for class 'RxDataSource'
summarise(.data, ...)

Arguments

.data

A tbl for an Xdf data source; or a raw Xdf data source.

...

Name-value pairs of summary functions like min(), mean(), max() etc.

.outFile

Output format for the returned data. If not supplied, create an xdf tbl; if NULL, return a data frame; if a character string naming a file, save an Xdf file at that location.

.rxArgs

A list of RevoScaleR arguments. See rxArgs for details.

.method

A number from 1 to 5 selecting the summarisation method, or NULL (the default) to have the method chosen automatically. See Details.

Details

There are five possible methods for doing the summarisation. To choose one, pass a .method argument to summarise with a number from 1 to 5.

  1. use rxCube, cbind data frames together: only n(), mean(), sum() supported, grouped data only (fast)

  2. use rxSummary, cbind data frames together: stats in rxSummary supported (fast)

  3. as method 2, but build classification levels by pasting the grouping variable(s) together (moderately fast)

  4. split into multiple Xdfs by group, run dplyr::summarise on each, rbind xdfs together: arbitrary stats supported (slow)

  5. split into multiple Xdfs by group, run rxSummary on each, rbind xdfs together: stats in rxSummary supported (slowest, most scalable)

The default method is 1 if the data is grouped and the requested summary statistics are supported by rxCube; otherwise 2 if the requested statistics are supported by rxSummary; otherwise 4. Method 3 is supplied for the case where the product of factor levels for the grouping variables exceeds 2^32 - 1, a known limitation of rxCube and rxSummary.
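
The default rule described above can be sketched in plain R. This is only an illustration of the documented decision order, not the package's actual code: choose_default_method, cube_stats and summary_stats are hypothetical names, and the contents of summary_stats are an assumption about which statistics rxSummary covers.

```r
# Hypothetical helper mirroring the documented default rule; not part of dplyrXdf.
# cube_stats comes from the description of method 1; summary_stats is an
# assumed list of rxSummary statistics, used here purely for illustration.
cube_stats    <- c("n", "mean", "sum")
summary_stats <- c("n", "mean", "sum", "sd", "var", "min", "max")

choose_default_method <- function(grouped, stats) {
  if (grouped && all(stats %in% cube_stats)) {
    1L  # rxCube: grouped data with only n(), mean(), sum()
  } else if (all(stats %in% summary_stats)) {
    2L  # rxSummary: any statistic it supports
  } else {
    4L  # dplyr::summarise on per-group Xdf files: arbitrary statistics
  }
}

choose_default_method(grouped = TRUE,  stats = c("mean", "sum"))
choose_default_method(grouped = FALSE, stats = "mean")
choose_default_method(grouped = TRUE,  stats = "median")
```

Methods 3 and 5 never appear in this sketch because, per the text, they are only used when requested explicitly through .method.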

Supplying custom functions to summarise is supported, but they must be named functions (and will automatically cause .method=4 to be selected). Anonymous functions will cause an error.
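
As a hedged illustration of the named-function requirement, the sketch below defines a geometric mean and passes it to summarise. The name geo_mean is made up for this example, and the dplyrXdf calls are guarded so they only run where dplyrXdf and RevoScaleR are installed.

```r
# A named custom function: the geometric mean of positive values.
# geo_mean is a hypothetical name chosen for this example.
geo_mean <- function(x) exp(mean(log(x)))

# Only attempt the Xdf part where dplyrXdf (and hence RevoScaleR) is available.
if (requireNamespace("dplyrXdf", quietly = TRUE)) {
  library(dplyr)
  library(dplyrXdf)

  mtx <- as_xdf(mtcars, overwrite = TRUE)

  # A named function is accepted; per the text above it selects .method = 4.
  tbl <- group_by(mtx, cyl) %>% summarise(g = geo_mean(mpg))
  print(as.data.frame(tbl))

  # An anonymous function is documented to cause an error, so it is left
  # commented out here:
  # summarise(mtx, g = (function(x) exp(mean(log(x))))(mpg))
}
```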

Due to limitations in RevoScaleR support for HDFS, you should take note of the following:

Value

An object representing the summary. This depends on the .outFile argument: if missing, it will be an xdf tbl object; if NULL, a data frame; and if a filename, an Xdf data source referencing a file saved to that location.
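
The three return types can be compared side by side. This is a minimal sketch assuming a working dplyrXdf and RevoScaleR installation; out_path is an illustrative location under tempdir(), not a path the package requires.

```r
# Illustrative output location for the file-based variant.
out_path <- file.path(tempdir(), "mtcars_summary.xdf")

# Only run where dplyrXdf (and hence RevoScaleR) is available.
if (requireNamespace("dplyrXdf", quietly = TRUE)) {
  library(dplyr)
  library(dplyrXdf)

  mtx <- as_xdf(mtcars, overwrite = TRUE)

  # .outFile not supplied: the result is an xdf tbl
  res_tbl <- summarise(mtx, m = mean(mpg))

  # .outFile = NULL: the result is a data frame
  res_df <- summarise(mtx, m = mean(mpg), .outFile = NULL)

  # .outFile is a filename: the result is an Xdf data source for that file
  res_xdf <- summarise(mtx, m = mean(mpg), .outFile = out_path)

  # Inspect what each call returned
  print(class(res_tbl))
  print(class(res_df))
  print(class(res_xdf))
}
```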

See Also

summarise in package dplyr, rxCube, rxSummary

Examples

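library(dplyr)     # provides %>% and group_by
library(dplyrXdf)

# create an Xdf data source from the mtcars data frame
mtx <- as_xdf(mtcars, overwrite=TRUE)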

tbl <- summarise(mtx, m=mean(mpg))
as.data.frame(tbl)

tbl2 <- group_by(mtx, cyl) %>% summarise(m=mean(mpg))
as.data.frame(tbl2)

# filter and summarise simultaneously with .rxArgs
tbl3 <- summarise(mtx, m=mean(mpg), .rxArgs=list(rowSelection=cyl > 4))
as.data.frame(tbl3)

# compute a weighted mean
tbl4 <- summarise(mtx, m=mean(mpg), .rxArgs=list(pweights="wt"))
as.data.frame(tbl4)

# save to a persistent Xdf file
summarise(mtx, m=mean(mpg), .outFile="mtcars_summary.xdf")

RevolutionAnalytics/dplyrXdf documentation built on June 3, 2019, 9:08 p.m.