Summarise multiple values to a single value
.data: A tbl for an Xdf data source; or a raw Xdf data source.
...: Name-value pairs of summary functions.
.outFile: Output format for the returned data. If not supplied, create an xdf tbl; if NULL, return a data frame; if a filename, save the result to an Xdf file at that location.
.rxArgs: A list of RevoScaleR arguments.
There are 5 possible methods for doing the summarisation. To choose which method is used, specify a .method argument in the call to summarise, with a number from 1 to 5:
1. Use rxCube, cbind data frames together: only n(), mean(), sum() supported, grouped data only (fast).
2. Use rxSummary, cbind data frames together: stats in rxSummary supported (fast).
3. As method 2, but build classification levels by pasting the grouping variable(s) together (moderately fast).
4. Split into multiple Xdfs by group, run dplyr::summarise on each, rbind xdfs together: arbitrary stats supported (slow).
5. Split into multiple Xdfs by group, run rxSummary on each, rbind xdfs together: stats in rxSummary supported (slowest, most scalable).
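
As a minimal sketch of choosing a method explicitly, the calls below pass .method to summarise on grouped data. The data set and the as_xdf and group_by calls follow the examples on this page; the object names grp, s1 and s4 are illustrative only, and using median() as an "arbitrary" statistic under method 4 is an assumption based on the method descriptions above.

```r
library(dplyrXdf)

mtx <- as_xdf(mtcars, overwrite=TRUE)
grp <- group_by(mtx, cyl)

# method 1: rxCube, grouped data, only n(), mean(), sum()
s1 <- summarise(grp, m=mean(mpg), .method=1)

# method 4: dplyr::summarise on each group, arbitrary stats (assumed to include median())
s4 <- summarise(grp, m=median(mpg), .method=4)

as.data.frame(s1)
as.data.frame(s4)
```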
The default method is 1 if the data is grouped and the requested summary statistics are supported by rxCube; otherwise 2 if the requested statistics are supported by rxSummary; otherwise 4. Method 3 is supplied for the case where the product of factor levels for the grouping variables exceeds 2^32 - 1, a known limitation of rxCube and rxSummary.
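
The default rule can be sketched in plain R. This is only an illustration of the logic described above, not the package's actual code: choose_method and exceeds_level_limit are hypothetical helpers, and the list of statistics assumed to be supported by rxSummary is a placeholder that should be checked against the RevoScaleR documentation.

```r
# Hypothetical helper: mirrors the documented default rule (1, else 2, else 4).
choose_method <- function(stats, grouped,
                          cube_stats = c("n", "mean", "sum"),
                          # assumed set of rxSummary statistics; verify before relying on it
                          summary_stats = c("n", "mean", "sum", "sd", "var", "min", "max"))
{
    if (grouped && all(stats %in% cube_stats)) return(1L)
    if (all(stats %in% summary_stats)) return(2L)
    4L
}

# Hypothetical helper: does the product of factor levels exceed 2^32 - 1?
exceeds_level_limit <- function(n_levels)
{
    prod(as.numeric(n_levels)) > 2^32 - 1
}

choose_method(c("mean", "sum"), grouped = TRUE)   # 1
choose_method(c("mean", "sd"), grouped = TRUE)    # 2
choose_method("median", grouped = TRUE)           # 4
choose_method("mean", grouped = FALSE)            # 2

exceeds_level_limit(c(1e5, 1e5))   # TRUE: 1e10 combinations, a case for .method=3
exceeds_level_limit(c(3, 4))       # FALSE
```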
Supplying custom functions to summarise is supported, but they must be named functions (and will automatically cause .method=4 to be selected). Anonymous functions will cause an error.
Due to limitations in RevoScaleR support for HDFS, you should take note of the following:
The result of the summarise will be streamed to the client (either the edge node or a remote client) before being written back to HDFS.
If summarising over character grouping variables, it may be faster to specify .method=4 or 5. This is because the usual summarise functions, rxSummary and rxCube, require factor or numeric groups, and converting character to factor can be slow for HDFS data.
An object representing the summary. This depends on the .outFile argument: if missing, it will be an xdf tbl object; if NULL, a data frame; and if a filename, an Xdf data source referencing a file saved to that location.
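
The three cases can be illustrated with a short sketch. It reuses the mtcars data and the file name from the examples on this page; the object names tbl_out, df_out and xdf_out are illustrative only.

```r
library(dplyrXdf)

mtx <- as_xdf(mtcars, overwrite=TRUE)

# .outFile missing: the result is an xdf tbl
tbl_out <- summarise(mtx, m=mean(mpg))

# .outFile=NULL: the result is returned as a data frame
df_out <- summarise(mtx, m=mean(mpg), .outFile=NULL)

# .outFile as a filename: the result is an Xdf data source pointing at that file
xdf_out <- summarise(mtx, m=mean(mpg), .outFile="mtcars_summary.xdf")
```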
summarise in package dplyr, rxCube, rxSummary
library(dplyrXdf)

mtx <- as_xdf(mtcars, overwrite=TRUE)

# summarise the whole data set
tbl <- summarise(mtx, m=mean(mpg))
as.data.frame(tbl)

# summarise by group
tbl2 <- group_by(mtx, cyl) %>% summarise(m=mean(mpg))
as.data.frame(tbl2)

# filter and summarise simultaneously with .rxArgs
tbl3 <- summarise(mtx, m=mean(mpg), .rxArgs=list(rowSelection=cyl > 4))
as.data.frame(tbl3)

# compute a weighted mean
tbl4 <- summarise(mtx, m=mean(mpg), .rxArgs=list(pweights="wt"))
as.data.frame(tbl4)

# save to a persistent Xdf file
summarise(mtx, m=mean(mpg), .outFile="mtcars_summary.xdf")