Summarise multiple values to a single value
.data | A tbl for an Xdf data source; or a raw Xdf data source.
... | Name-value pairs of summary functions like
.outFile | Output format for the returned data. If not supplied, create an xdf tbl; if
.rxArgs | A list of RevoScaleR arguments. See
There are 5 possible methods for doing the summarisation. To choose which method is used, specify a .method argument in the call to summarise, with a number from 1 to 5.
1. Use rxCube, cbind data frames together: only n(), mean(), sum() supported, grouped data only (fast)
2. Use rxSummary, cbind data frames together: stats in rxSummary supported (fast)
3. As method 2, but build classification levels by pasting the grouping variable(s) together (moderately fast)
4. Split into multiple Xdfs by group, run dplyr::summarise on each, rbind xdfs together: arbitrary stats supported (slow)
5. Split into multiple Xdfs by group, run rxSummary on each, rbind xdfs together: stats in rxSummary supported (slowest, most scalable)
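To illustrate the idea behind method 3, the sketch below pastes two grouping variables into a single key in base R and compares the number of observed keys with the full cross product of factor levels. It is not the package's implementation: mtcars stands in for an Xdf data source, and the variable names (key, n_crossed, n_pasted, mean_mpg) are made up for this example.

```r
# Illustration only, in base R; mtcars stands in for an Xdf data source.
cyl  <- factor(mtcars$cyl)
gear <- factor(mtcars$gear)

# Size of the full cross-classification of the two grouping variables
n_crossed <- nlevels(cyl) * nlevels(gear)

# Paste the grouping variables together: only combinations that
# actually occur in the data become levels of the combined key
key <- factor(paste(cyl, gear, sep = "_"))
n_pasted <- nlevels(key)

# Summarise by the combined key
mean_mpg <- tapply(mtcars$mpg, key, mean)
```

Because the pasted key only contains combinations present in the data, its level count can never exceed the cross product, which is presumably why this approach helps when the cross product becomes very large.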
The default method is 1 if the data is grouped and the requested summary statistics are supported by rxCube; otherwise 2 if the requested statistics are supported by rxSummary; otherwise 4. Method 3 is supplied for the case where the product of factor levels for the grouping variables exceeds 2^32 - 1, a known limitation of rxCube and rxSummary.
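The default-selection rule above can be sketched as a small base R function. This is a hypothetical helper, not part of dplyrXdf: the names choose_default_method and exceeds_level_limit are invented here, and the summary_stats default is an assumed list of rxSummary-supported statistics (only the rxCube list of n, mean and sum comes from the text above).

```r
# Hypothetical helpers mirroring the rule described above; NOT dplyrXdf code.
choose_default_method <- function(stats, grouped,
                                  cube_stats = c("n", "mean", "sum"),
                                  # Assumed set, for illustration only
                                  summary_stats = c("n", "mean", "sum", "sd",
                                                    "var", "min", "max")) {
  # Method 1: grouped data and every statistic supported by rxCube
  if (grouped && all(stats %in% cube_stats)) return(1L)
  # Method 2: every statistic supported by rxSummary
  if (all(stats %in% summary_stats)) return(2L)
  # Otherwise fall back to method 4 (arbitrary statistics)
  4L
}

# Check whether the product of factor levels exceeds 2^32 - 1,
# the limit at which method 3 becomes relevant
exceeds_level_limit <- function(n_levels) {
  prod(as.numeric(n_levels)) > 2^32 - 1
}

m_cube    <- choose_default_method(c("mean", "sum"), grouped = TRUE)
m_summary <- choose_default_method(c("mean", "sd"), grouped = TRUE)
m_custom  <- choose_default_method("median", grouped = TRUE)
too_big   <- exceeds_level_limit(c(70000, 70000))
```

Note that method 3 is never chosen by this rule; following the text, it would have to be requested explicitly with .method=3.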
Supplying custom functions to summarise is supported, but they must be named functions (and will automatically cause .method=4 to be selected). Anonymous functions will cause an error.
Due to limitations in RevoScaleR support for HDFS, you should take note of the following:
The result of the summarise will be streamed to the client (either the edge node or a remote client) before being written back to HDFS.
If summarising over character grouping variables, it may be faster to specify .method=4 or 5. This is because the usual summarise functions, rxSummary and rxCube, require factor or numeric groups, and converting character to factor can be slow for HDFS data.
An object representing the summary. This depends on the .outFile argument: if missing, it will be an xdf tbl object; if NULL, a data frame; and if a filename, an Xdf data source referencing a file saved to that location.
See also: summarise in package dplyr, rxCube, rxSummary
mtx <- as_xdf(mtcars, overwrite=TRUE)
tbl <- summarise(mtx, m=mean(mpg))
as.data.frame(tbl)
tbl2 <- group_by(mtx, cyl) %>% summarise(m=mean(mpg))
as.data.frame(tbl2)
# filter and summarise simultaneously with .rxArgs
tbl3 <- summarise(mtx, m=mean(mpg), .rxArgs=list(rowSelection=cyl > 4))
as.data.frame(tbl3)
# compute a weighted mean
tbl4 <- summarise(mtx, m=mean(mpg), .rxArgs=list(pweights="wt"))
as.data.frame(tbl4)
# save to a persistent Xdf file
summarise(mtx, m=mean(mpg), .outFile="mtcars_summary.xdf")