do: Do arbitrary operations on a tbl

Description

The do verb converts the data to a data frame before running the operations. The do_xdf verb (formerly doXdf) keeps the data in Xdf format, so it is less limited by memory.

Usage

## S3 method for class 'RxFileData'
do(.data, ...)

## S3 method for class 'grouped_tbl_xdf'
do(.data, ...)

do_xdf(.data, ...)

doXdf(.data, ...)

## S3 method for class 'RxFileData'
do_xdf(.data, ...)

## S3 method for class 'grouped_tbl_xdf'
do_xdf(.data, ...)

## S3 method for class 'RxDataSource'
do(.data, ...)

## S3 method for class 'RxDataSource'
do_xdf(.data, ...)

Arguments

.data

A tbl for an Xdf data source; or a raw Xdf data source.

...

Expressions to apply.

Details

The difference between the do and do_xdf verbs is that the former converts the data into a data frame before running the expressions on it, while the latter passes the data as Xdf files. do is thus more flexible in the expressions it can run (basically anything that works with data frames), whereas do_xdf is better able to handle large datasets. The final output from do_xdf must still fit in memory (see Value).

do_xdf was called doXdf in previous versions of this package; it has been renamed to match dplyr's snake_case naming convention.

To run expressions on a grouped Xdf tbl, do and do_xdf split the data into one file per group, and the expressions are evaluated on each file. Note, however, that this may be slow if there are many groups and, for do, the operation will be limited by memory if the number of rows per group is large.
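
As a rough in-memory analogy for the per-group processing described above (not the package's actual implementation, which works with one Xdf file per group), the same split-then-apply pattern can be sketched in base R. The formula mpg ~ wt and the variable names groups and models are chosen purely for illustration.

```r
# Base R analogy only: dplyrXdf splits the data into Xdf files on disk,
# whereas split() keeps every group in memory as a data frame.
groups <- split(mtcars, mtcars$cyl)

# Evaluate the same expression once per group.
models <- lapply(groups, function(d) lm(mpg ~ wt, data = d))

# One fitted model per group; the work grows with the number of groups.
length(models)
names(models)
```

Because the expression runs once per group, a dataset with many small groups incurs that per-group overhead many times over, which is the slowdown the note above warns about.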

Value

The do and do_xdf verbs always return a data frame, unlike the other verbs for Xdf objects. This is because they are meant to execute code that can return arbitrarily complex objects, and Xdf files can only store atomic data.
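
To illustrate why a data frame is needed here, the sketch below builds by hand the kind of structure that dplyr's do produces for a named argument: one row per group, with the fitted objects held in a list column. It uses only base R and mtcars and is not output from do or do_xdf; the column name m and the formula are illustrative assumptions.

```r
# Hand-built illustration (base R only), not output from do() itself.
groups <- split(mtcars, mtcars$cyl)
res <- data.frame(cyl = as.numeric(names(groups)))

# A list column can hold arbitrary R objects, here lm fits;
# Xdf files, by contrast, can only store atomic columns.
res$m <- lapply(groups, function(d) lm(mpg ~ wt, data = d))

# Extract a single model from the list column and inspect it.
coef(res$m[[1]])
```

Individual results are then retrieved by indexing the list column, as in the last line.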

See Also

do in package dplyr

Examples

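library(dplyrXdf)

mtx <- as_xdf(mtcars, overwrite=TRUE)

# unnamed arg
# (the new columns are added to a local copy so that they appear in the result)
do(mtx, {
    df <- .
    df$mpg2 <- 2 * df$mpg
    df$cyl2 <- 2 * df$cyl
    df
})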

do_xdf(mtx, rxDataStep(., transformFunc=function(.data) {
    .data$mpg2 <- 2 * .data$mpg
    .data$cyl2 <- 2 * .data$cyl
    .data
}))

# named arg
do(mtx, m=lm(mpg ~ cyl, data=.))

do_xdf(mtx, m=rxLinMod(mpg ~ cyl, data=.))

# fitting multiple models to subsets of the data
if(require("nycflights13")) {
    flx <- as_xdf(flights, overwrite=TRUE)
    flx %>%
        group_by(carrier) %>%
        do(m=lm(arr_delay ~ dep_time, data=.))

    # with do_xdf: useful if each subset is very large, but called code must be Xdf-aware
    flx %>%
        group_by(carrier) %>%
        do_xdf(m2=rxLinMod(arr_delay ~ dep_time, data=.))
}

RevolutionAnalytics/dplyrXdf documentation built on June 3, 2019, 9:08 p.m.