group_by: Group an Xdf file by one or more variables


Description

Group an Xdf file by one or more variables

Usage

## S3 method for class 'RxFileData'
group_by(.data, ..., add = FALSE)

## S3 method for class 'RxXdfData'
group_by(.data, ..., add = FALSE)

## S3 method for class 'tbl_xdf'
group_by(.data, ..., add = FALSE)

## S3 method for class 'grouped_tbl_xdf'
group_by(.data, ..., add = FALSE)

## S3 method for class 'RxDataSource'
group_by(.data, ...)

Arguments

.data

An Xdf file or a tbl wrapping the same.

...

Variables to group by.

add

If FALSE (the default), group_by will ignore existing groups. If TRUE, add grouping variables to existing groups.

Details

When called on an Xdf file, group_by does not do any data processing; it only sets up the necessary metadata for verbs accepting grouped tbls to handle the data correctly. When called on a non-Xdf data source, it imports the data into an Xdf tbl.

Note that by default, the levels of the grouping variables for Xdf files are unsorted. This is for performance reasons, to avoid having to make unnecessary passes through the data.
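To illustrate what unsorted levels look like, here is a minimal in-memory analogy in base R. It is not dplyrXdf or RevoScaleR code, and it assumes that "unsorted" means the levels appear in the order they are first encountered in the data; the `carrier` vector is made up for illustration.

```r
# In-memory analogy only; this does not touch any Xdf file.
carrier <- c("UA", "AA", "DL", "AA", "UA")   # hypothetical grouping variable

# Levels in order of first appearance: no sort of the distinct values is needed
unsorted <- factor(carrier, levels = unique(carrier))
levels(unsorted)

# Levels sorted alphabetically (base R's default for factor())
sorted <- factor(carrier)
levels(sorted)

# The underlying values are the same; only the level order differs,
# and with it the order in which groups are listed
identical(as.character(unsorted), as.character(sorted))
```

If the order of groups matters for your output, it is safer not to rely on it being alphabetical.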

There are two options for handling grouped data: using the rxExecBy function supplied in the RevoScaleR package, or using dplyrXdf-internal code. The former is the default if the installed version of Microsoft R is 9.1 or higher.

Parallel by-group processing

Most verbs that have specific methods for grouped data split the data into multiple Xdf files and then process each file separately (the exception is summarise). This makes it easy to parallelise the processing of groups.
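The split-then-process idea can be sketched with an ordinary data frame in base R. This is a conceptual analogy only, not dplyrXdf's implementation: dplyrXdf works with Xdf files on disk rather than in-memory data frames, and `fit_one` is a hypothetical per-group task chosen for illustration.

```r
# Conceptual sketch with an in-memory data frame; dplyrXdf itself splits into Xdf files.
pieces <- split(mtcars, mtcars$cyl)   # one data frame per value of cyl

# Hypothetical per-group task: fit a small linear model to one piece
fit_one <- function(df) lm(mpg ~ wt, data = df)

# Each piece is processed independently of the others
models <- lapply(pieces, fit_one)

names(models)                                   # one result per group
sapply(models, function(m) coef(m)[["wt"]])     # per-group slope for wt
```

Because no piece depends on another, the `lapply` call could be replaced by a parallel map without changing the results, which is what makes by-group processing straightforward to parallelise.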

dplyrXdf will automatically use the RevoScaleR compute context that you set via rxSetComputeContext. For example, if you set the compute context to RxLocalParallel, dplyrXdf will create a cluster of worker nodes in the background and send the data to the nodes by group. The cluster is destroyed when the verb returns.

For more flexibility and scalability, you can set the compute context manually to RxForeachDoPar. This will create a persistent cluster that can be reused for multiple pipelines. The dopar compute context can also use clusters made up of multiple machines, not just multiple processes on a single machine; see the Examples section for an example of this.

The above applies to data stored in the native filesystem (either on your local machine or a network share). If your data is stored in a Hadoop or Spark cluster, dplyrXdf will similarly take advantage of the Hadoop and Spark compute contexts to process data in parallel on the worker nodes.

See Also

group_by in package dplyr, dplyrxdf_options for how to change the splitting procedure

Examples

mtx <- as_xdf(mtcars, overwrite=TRUE)
tbl <- group_by(mtx, cyl)
groups(tbl)
group_vars(tbl)

## parallel processing of groups with ForeachDoPar compute context
## Not run: 
flx <- as_tbl(nycflights13::flights)

doParallel::registerDoParallel(3)
dplyrxdf_options(useExecBy=FALSE)  # turn rxExecBy processing off
rxSetComputeContext("dopar")

flx %>%
    group_by(carrier) %>%
    do(m=lm(arr_time ~ dep_time + dep_delay + factor(month), data=.))

doParallel::stopImplicitCluster()

# ForeachDoPar also works with a cluster of multiple machines, not just multiple processes on one machine
# to work with dplyrXdf, all machines must have access to the same filesystem (e.g. a network share)
# doAzureParallel is available from GitHub: https://github.com/Azure/doAzureParallel
cl <- doAzureParallel::makeCluster("cluster.json")
doAzureParallel::registerDoAzureParallel(cl)

# set the dplyrXdf working directory to a cluster-accessible location
set_dplyrxdf_dir("n:/clusterdata")

flx2 <- as_xdf(nycflights13::flights, file="n:/clusterdata/flights.xdf")
flx2 %>%
    group_by(carrier) %>%
    do(m=lm(arr_time ~ dep_time + dep_delay + factor(month), data=.))

doAzureParallel::stopCluster(cl)
rxSetComputeContext("local")
dplyrxdf_options(useExecBy=TRUE)  # re-enable rxExecBy processing once we are done

## End(Not run)

RevolutionAnalytics/dplyrXdf documentation built on June 3, 2019, 9:08 p.m.