group_by: Group an Xdf file by one or more variables
In RevolutionAnalytics/dplyrXdf: Tools for working with Microsoft R Server Xdf files and the dplyr package

Description Usage Arguments Details Parallel by-group processing See Also Examples

Group an Xdf file by one or more variables

## S3 method for class 'RxFileData'
group_by(.data, ..., add = FALSE)

## S3 method for class 'RxXdfData'
group_by(.data, ..., add = FALSE)

## S3 method for class 'tbl_xdf'
group_by(.data, ..., add = FALSE)

## S3 method for class 'grouped_tbl_xdf'
group_by(.data, ..., add = FALSE)

## S3 method for class 'RxDataSource'
group_by(.data, ...)

`.data`	An Xdf file or a tbl wrapping the same.
`...`	Variables to group by.
`add`	If FALSE (the default), `group_by` will ignore existing groups. If TRUE, add grouping variables to existing groups.

When called on an Xdf file, group_by does not do any data processing; it only sets up the necessary metadata for verbs accepting grouped tbls to handle the data correctly. When called on a non-Xdf data source, it imports the data into an Xdf tbl.

Note that by default, the levels of the grouping variables for Xdf files are unsorted. This is for performance reasons, to avoid having to make unnecessary passes through the data.

There are two options for handling grouped data: use the rxExecBy function supplied in the RevoScaleR package, or via dplyrXdf-internal code. The former is the default if the version of Microsoft R installed is 9.1 or higher.

Most verbs that have specific methods for grouped data will split the data into multiple Xdf files, and then process each file separately (the exception is summarise. This makes it easy to parallelise the processing of groups.

dplyrXdf will automatically use the RevoScaleR compute context that you set via rxSetComputeContext. For example, if you set the compute context to RxLocalParallel, dplyrXdf will create a cluster of slave nodes in the background and send the data to the nodes by group. The cluster is destroyed when the verb returns.

For more flexibility and scalability, you can set the compute context manually to RxForeachDoPar. This will create a persistent cluster that can be reused for multiple pipelines. The dopar compute context can also use clusters made up of multiple machines, not just multiple processes on the single machine; see below for an example of this.

The above applies to data stored in the native filesystem (either on your local machine or a network share). If your data is stored in a Hadoop or Spark cluster, dplyrXdf will similarly take advantage of the Hadoop and Spark compute contexts to process data in parallel on the worker nodes.

group_by in package dplyr, dplyrxdf_options for how to change the splitting procedure

mtx <- as_xdf(mtcars, overwrite=TRUE)
tbl <- group_by(mtx, cyl)
groups(tbl)
group_vars(tbl)

## parallel processing of groups with ForeachDoPar compute context
## Not run: 
flx <- as_tbl(nycflights13::flights)

doParallel::registerDoParallel(3)
dplyrxdf_options(useExecBy=FALSE)  # turn rxExecBy processing off
rxSetComputeContext("dopar")

flx %>%
    group_by(carrier) %>%
    do(m=lm(arr_time ~ dep_time + dep_delay + factor(month), data=.))

doParallel::stopImplicitCluster()

# ForeachDoPar also works with a cluster of multiple machines, not just multiple processes on one machine
# to work with dplyrXdf, all machines must have access to the same filesystem (eg a network share)
# doAzureParallel is available from GitHub: https://github.com/Azure/doAzureParallel
cl <- doAzureParallel::makeCluster("cluster.json")
doAzureParallel::registerDoAzureParallel(cl)

# set the dplyrXdf working directory to a cluster-accessible location
set_dplyrxdf_dir("n:/clusterdata")

flx2 <- as_xdf(nycflights13::flights, file="n:/clusterdata/flights.xdf")
flx2 %>%
    group_by(carrier) %>%
    do(m=lm(arr_time ~ dep_time + dep_delay + factor(month), data=.))

doAzureParallel::stopCluster(cl)
rxSetComputeContext("local")
dplyrxdf_options(useExecBy=TRUE)  # re-enable rxExecBy processing once we are done

## End(Not run)

RevolutionAnalytics/dplyrXdf documentation built on June 3, 2019, 9:08 p.m.

RevolutionAnalytics/dplyrXdf index

README.md

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

RevolutionAnalytics/dplyrXdf
Tools for working with Microsoft R Server Xdf files and the dplyr package

group_by: Group an Xdf file by one or more variables
In RevolutionAnalytics/dplyrXdf: Tools for working with Microsoft R Server Xdf files and the dplyr package

Description

Usage

Arguments

Details

Parallel by-group processing

See Also

Examples

Related to group_by in RevolutionAnalytics/dplyrXdf...

R Package Documentation

Browse R Packages

We want your feedback!

RevolutionAnalytics/dplyrXdf Tools for working with Microsoft R Server Xdf files and the dplyr package

group_by: Group an Xdf file by one or more variables In RevolutionAnalytics/dplyrXdf: Tools for working with Microsoft R Server Xdf files and the dplyr package

Description

Usage

Arguments

Details

Parallel by-group processing

See Also

Examples

Related to group_by in RevolutionAnalytics/dplyrXdf...

R Package Documentation

Browse R Packages

We want your feedback!

RevolutionAnalytics/dplyrXdf
Tools for working with Microsoft R Server Xdf files and the dplyr package

group_by: Group an Xdf file by one or more variables
In RevolutionAnalytics/dplyrXdf: Tools for working with Microsoft R Server Xdf files and the dplyr package