Mutate Partitioner
In seplyr: Improved Standard Evaluation Interfaces for Common Data Manipulation Tasks

knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>"
)

seplyr::partition_mutate_qt() is a service supplied by the package seplyr (version 0.5.0 or higher).

seplyr::partition_mutate_qt() can partition a sequence of assignments so that no statement is using any value created in the same partition element or group. The partitions are in a format accepted by seplyr::mutate_se() for execution.

For such a partition the evaluation result does not depend on the order of execution of the statements in each group (as they are all independent of each other's left-hand-sides). A no-dependency small number of groups partition is very helpful when executing expressions on SQL based data interfaces (such as Apache Spark).

The method used to partition expressions is to scan the remaining expressions in order taking any that: have all their values available from earlier groups, do not use a value formed in the current group, and do not overwrite a value formed in the current group.

This partitioning method can lead to far fewer groups than the straightforward method of breaking up the sequence of expressions at each new-value use.

Here is a non-trivial example (notice we use := for assignment, that is a requirement of seplyr::mutate_se()):

library("seplyr")

plan <- partition_mutate_qt(
  rand_a := rand(),
   choice_a := rand_a>=0.5, # first use of a new value 1
    a_1 := ifelse(choice_a, # first use of a new value 2
                  'treatment', 
                  'control'),
    a_2 := ifelse(choice_a, 
                  'control', 
                  'treatment'),
  rand_b := rand(),
   choice_b := rand_b>=0.5, # first use of a new value 3
    b_1 := ifelse(choice_b, # first use of a new value 4
                  'treatment', 
                  'control'),
    b_2 := ifelse(choice_b, 
                  'control', 
                  'treatment'),
  rand_c := rand(),
   choice_c := rand_c>=0.5, # first use of a new value 5
    c_1 := ifelse(choice_c, # first use of a new value 6
                  'treatment', 
                  'control'),
    c_2 := ifelse(choice_c, 
                  'control', 
                  'treatment'),
  rand_d := rand(),
   choice_d := rand_d>=0.5, # first use of a new value 7
    d_1 := ifelse(choice_d, # first use of a new value 8
                  'treatment', 
                  'control'),
    d_2 := ifelse(choice_d, 
                  'control', 
                  'treatment'),
  rand_e := rand(),
   choice_e := rand_e>=0.5, # first use of a new value 9
    e_1 := ifelse(choice_e, # first use of a new value 10
                  'treatment', 
                  'control'),
    e_2 := ifelse(choice_e, 
                  'control', 
                  'treatment')
  )

print(plan)

Notice seplyr::partition_mutate_qt() split the work into 3 groups. The straightforward method (with no statement re-ordering) of splitting into non-dependent groups would have to split the mutate at each first use of a new value: yielding 10 splits or 11 mutate stages. For why a low number of execution stages is important please see here.

To execute the statements on a data-item "d" (either an in-memory data.frame or a remote database or Sparklyr handle) we would do something like the following:

res <- mutate_seb(d, plan)

A fully worked version of this example can be found here.

Note re-using variable values does limit the planner's ability to efficiently partition the the statement. The planner still emits safe and correct code, but unless it were to be allowed to introduce new variable names it must break sequences in more places (note prior to seplyr version 0.5.2 some dependencies were missed, leading to possibly unsafe code). We show this effect below.

plan <- partition_mutate_qt(
  rand := rand(),
   choice := rand>=0.5, # first use of a new value 1
    a_1 := ifelse(choice, # first use of a new value 2
                  'treatment', 
                  'control'),
    a_2 := ifelse(choice, 
                  'control', 
                  'treatment'),
  rand := rand(),
   choice := rand>=0.5, # first use of a new value 3
    b_1 := ifelse(choice, # first use of a new value 4
                  'treatment', 
                  'control'),
    b_2 := ifelse(choice, 
                  'control', 
                  'treatment'),
  rand := rand(),
   choice := rand>=0.5, # first use of a new value 5
    c_1 := ifelse(choice, # first use of a new value 6
                  'treatment', 
                  'control'),
    c_2 := ifelse(choice, 
                  'control', 
                  'treatment'),
  rand := rand(),
   choice := rand>=0.5, # first use of a new value 7
    d_1 := ifelse(choice, # first use of a new value 8
                  'treatment', 
                  'control'),
    d_2 := ifelse(choice, 
                  'control', 
                  'treatment'),
  rand := rand(),
   choice := rand>=0.5, # first use of a new value 9
    e_1 := ifelse(choice, # first use of a new value 10
                  'treatment', 
                  'control'),
    e_2 := ifelse(choice, 
                  'control', 
                  'treatment')
  )

print(plan)

Any scripts or data that you put into this service are public.

seplyr documentation built on Sept. 5, 2021, 5:12 p.m.

rdrr.io home R language documentation Run R code online

CRAN packages Bioconductor packages R-Forge packages GitHub packages

Note that we can't provide technical support on individual packages. You should contact the package authors for that.

seplyr
Improved Standard Evaluation Interfaces for Common Data Manipulation Tasks

Mutate Partitioner
In seplyr: Improved Standard Evaluation Interfaces for Common Data Manipulation Tasks

Try the seplyr package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

seplyr Improved Standard Evaluation Interfaces for Common Data Manipulation Tasks

Mutate Partitioner In seplyr: Improved Standard Evaluation Interfaces for Common Data Manipulation Tasks

Try the seplyr package in your browser

R Package Documentation

Browse R Packages

We want your feedback!

seplyr
Improved Standard Evaluation Interfaces for Common Data Manipulation Tasks

Mutate Partitioner
In seplyr: Improved Standard Evaluation Interfaces for Common Data Manipulation Tasks