In SciViews/flow: Data Analysis Work Flow and Pipeline Operator for 'SciViews::R'

knitr::opts_chunk$set(collapse = TRUE, comment = "#>")
library(svFlow)

Pure, predictable, pipeable, ... and data-aware

Hadley Wickham advocates for pure, predictable and pipeable functions in the tidyverse. Non-standard evaluation (NSE) makes many tidyverse functions not referentially transparent, which makes them harder to use in reusable contexts such as inside a function, but it also contributes to a cleaner language, at least for beginneRs. With {svFlow}, we want both to make tidyverse-style NSE more easily reusable, and to make the data analysis workflow based on pipelines and the pipe operator ({magrittr}'s pipe operator in the tidyverse) even more data-aware.

Choice of the name

The initial name was {flow}, because it is short. However, it is now used by another package on CRAN. Other names such as {workflow} and {workplan} are also already in use. Hence {svFlow}, for SciViews' flow.

The {performanceEstimation} package has a {Workflow} object and a Workflow() function. The {zoon} package also has a workflow() function, but it creates a {zoonWorkflow} object, so there is no clash here. In the {drake} package, now superseded by {targets}, there is also a workflow() function, but it is deprecated in favor of workplan(). {targets} is used to organize different analyses in data.frames. Hence, as we see here, the names {workflow} and {workplan} are already widely used in the R ecosystem.

The {flowr} package uses (internal) flow() and is.flow() functions, and a flow S3 object. It targets complex bioinformatics (work)flows, but it is a potential source of problems when the {flowr} and {svFlow} packages are used simultaneously, if both objects bear the same class name. That is why, in {svFlow}, objects are named Flow with an uppercase F, to avoid such a conflict.

A simpler and more efficient pipe operator?

The {wrapr} package provides an alternate pipe operator: %.>%, the "dot arrow pipe". It is very simple:

"a %.>% b" is to be treated as if the user had written "{ . <- a; b };" with "%.>%" being treated as left-associative.

There are four interesting points with this pipe operator:

It does not alter the expression evaluated, and the dot can be placed anywhere in the expression. This means that any expression is suitable and is "pipe-aware" with this operator.
It is explicit: you do not have to guess what will happen, because the location of the dot replacement(s) in the expression is explicitly indicated. This should also make it easier to understand from a beginneR's point of view, provided they are not already used to the {magrittr} style, of course.
Since the expression is not reworked, it is very fast in comparison to the complex rules that must be computed each time you call {magrittr}'s %>%.
Unlike the base R pipe operator |> introduced in R 4.1.0, it is not just syntactic sugar that transforms the code into nested function calls internally. The base R pipe operator has many advantages, but also many limitations that %>.% tries to eliminate.
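
The "{ . <- a; b }" semantics quoted above can be sketched in a few lines of base R. This is only an illustration under stated assumptions, not the {wrapr} or {svFlow} implementation: the operator name %dot>% is hypothetical and was chosen so that it does not mask the real operators.

```r
# Hypothetical operator name; a sketch, not the {wrapr} or {svFlow} code
`%dot>%` <- function(lhs, rhs) {
  env <- parent.frame()
  # Forcing lhs runs the earlier pipeline steps first; the result is then
  # assigned to . in the calling environment
  assign(".", lhs, envir = env)
  # The right-hand side expression is evaluated as written, in the caller
  eval(substitute(rhs), envir = env)
}

res <- c(1, 4, 9) %dot>% sqrt(.) %dot>% sum(.)
res

# The calling environment was modified: . still holds the last piped value
.
```

Because the right-hand side is never rewritten, the dot can appear anywhere in it, and the only work done per step is one assignment and one evaluation.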

The only drawback of this pipe operator is that it is not pure, since it modifies the calling environment (it assigns . in it before evaluating the right-hand side expression). However, if you never use . as a name for other objects, this is not much of a problem. In {wrapr}, there is a synonym, %>.%, which its author never uses in the examples, vignettes or on his blog. So, we decided to reuse %>.% as our pipe operator in {svFlow}. We add two things to it:

It is also aware of Flow objects (see below) and behaves accordingly,
The expression to be evaluated is also recorded in the calling environment as .call. This makes it easy to debug the last expression that failed during the pipeline execution: since . is also available, one can inspect it, rerun eval(.call), or use debug_flow() to get extra information:

library(svFlow)
# An example pipeline with an error in the middle:
library(dplyr)
iris %>.%
  filter(., Sepal.Length < 5.1, Sepal.Width < 3.1) %>.%
  mutate(., logS = log(Species)) %>.% # Fails: Species is a factor
  group_by(., Species) %>.%
  summarise(., mean_logS = mean(logS))

# Both . and .call are available and can be explored
head(.)
.call
eval(.call)

... or even more easily:

debug_flow()

From there, you can manipulate ., .call, or both, and rerun debug_flow() to fix the pipeline.
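
To make the role of . and .call concrete, here is a minimal base R sketch of a pipe that records the expression it is about to evaluate. It is an assumption-laden illustration, not the {svFlow} code: the operator name %rec>% is hypothetical, and debug_flow() is not reproduced here.

```r
# Hypothetical operator name; a sketch of the recording idea only
`%rec>%` <- function(lhs, rhs) {
  env <- parent.frame()
  assign(".", lhs, envir = env)
  expr <- substitute(rhs)
  assign(".call", expr, envir = env)  # remember the step being evaluated
  eval(expr, envir = env)
}

# The second step fails because log10() receives a character string
res <- try("100" %rec>% trimws(.) %rec>% log10(.), silent = TRUE)
inherits(res, "try-error")

# Inspect the failing step, fix its input, and rerun only that step
.call
. <- as.numeric(.)
eval(.call)
```

After the failure, . holds the input of the failing step and .call holds that step unevaluated, so the step can be rerun without replaying the whole pipeline.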

Mixing Pipe() and proto(): the Flow object

In {pipeR}, Kun Ren proposes several alternatives to the now traditional {magrittr} pipe operator (%>%). Pipe() is interesting because it essentially encapsulates the pipeline steps inside an object. The pipe operator is then simply replaced by $. The similarity of the $ operator for Pipe and proto objects (from the {proto} package) is striking, although they are designed with different purposes in mind. The proto objects are class-less, prototype-based objects that support simple inheritance. They are convenient for manipulating sets of objects in a common place and, internally, they use an environment to store these objects. Pipe objects also use an environment internally to store everything related to the pipeline operations. However, there is no means to add custom objects, nor to define inheritance between Pipe objects.

Satellite variables may be used in pipelines. They are currently placed in the calling environment (usually .GlobalEnv), and they "pollute" it. With the pipe, there is no means to define "local" variables as one would, say, in a function. Yet, if we could combine the Pipe behavior for pipeline operations with proto objects to store various items locally and allow inheritance, this would be a wonderful way to drive analysis workflows. The Flow object does just that.

Flow objects are indeed proto objects with a .value item that contains the result of the last pipeline operation. The pipe operator %>.% behaves differently when a Flow object (constructed using flow()) is passed to it: (1) . is taken from flow_obj$.value, and the result updates it; (2) a .. object, which is the Flow object itself, is created in the calling environment. That way, one can access items stored in the Flow object with ..$item within pipeline expressions. This makes it possible to embed temporary pipeline variables directly in the Flow object.
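
The idea of keeping .value and satellite variables together in one object that supports inheritance can be illustrated with plain base R environments. This sketch makes no claim about how {proto} or {svFlow} are implemented: new_flow_sketch() is a hypothetical helper, and an ordinary environment stands in for the Flow object.

```r
# Hypothetical constructor: a plain environment standing in for a Flow object
new_flow_sketch <- function(value, ..., parent = emptyenv()) {
  obj <- new.env(parent = parent)
  vars <- list(...)
  # Store satellite variables inside the object, not in the caller
  for (nm in names(vars)) assign(nm, vars[[nm]], envir = obj)
  assign(".value", value, envir = obj)
  obj
}

base_fl <- new_flow_sketch(1:10, thresh = 5)

# A child object inherits thresh from base_fl without copying it
child_fl <- new_flow_sketch(base_fl$.value, parent = base_fl)
child_fl$.value <-
  child_fl$.value[child_fl$.value > get("thresh", envir = child_fl)]
child_fl$.value

# Only .value is stored locally in the child
ls(child_fl, all.names = TRUE)
```

Nothing but the two objects themselves is created in the calling environment, which is the "local variables" behavior described above.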

The second pipe operator in the {svFlow} package, %>_%, does the opposite of %>.%: it constructs a Flow object if it does not receive one, and returns a Flow object containing the results in flow_obj$.value. Finally, to get the value out of a Flow object, one can end the pipeline with %>_% ., which extracts flow_obj$.value and returns it. Here is an example of use:

data(iris)
fl <- iris %>_% # Create a Flow object
  filter(., Sepal.Length < 5.1, Sepal.Width < 3.1) %>_%
  mutate(., logSL = log(Sepal.Length))
# Interrupt the pipeline, and inspect or modify the flow object:
fl

With the Flow object, you can continue the pipeline where you left it, because all the required variables are recorded inside it.

fl %>_%
  group_by(., Species) %>_%
  summarise(., mean_logSL = mean(logSL)) %>_% . # Get final result

With the flow() function, you can explicitly create the Flow object and easily add variables to it, including those you want to keep as quosures (by ending their names with _):

fl <- flow(iris, var1_ = Sepal.Length, thresh1 = 5.1)
str(fl)

Note that a quosure is recorded as var, not var_! Indeed, everything works as if the trailing underscore were a unary suffix operator applied to var, which converts it into a quosure.

You could use var in the pipeline expression to manipulate the quosure directly, but you would most probably use var_, which also treats var as a tidyeval expression and unquotes it transparently in non-standard expressions. Here is the same pipeline as above, but with all the possible variables stored either as quosures or as ordinary R objects inside the Flow object:

fl <- flow(iris,
    var1_      = Sepal.Length,
    var2_      = Sepal.Width,
    var_group_ = Species,
    var1_log_  = logSL,
    var1_mean_ = mean_logSL,
    thresh1    = 5.1,
    thresh2    = 3.1) %>_%
  filter(., var1_ < thresh1_, var2_ < thresh2_) %>_%
  {..$temp_data <- mutate(., var1_log_ = log(var1_))} %>_%
  group_by(., var_group_) %>_%
  summarise(., var1_mean_ = mean(var1_log_))
str(fl)
fl$temp_data # The temporary variable
fl %>_% . # The final results

Notice that even standard variables, like thresh1 or thresh2, must be written thresh1_ and thresh2_ for them to be looked up inside the Flow object. Otherwise, they are looked up in the calling environment as usual. Also, the Flow object can be accessed and manipulated directly through .. if you need to.