Thoughts and observations on whether the basic functionality of this package will be useful. Upshot: I'm going with a tentative yes, with a couple of caveats:
This is partly motivated by the "fail fast" approach. I tend to think that dataflow isn't really solvable in a general sense, but that it can still be useful if it works for the most common DS code patterns.
Test plot_flow() on a variety of in-the-wild R scripts. This will probably reveal bugs, so it may take some time (try not to get discouraged). I think it's okay to iterate and fix serious bugs, but my main goal is to document what I find (bugs found, features to prioritize, etc.). And, of course, to assess whether dataflow appears feasible at all.
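To make "dataflow" concrete, here is a minimal base R sketch of pulling input-to-output edges out of a script. It is not how plot_flow() works (these notes don't describe its internals). The script text and the names svy, pop, wts, and out are made up for illustration, and the all.vars() approach is my own stand-in.

```r
# Hypothetical script text (assumption: not one of the scripts discussed here)
script <- "
svy <- read.csv('svy.csv')
pop <- read.csv('pop.csv')
wts <- merge(svy, pop)
out <- transform(wts, w = n / N)
"
exprs <- parse(text = script)

# TRUE for top-level calls of the form `name <- value`
is_assign <- function(e) {
  is.call(e) && identical(e[[1]], as.name("<-")) && is.name(e[[2]])
}

# One row per (variable used on the right-hand side, variable assigned)
edges <- do.call(rbind, lapply(exprs, function(e) {
  if (!is_assign(e)) return(NULL)
  sources <- all.vars(e[[3]])
  if (length(sources) == 0) return(NULL)
  data.frame(from = sources, to = as.character(e[[2]]),
             stringsAsFactors = FALSE)
}))

# all.vars() also picks up column names used via non-standard evaluation
# (n and N above), so keep only sources that the script itself assigns.
targets <- vapply(Filter(is_assign, as.list(exprs)),
                  function(e) as.character(e[[2]]), character(1))
edges <- edges[edges$from %in% targets, ]
```

The filtering step hints at why I doubt the general case is solvable: without it, the column names n and N would show up as inputs, and a crude filter like this one can just as easily drop real inputs in less tidy code.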
One of the reasons I'm feeling good about the POC is that I discovered some things about the POC scripts by examining the dataflow. If the patterns I observed have generalized utility, then the approach is probably worthwhile. People are good at visual pattern recognition, and even with a short timeframe I was able to map out some pattern categories. The first two were concepts I had already internalized, but the last three never really caught my attention until I examined a variety of dataflows:
Multiple inputs with one output. This is a natural pattern in data processing and has some nice characteristics (easy to understand, modularize, etc.). The example below doesn't precisely follow that pattern since it has two outputs: a table of computed weights and a survey dataset with the weights applied. However, saving the weights table is a side-effect and the dataset itself is a direct input to the output survey table.
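A toy version of this shape, as a sketch only: the data, column names, and file name are invented and do not come from the actual weighting script. Two inputs feed one primary output, and saving the weights table is the side effect.

```r
# Hypothetical inputs (assumption: invented for illustration)
svy <- data.frame(id = 1:4, stratum = c("a", "a", "b", "b"))
pop <- data.frame(stratum = c("a", "b"), N = c(100, 300))

# Interim result: one weight per stratum (population count / sample count)
n_by_stratum <- aggregate(id ~ stratum, data = svy, FUN = length)
names(n_by_stratum)[2] <- "n"
wts <- merge(n_by_stratum, pop, by = "stratum")
wts$weight <- wts$N / wts$n

# Side effect: saving the weights table adds a second output node
write.csv(wts, file.path(tempdir(), "weights.csv"), row.names = FALSE)

# Primary output: the survey rows with weights applied
svy_weighted <- merge(svy, wts[c("stratum", "weight")], by = "stratum")
```

Note that svy appears twice as an input: once for computing the weights and once, directly, for the weighted output.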
The opposite of assemble (i.e., fewer inputs lead to more outputs). This is probably a natural pattern in reporting-type analysis, but I suspect it can indicate repetition (a probable antipattern) in other contexts.
A linked expand-assemble. I think it naturally arises when you have one (or few) highly enriched datasets. For example, a survey dataset contains high dimensionality in one table, and you often need to compute multiple interim inputs for an output of interest (e.g., participation rate).
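A minimal sketch of the expand-assemble shape, assuming a made-up survey table (the columns state, hunted, and fished are hypothetical): one wide input expands into several interim tables, which are then assembled into the rates of interest.

```r
# Hypothetical wide survey table (assumption: invented for illustration)
svy <- data.frame(
  state  = c("NC", "NC", "NC", "VA", "VA"),
  hunted = c(TRUE, FALSE, TRUE, FALSE, FALSE),
  fished = c(TRUE, TRUE, FALSE, FALSE, TRUE)
)

# Expand: several interim inputs computed from the one dataset
respondents <- aggregate(hunted ~ state, data = svy, FUN = length)
names(respondents)[2] <- "n"
hunters <- aggregate(hunted ~ state, data = svy, FUN = sum)
anglers <- aggregate(fished ~ state, data = svy, FUN = sum)

# Assemble: merge the interim tables into one output
rates <- Reduce(function(x, y) merge(x, y, by = "state"),
                list(respondents, hunters, anglers))
rates$hunt_rate <- rates$hunted / rates$n
rates$fish_rate <- rates$fished / rates$n
```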
I originally called this spaghetti because of the confused criss-crossing of edges. On closer inspection a pattern of parallel pipelines occasionally intersecting appeared to be the cause (i.e., interdependence). I suspect there are a variety of cases where analysts will fall into this anti-pattern, and some of them are probably more problematic than others:
Checks: Running checks (in analysis) to ensure an output conforms to expectations given values in other datasets.
Summary Columns: Splitting a dataset into a relational model, and then adding summary columns (for ease of analysis) to higher-level dimensions. It can create interdependence when building higher and lower-level dimensions.
Note that the sitepairs.R example from the CodeDepends package is another candidate for this interact pattern (and timeseries.R is a nice example of complexity).
Another cause of criss-crossing edges is when multiple inputs are used in multiple outputs. It looks awful, but it could be a simple matter of a functional approach (i.e., maybe one input varies to produce different outputs). It may still be a bit of an anti-pattern since the example I used suggests a split-apply-combine strategy would be appropriate (which would eliminate the criss-crossing).
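A rough sketch of the difference, using an invented sales table (hypothetical, not the sale table from my scripts): the repetitive version creates one named output per group, each reading the same input, while split-apply-combine produces a single output node.

```r
# Hypothetical input (assumption: invented for illustration)
sales <- data.frame(
  year   = c(2019, 2019, 2020, 2020, 2020),
  amount = c(10, 20, 5, 15, 25)
)

# Repetitive version: one output per year, so edges fan out from `sales`
total_2019 <- sum(sales$amount[sales$year == 2019])
total_2020 <- sum(sales$amount[sales$year == 2020])

# Split-apply-combine: split by year, summarise each piece, bind the results
by_year <- split(sales, sales$year)
totals <- do.call(rbind, lapply(by_year, function(d) {
  data.frame(year = d$year[1], total = sum(d$amount))
}))
```

In a dataflow graph the second version would presumably collapse to sales, by_year, and totals, whatever the number of years.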
This is a somewhat different case in which the pop dataset is repeatedly used in interim inputs. The criss-crossing goes away if you were to collapse the interim results, but it's a funny-looking pattern nonetheless.
I think I've found a good candidate in my nc > 2a-standardize-hunt-fish.R. Some ideas:
It seems partly a consequence of leaning so much on the sale table, which has an expansion pattern.
However, the expansion pattern itself doesn't cause interdependence; rather, the interaction with other patterns does:
Challenge: Code that is highly repetitive (e.g., NE Analysis and Sample Pull.Rmd) is a common anti-pattern.
Opportunity: The visualization does make the repetition jump out, which may be useful in itself. An interactive node collapsing option would also provide a means to navigate this complexity. Are there existing tools that could be tapped into?
The NE Analysis and Sample Pull.Rmd code looks bizarre to me (e.g., lines 43 to 50). I wonder how common these sorts of patterns are?