pied is designed to cleanly separate the elements of a workflow. A computational pipeline is a composition of a few primary elements, and the resulting code should be readable, reusable, and flexible. pied's real power is in highly branched workflows with complex data and many visualization and logging requirements.
In this example I will build a simple hierarchical clustering of the iris data set.
library(pied)
library(magrittr)

# Load a builtin R dataset, dropping the non-numeric column
f_load_data <- function() {
  data(iris, envir = environment())
  iris$Species <- NULL
  iris
}

# Normalize the numeric columns to zero mean and unit variance
f_normalize <- function(x) {
  as.data.frame(lapply(x, function(col) (col - mean(col)) / sd(col)))
}

# Build a distance matrix
f_dist <- function(x, ...) {
  dist(x, ...)
}

# Build a hierarchical cluster
f_hclust <- function(x, ...) {
  hclust(x, ...)
}
With a bit of magrittr magic these can be composed into a simple pipeline that produces a single plot.
f_load_data() %>% f_normalize %>% f_dist %>% f_hclust %>% plot
This is beautiful, but what if we want to validate the inputs of each step? I can add validation code to each function, raising an error when a check fails:
# Normalize the numeric columns
f_normalize_2 <- function(x) {
  expected_names <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
  if (!(is.data.frame(x) &&
        length(expected_names) == length(names(x)) &&
        all(names(x) == expected_names) &&
        all(unlist(lapply(x, is.numeric))))) {
    stop("Bad input")
  }
  as.data.frame(lapply(x, function(col) (col - mean(col)) / sd(col)))
}

# Build a distance matrix
f_dist_2 <- function(x, ...) {
  if (!(is.data.frame(x) &&
        all(unlist(lapply(x, is.numeric))) &&
        all(unlist(lapply(x, function(col) abs(mean(col)) < 1e-5))) &&
        all(unlist(lapply(x, function(col) abs(sd(col) - 1) < 1e-5))))) {
    stop("Bad input")
  }
  dist(x, ...)
}

# Build a hierarchical cluster
f_hclust_2 <- function(x, ...) {
  if (!inherits(x, "dist")) {
    stop("Bad input")
  }
  hclust(x, ...)
}
Now what if we want to print intermediate results? We can further modify our functions.
f_dist_3 <- function(x, ...) {
  if (!(is.data.frame(x) &&
        all(unlist(lapply(x, is.numeric))) &&
        all(unlist(lapply(x, function(col) abs(mean(col)) < 1e-5))) &&
        all(unlist(lapply(x, function(col) abs(sd(col) - 1) < 1e-5))))) {
    stop("Bad input")
  }
  result <- dist(x, ...)
  print(summary(result))
  result
}
f_load_data() %>% f_normalize_2 %>% f_dist_3 %>% f_hclust_2 %>% nothing
Note that nothing is just a function that prevents anything from being printed to output (like /dev/null), included only to avoid cluttering the vignette. Now what if we decide we don't want to always print summaries, but only when verbose is set?
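Since nothing is not defined in this excerpt, a minimal stand-in (only a sketch, not necessarily pied's actual definition) could be:

```r
# Hypothetical stand-in for `nothing`: swallow any input and return
# invisibly, so a pipeline ending in it prints nothing
nothing <- function(...) invisible(NULL)
```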
f_dist_4 <- function(x, verbose = FALSE, ...) {
  if (!(is.data.frame(x) &&
        all(unlist(lapply(x, is.numeric))) &&
        all(unlist(lapply(x, function(col) abs(mean(col)) < 1e-5))) &&
        all(unlist(lapply(x, function(col) abs(sd(col) - 1) < 1e-5))))) {
    stop("Bad input")
  }
  result <- dist(x, ...)
  if (verbose) {
    print(summary(result))
  }
  result
}
f_load_data() %>% f_normalize_2 %>% f_dist_4(verbose=TRUE) %>% f_hclust_2 %>% nothing
Now what if we also want to optionally plot the intermediate data and optionally cache the results?
f_dist_5 <- function(x, verbose = FALSE, doplot = FALSE, cache = "", ...) {
  if (!(is.data.frame(x) &&
        all(unlist(lapply(x, is.numeric))) &&
        all(unlist(lapply(x, function(col) abs(mean(col)) < 1e-5))) &&
        all(unlist(lapply(x, function(col) abs(sd(col) - 1) < 1e-5))))) {
    stop("Bad input")
  }
  if (nchar(cache) > 0 && file.exists(cache)) {
    load(cache)
  } else {
    result <- dist(x, ...)
    if (nchar(cache) > 0) {
      save(result, file = cache)
    }
  }
  if (verbose) {
    print(summary(result))
  }
  if (doplot) {
    plot(result)
  }
  result
}
f_load_data() %>% f_normalize_2 %>% f_dist_5(verbose=TRUE, cache="load_data.Rdata") %>% f_hclust_2 %>% nothing
This is still a very minimal example of how bloated a function can become. The original trivial function function(x, ...) { dist(x, ...) } has grown into a 20-line unreadable mess. Even worse, this mess may need to be duplicated in the other functions.
pied's solution is to partition each function in the pipeline into the following elements: a pure function, effect functions, a cache function, a validator function, and a pass and a fail function. This composite function can be easily created with the hnode command.
Here is the minimal pied example:

h_load_data <- hwell('a')
h_normalize <- hpipe('a -> b')
h_dist <- hpipe('b -> c')
h_hclust <- hpipe('c -> d')

h_fun(h_load_data) <- f_load_data
h_fun(h_normalize) <- f_normalize
h_fun(h_dist) <- f_dist
h_fun(h_hclust) <- f_hclust

connect('h_load_data --> h_normalize --> h_dist --> h_hclust')

h_hclust()
hnode returns a function that should normally be called with no arguments. Each call recursively calls the input functions until a source function is reached.
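This pull-based evaluation can be sketched outside of pied. The following toy make_node is a hypothetical illustration of the idea, not pied's implementation: each node closes over its pure function and its upstream node, and calling the node with no arguments recursively evaluates upstream first.

```r
# Hypothetical sketch of pull-based node evaluation (not pied's real API).
make_node <- function(fun, input = NULL) {
  function() {
    if (is.null(input)) {
      fun()          # source node: no upstream dependency
    } else {
      fun(input())   # pipe node: pull from upstream, then apply
    }
  }
}

n_load <- make_node(function() c(3, 1, 2))
n_sort <- make_node(sort, input = n_load)
n_sort()   # evaluates n_load first, then sorts: 1 2 3
```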
Now we can begin expanding upon these basic functions. First I will add validators. Rather than writing validation code into the pure functions, I will write them as independent functions.
# Validate the raw iris data frame
v_iris <- function(x) {
  expected_names <- c("Sepal.Length", "Sepal.Width", "Petal.Length", "Petal.Width")
  is.data.frame(x) &&
    length(expected_names) == length(names(x)) &&
    all(names(x) == expected_names) &&
    all(unlist(lapply(x, is.numeric)))
}

# Validate a normalized data frame
v_normed_df <- function(x, ...) {
  is.data.frame(x) &&
    all(unlist(lapply(x, is.numeric))) &&
    all(unlist(lapply(x, function(col) abs(mean(col)) < 1e-5))) &&
    all(unlist(lapply(x, function(col) abs(sd(col) - 1) < 1e-5)))
}

# Validate a distance matrix
v_dist <- function(x, ...) {
  inherits(x, "dist")
}

h_val(h_normalize) <- v_iris
h_val(h_dist) <- v_normed_df
h_val(h_hclust) <- v_dist

h_hclust()
Now let's add an output summary function to each node:

summary_and_print <- compose(summary, print)

h_effect(h_load_data) <- summary_and_print
h_effect(h_normalize) <- summary_and_print
h_effect(h_dist) <- summary_and_print
h_effect(h_hclust) <- summary_and_print
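If compose is not available in your session, a minimal left-to-right composition (only a sketch, not necessarily pied's definition) could be defined as:

```r
# Minimal left-to-right composition: apply f first, then g,
# so compose(summary, print) prints the summary of its input
compose <- function(f, g) function(x) g(f(x))

summary_and_print <- compose(summary, print)
```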
Now we can run the pipeline:
h_hclust()
Full executable code
library(pied)

# Load a builtin R dataset, dropping the non-numeric column
f_load_data <- function() {
  data(iris, envir = environment())
  iris$Species <- NULL
  iris
}

# Normalize the numeric columns
f_normalize <- function(x) {
  as.data.frame(lapply(x, function(col) (col - mean(col)) / sd(col)))
}

# Build a distance matrix
f_dist <- function(x, ...) {
  dist(x, ...)
}

# Build a hierarchical cluster
f_hclust <- function(x, ...) {
  hclust(x, ...)
}

h_load_data <- hwell('a')
h_normalize <- hpipe('a -> b')
h_dist <- hpipe('b -> c')
h_hclust <- hpipe('c -> d')

h_fun(h_load_data) <- f_load_data
h_fun(h_normalize) <- f_normalize
h_fun(h_dist) <- f_dist
h_fun(h_hclust) <- f_hclust

connect('h_load_data --> h_normalize --> h_dist --> h_hclust')

h_hclust()