I use Hadley Wickham's 'plyr' package a lot
I don't want people to become enamored of this package: I want people to experiment with it, to debug it, to make it fast and to figure out what the obvious extensions are.
The plyr package provides a very powerful implementation of the split-apply-combine strategy outlined by Hadley Wickham in his paper, "The Split-Apply-Combine Strategy for Data Analysis". In this post, I'm going to describe a simple general-purpose extension to the implementation of this strategy as it stands in plyr and link to code that implements this extension.
Before describing the extension in its most generic formulation, I'll describe a simple data analysis problem in which the need for this alternative formulation of the split-apply-combine strategy arises. Let's assume that we have the following data set:
[SHOW INPUT_DATA]
In this data set, two different subjects in a psychology experiment have performed an experimental task repeatedly in 2 blocks, each of which contained 3 trials. For each trial, we have recorded the subjects' reaction time (RT). For theoretical reasons, we wish to compute, at a trial-by-trial level, the mean RT of all previous trials in the current block. The desired output of this computation is:
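The post displays the data set inline; as a stand-in, a purely hypothetical data frame with the shape described above (two subjects, two blocks of three trials each, one RT per trial) might look like:

```r
# Hypothetical stand-in for the post's input data. The Subject/Block/Trial
# structure matches the text; the RT values are made up for illustration only.
input_data <- data.frame(
  Subject = rep(1:2, each = 6),
  Block   = rep(rep(1:2, each = 3), times = 2),
  Trial   = rep(1:3, times = 4),
  RT      = c(512, 430, 387, 601, 544, 498,
              444, 470, 401, 530, 486, 459)
)
```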
[SHOW OUTPUT_DATA]
If we didn't have to consider subjects and blocks, we could write a function analogous to cumsum (called, for example, cummean) and simply call that function once. But we need to perform this cummean operation on many splits of the data and then combine the results back together again. This is much more programming work, especially when this sort of thing could be solved once and for all if we articulated the general strategy and wrote a broader algorithm.
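There is no cummean in base R, but a minimal version is only a couple of lines. This is my own sketch, following the cumsum analogy in the text: it returns, at each position, the mean of all strictly earlier values, with NA for the first trial, since no previous trials exist there.

```r
# A minimal cummean: the running mean of all *previous* elements.
# cumsum(x) / seq_along(x) is the running mean *including* the current
# element; shifting it right by one and padding with NA gives the mean
# of everything seen so far at each position.
cummean <- function(x) {
  c(NA, head(cumsum(x) / seq_along(x), -1))
}

cummean(c(2, 4, 6))
# [1] NA  2  3
```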
In an abstract way, the task I've described clearly falls under the rubric of split-apply-combine (and, in fact, is mentioned in the split-apply-combine paper), but it is not possible to perform this computation using plyr functions, because ddply splits the data into disjoint sets, to each of which the desired function is applied. One way to think of this splitting procedure is as the use of equality constraints: there are N variables, and the M splits generated by those variables correspond to data points for which the settings of the N variables take on some exact value. In our case, we cannot use equality constraints, but must use inequality constraints: we wish to split the data so that each of the M splits corresponds to a subset of the data in which the values of some of the N variables are below some threshold. I then combine the results, using the sequence of upper bounds from the iterated inequality constraints as my indices.
Of course, when seen in that way, this design pattern is an obvious extension of Hadley's split-apply-combine pattern: relax the equalities to inequalities. There's very little depth to this, but it considerably enlarges the scope of computations that I can perform using a plyr-style function. On GitHub, I've set up an R package containing a simple implementation of cumddply, which performs this inequality-constrained splitting. It takes as inputs a data set, a set of equality constraints, a set of inequality constraints and a function to apply to each split of the data. The computational strategy is the following:
1. For each constraining variable, calculate the unique, sorted values taken on by that variable.
2. Find the Cartesian product of the unique, sorted values of all of these variables.
3. Iterate over the elements of this Cartesian product.
4. Compute the currently active constraints based on the current element of the Cartesian product.
5. Find the subset of the data that satisfies these active constraints.
6. Apply the specified function to this split-out subset of the data.
7. Add a row to an accumulated data frame of results containing the value of the function as applied to this subset.
8. Return the results.
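The strategy above can be sketched in a few lines of base R. To be clear, this is not the package's actual code: the function name (cum_split_apply) and argument names are mine, and for simplicity it handles a single inequality-constrained variable alongside ordinary equality-constrained variables.

```r
# A sketch of the iterated-inequality split-apply-combine strategy.
# Illustrative names only: cum_split_apply, eq.vars, ineq.var are not
# the cumddply package's API.
cum_split_apply <- function(data, eq.vars, ineq.var, fun) {
  # Unique, sorted values of each constraining variable, crossed into
  # a Cartesian product of constraint settings.
  levels <- lapply(data[c(eq.vars, ineq.var)], function(v) sort(unique(v)))
  grid <- expand.grid(levels)

  results <- data.frame()
  # Iterate over the elements of the Cartesian product.
  for (i in seq_len(nrow(grid))) {
    # Active constraints: equality on eq.vars, a strict upper bound on
    # ineq.var ("values below some threshold"); find the matching rows.
    keep <- rep(TRUE, nrow(data))
    for (v in eq.vars) keep <- keep & data[[v]] == grid[i, v]
    keep <- keep & data[[ineq.var]] < grid[i, ineq.var]
    subset.data <- data[keep, , drop = FALSE]

    # Apply the function and accumulate one row of results, indexed by
    # the current constraint settings.
    results <- rbind(results,
                     cbind(grid[i, , drop = FALSE],
                           Value = fun(subset.data)))
  }
  results
}
```

With the running example, something like `cum_split_apply(d, c("Subject", "Block"), "Trial", function(s) if (nrow(s) == 0) NA else mean(s$RT))` would yield the per-trial mean of all strictly earlier trials within each subject-by-block cell; note that the first trial's split is empty, so the supplied function should decide what to return there.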
There's an obvious extension in which the inequalities operate in the opposite direction, but I don't currently see any value in that (though I'm happy to be convinced otherwise). Also, there's another trivial extension of this logic which doesn't use the iterated constraints as upper bounds (or lower bounds), but as the center of a normed ball of specified radius. This, for instance, would allow one to compute kernelized local means on many splits of the data in a single function call.
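That last variant is easy to picture: instead of subsetting on x below a threshold, subset on |x - center| <= radius. A throwaway sketch (names mine, one constraining variable, box kernel) might be:

```r
# Sketch of the normed-ball variant: for each unique value ctr of x,
# average y over the points whose x lies within radius r of ctr.
# This is a box-kernel local mean; a proper kernelized version would
# weight each point by its distance from the center instead.
local_means <- function(x, y, r) {
  centers <- sort(unique(x))
  vapply(centers, function(ctr) mean(y[abs(x - ctr) <= r]), numeric(1))
}
```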