title: Introduction to accumulate
author: Mark P.J. van der Loo
css: "style.css"
Package version `packageVersion("accumulate")`{.R}. Use `citation('accumulate')` to cite the package.
`Accumulate` is a package for grouped aggregation, where the groups can be
dynamically collapsed into larger groups. When and how this collapsing takes
place is user-defined.
The latest CRAN release can be installed as follows.

    install.packages("accumulate")

Next, the package can be loaded. You can use `packageVersion` (from base R) to
check which version you have installed.
```{#load_package .R}
library(accumulate)
packageVersion("accumulate")
```
## A first example
We will use a built-in dataset as example.
```{#loading_data .R}
data(producers)
head(producers)
```
This synthetic dataset contains information on various sources of turnover from
producers that are labeled with an economic activity classification (`sbi`)
and a `size` class (0-9).

We wish to find a group mean by `sbi x size`. However, we demand that the group has
at least five records; otherwise we combine the size classes of a single `sbi`
group. This can be done as follows.
```{#first_example .R}
a <- accumulate(producers
              , collapse = sbi*size ~ sbi
              , test = min_records(5)
              , fun = mean, na.rm=TRUE)
head(round(a))
```
The `accumulate` function does the following:
- For each combination of `sbi` and `size` occurring in the data, it checks whether
`test` is satisfied. Here, it tests whether there are at least five records.
- If the test is satisfied, the mean is computed for each non-grouping variable
in the data. The output column `level` is set to 0 (no collapsing took place).
- If the test is _not_ satisfied, it will only use `sbi` as grouping variable
for the current combination of `sbi` and `size`. Then, if there are enough
records, the mean is computed for each variable and the output variable `level`
is set to 1 (first level of collapsing has been used).
- If the test is still not satisfied, no computation is possible
and all outputs are `NA` for the current `sbi` and `size` combination.
Explicitly, for this example we see that for `(sbi,size)==(2752,5)` no
satisfactory group of records was found under the current collapsing scheme.
Therefore the `level` variable equals `NA` and all aggregated variables are
missing as well. For `(sbi,size)==(2840,7)` there are sufficient records, and
since `level=0` no collapsing was necessary. For the group
`(sbi,size)==(3410,8)` there were not enough records to compute a mean, but
taking all records in `sbi==3410` gave enough records. This is signified by
`level=1`, meaning that one collapsing step has taken place (from `sbi x size`
to `sbi`).
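
The steps above can be sketched in base R for a single target group. This is a minimal illustration on made-up data, not the package's implementation; the function `collapse_mean` and the toy data frame `d` are hypothetical names introduced only for this sketch.

```r
# Toy data (made up): two sbi groups with a few size classes.
d <- data.frame(
  sbi  = c("A", "A", "A", "A", "A", "A", "B", "B"),
  size = c(1, 1, 1, 1, 1, 2, 1, 2),
  x    = c(1, 2, 3, 4, 5, 6, 7, 8)
)

# Hypothetical helper: aggregate x for one target group (s, z),
# collapsing from sbi x size to sbi when fewer than n records are found.
collapse_mean <- function(d, s, z, n = 5) {
  candidates <- list(
    d[d$sbi == s & d$size == z, , drop = FALSE],  # level 0: sbi x size
    d[d$sbi == s, , drop = FALSE]                 # level 1: sbi only
  )
  for (i in seq_along(candidates)) {
    if (nrow(candidates[[i]]) >= n) {
      return(c(level = i - 1, x = mean(candidates[[i]]$x, na.rm = TRUE)))
    }
  }
  c(level = NA, x = NA)  # no level satisfied the test
}

collapse_mean(d, "A", 1)  # five records at level 0
collapse_mean(d, "A", 2)  # one record, so all of sbi == "A" is used
collapse_mean(d, "B", 1)  # too few records at every level
```

The real function performs this search for every target group and every non-grouping variable at once.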
Let us see how we specified this call to `accumulate`:
- The first argument is the data to be aggregated.
- The second argument is a formula of the form `target groups ~ collapsing scheme`.
The output is always at the level of the target groups. The collapsing scheme determines
which records are used to compute a value for the target groups if the `test` is not
satisfied.
- The third argument, called `test`, is a function that should accept any subset of
records of `producers` and return `TRUE` or `FALSE`. In this case we used the convenience
function `min_records(5)` provided by `accumulate`. Calling `min_records()` creates
the testing function for us, so we can pass its result directly as `test`.
- Finally, the argument `fun` is the aggregation function that will be applied to each
group.
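
To make the idea of a function that creates a testing function concrete, here is a small base R sketch of the pattern. The name `at_least` is hypothetical and only stands in for the kind of helper described above; it is not part of the package.

```r
# Hypothetical factory: returns a test function that accepts a data frame
# and returns TRUE or FALSE.
at_least <- function(n) {
  force(n)                    # fix n at creation time
  function(d) nrow(d) >= n
}

test <- at_least(5)
test(head(iris, 3))  # FALSE: only three records
test(iris)           # TRUE: 150 records
```

Any function with this contract (one data frame in, a single `TRUE` or `FALSE` out) can serve as a test.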
Observe that the `accumulate` function is similar to R's built-in `aggregate` function (this is
by design). There is a second function called `cumulate` that has an interface that
is similar to `dplyr::summarise`.
```{#cumulate_formula .R}
a <- cumulate(producers, collapse = sbi*size ~ sbi
            , test = function(d) nrow(d) >= 5
            , mu_industrial = mean(industrial, na.rm=TRUE)
            , sd_industrial = sd(industrial, na.rm=TRUE))
head(round(a))
```
Notice that here, we wrote our own test function.

- [...] `(sbi, size)` could not be computed, even when collapsing to `sbi`? (You need to run the code and investigate the output.)
- [...] `?mean` on how to compute trimmed means.

A collapsing scheme can be defined in a data frame or with a formula of the form

    target grouping ~ collapse1 + collapse2 + ... + collapseN

Here, the `target grouping` is a variable or product of variables. Each
`collapse` term is also a variable or product of variables. Each subsequent
term defines the next collapsing step. Let us show the idea with a
more involved example.
The `sbi` variable in the `producers` dataset encodes a hierarchical classification
where longer digit sequences indicate a higher level of detail. Hence we can collapse
to lower levels of detail by deleting digits at the end. Let us enrich the
`producers` dataset with extra grouping levels.
```{#derive_sbi_levels .R}
producers$sbi3 <- substr(producers$sbi,1,3)
producers$sbi2 <- substr(producers$sbi,1,2)
head(producers,3)
```
We can now use a more involved collapsing scheme as follows.
```{#accumulate_formula .R}
a <- accumulate(producers, collapse = sbi*size ~ sbi + sbi3 + sbi2
              , test = min_records(5), fun = mean, na.rm=TRUE)
head(round(a))
```
For `(sbi,size) == (2752,5)` we have 2 levels of collapsing. In other
words, for that aggregate, all records in `sbi3 == 275` were used.
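
To see how the collapsing formula maps to levels, it can help to take it apart with base R tools. The sketch below only illustrates how to read the formula and makes no claim about how the package parses it internally. Note that `terms()` would expand product terms on the right-hand side, so this simple approach assumes single-variable collapse terms.

```r
f <- sbi*size ~ sbi + sbi3 + sbi2

# Variables that define the target grouping (left-hand side).
target <- all.vars(f[[2]])

# Collapse terms in order (right-hand side): level 1, 2, 3.
steps <- attr(terms(f), "term.labels")

target  # "sbi" "size"
steps   # "sbi" "sbi3" "sbi2"
```

Level 0 corresponds to the target grouping itself, and level `k` to the `k`-th collapse term.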
- [...] `trade` and `total` using the `cumulate` function under the same collapsing scheme as defined above.
- [...] `(sbi,size)` have been collapsed to level 0, 1, 2, or 3. Tabulate them.
- [...] `sbi` code and compute the means of all variables.

Collapsing schemes can be represented in data frames that have the form

    [target group, parent of target group, parent of parent of target group,...].

The package comes with a helper function that creates such a scheme from hierarchical classifications that are encoded as digits.
For the `sbi` example we can do the following to derive a collapsing scheme.
```{#dataframe_construction .R}
sbi <- unique(producers$sbi)
csh <- csh_from_digits(sbi)
names(csh)[1] <- "sbi"
head(csh)
```
Here, the column `sbi` denotes the original (maximally) 5-digit codes,
`A1` the 4-digit codes, and so on. It is important that the name of
the first column matches a column in the data to be aggregated.
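
To illustrate the shape of such a data frame, the following base R sketch builds one by hand from a few made-up codes. The helper `scheme_by_hand` is hypothetical and its column naming is an assumption for illustration; the package's own helper may differ in details such as the number of levels or how shorter codes are treated.

```r
# Hypothetical helper: each extra column drops one more trailing digit.
scheme_by_hand <- function(x, target = "sbi") {
  x <- as.character(x)
  n <- max(nchar(x))
  out <- data.frame(x, stringsAsFactors = FALSE)
  names(out) <- target
  for (k in seq_len(n - 1)) {
    out[[paste0("A", k)]] <- substr(x, 1, n - k)
  }
  out
}

s <- scheme_by_hand(c("27521", "27522", "28400"))
s
```

Each row reads from left to right as a target group followed by its successive parents.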
Both `cumulate` and `accumulate` accept such a data frame as an argument.
Here is an example with `cumulate`.
```{#dataframe_cumulate .R}
a <- cumulate(producers, collapse = csh, test = function(d) nrow(d) >= 5
            , mu_total = mean(total, na.rm=TRUE)
            , sd_total = sd(total, na.rm=TRUE))
head(a)
```
In this representation it is not possible to use multiple grouping variables, unless you combine them into a single one, for example by pasting them together.
The advantage of this representation is that it allows users to externally define a (manually edited) collapsing scheme.
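
As a rough sketch of the pasting approach, two grouping variables can be combined into a single key whose parent is the coarser group. The data and the column names `sbi_size` and `A1` below are made up for illustration; whether this exact layout suits your data is an assumption you should check.

```r
# Made-up records with two grouping variables.
d <- data.frame(sbi = c("2752", "2752", "2840"), size = c(5, 6, 7))

# Combine both variables into one key.
d$sbi_size <- paste(d$sbi, d$size, sep = "_")

# Scheme: first column is the combined target group, second its parent.
scheme <- unique(data.frame(sbi_size = d$sbi_size, A1 = d$sbi))
scheme
```

The first column name matches the new key column in the data, as required for a data frame scheme.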
- [...] `csh` to compute the median of all numerical variables of the `producers` dataset with `accumulate` (hint: you need to remove the `size` variable).

There are several options to define tests on groups of records:

- [...] `min_records()`, `min_complete()`, or `frac_complete()`.
- [...] `from_validator()` function.

Let us look at a small example for each case. For comparison we will always test that there are a minimum of five records.
```{#helpers .R}
data(producers)
# Built-in helper function
a <- accumulate(producers, collapse = sbi*size ~ sbi
              , test = min_records(5), fun = mean, na.rm=TRUE)
# Rules defined with the validate package
rules <- validate::validator(nrow(.) >= 5)
a <- accumulate(producers, collapse = sbi*size ~ sbi
              , test = from_validator(rules), fun = mean, na.rm=TRUE)
# Custom test function
a <- accumulate(producers, collapse = sbi*size ~ sbi
              , test = function(d) nrow(d) >= 5, fun = mean, na.rm=TRUE)
```
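
Tests on completeness can also be written by hand. The base R sketch below shows one way to require a minimum number, or a minimum fraction, of complete records on selected variables. The names `at_least_complete` and `fraction_complete` are hypothetical stand-ins written for this example, and the package helpers may treat edge cases differently.

```r
# Hypothetical test factories based on complete.cases().
at_least_complete <- function(n, vars) {
  force(n); force(vars)
  function(d) sum(complete.cases(d[, vars, drop = FALSE])) >= n
}

fraction_complete <- function(r, vars) {
  force(r); force(vars)
  function(d) {
    nrow(d) > 0 && mean(complete.cases(d[, vars, drop = FALSE])) >= r
  }
}

d <- data.frame(x = c(1, NA, 3, 4), y = c(1, 2, NA, 4))
at_least_complete(2, c("x", "y"))(d)  # TRUE: rows 1 and 4 are complete
at_least_complete(3, c("x", "y"))(d)  # FALSE
fraction_complete(0.5, "x")(d)        # TRUE: 3 of 4 values present
```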
## Complex aggregates
An aggregate may be something more complex than a scalar. The `accumulate`
package also supports complex aggregates such as linear models.
```{#complex .R}
a <- cumulate(producers, collapse = sbi*size ~ sbi
            , test = min_complete(5, c("other_income","trade"))
            , model = lm(other_income ~ trade)
            , mean_other = mean(other_income, na.rm=TRUE))
head(a)
```
Here, we demand that there are at least five records with complete values for
`other_income` and `trade` available for estimating the model.
The linear models are stored in a `list` of type `object_list`. Subsets or individual
elements can be accessed as usual with data frames.
```{#objlist .R}
a$model[[1]]
a$model[[2]]
```
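
The general idea of keeping model objects in a data frame column can be tried out in base R with an ordinary list column. The sketch below uses the built-in `mtcars` data and a plain `list`, not the package's `object_list` class, so printing and other behavior may differ.

```r
# Fit one linear model per cylinder class.
fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))

# Store the models next to their group label in a list column.
out <- data.frame(cyl = names(fits))
out$model <- fits

class(out$model[[1]])  # "lm"
coef(out$model[[1]])   # coefficients of the first group's model
```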
### Smoke-testing your test function
If you write your own test function from scratch, it is easy to overlook some
edge cases like the occurrence of missing data, a column that is completely
`NA`, or receiving zero records. The function `smoke_test()` accepts a data set
and a test function and runs the test function on several common edge cases
based on the dataset. It does _not_ check whether the test function works as
expected, but it checks that the output is `TRUE` or `FALSE` in all cases and
reports errors, warnings and messages if they occur.
As an example we construct a test function that checks whether one
of the variables has sufficient non-zero values.
```{#smoketest1 .R}
my_test <- function(d) sum(other != 0) > 3
smoke_test(producers, my_test)
```
Oops, we forgot to refer to the data set. Let's try it again.
```{#smoketest2 .R}
my_test <- function(d) sum(d$other != 0) > 3
smoke_test(producers, my_test)
```
Our function is not robust against occurrence of `NA`. Here's a third attempt.
```{#smoketest3 .R}
my_test <- function(d) sum(d$other != 0,na.rm=TRUE) > 3
smoke_test(producers, my_test)
```
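
The kind of check that a smoke test performs can be approximated in a few lines of base R. The function `mini_smoke` below is a hypothetical, much reduced sketch written for this example: it tries only three cases and reports whether the test returned a single `TRUE` or `FALSE` without errors or warnings. It is not how `smoke_test()` is implemented.

```r
# Hypothetical mini smoke test on three edge cases.
mini_smoke <- function(d, test) {
  na_col <- d
  na_col[[1]] <- NA                      # first column completely missing
  cases <- list(
    full         = d,
    zero_rows    = d[0, , drop = FALSE],
    first_col_na = na_col
  )
  vapply(cases, function(x) {
    res <- tryCatch(test(x),
                    error   = function(e) NULL,
                    warning = function(w) NULL)
    isTRUE(res) || isFALSE(res)          # valid only if exactly TRUE or FALSE
  }, logical(1))
}

d <- data.frame(other = c(0, 1, 2, 3, 4), total = 1:5)
r1 <- mini_smoke(d, function(d) sum(d$other != 0) > 3)
r2 <- mini_smoke(d, function(d) sum(d$other != 0, na.rm = TRUE) > 3)
r1  # the all-NA case yields NA, so it is flagged
r2  # all cases return TRUE or FALSE
```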
- [...] `sbi*size ~ sbi1 + sbi2` as collapsing scheme. Make sure there are at least 10 records in each group.
- [...] `industrial` and `total`, but demand that there are not more than 20% zeros in `other`. Use `csh` as collapsing scheme.