---
title: Introduction to accumulate
author: Mark P.J. van der Loo
css: "style.css"
---

Package version `packageVersion("accumulate")`{.R}.

Use `citation('accumulate')` to cite the package.

## Introduction

`accumulate` is a package for grouped aggregation, where the groups can be dynamically collapsed into larger groups. When and how this collapsing takes place is user-defined.

## Installation

The latest CRAN release can be installed as follows.

```{.R}
install.packages("accumulate")
```

Next, the package can be loaded. You can use `packageVersion` (from base R) to check which version you have installed.
```{#load_package .R}
library(accumulate)

# check the package version
packageVersion("accumulate")
```


## A first example

We will use a built-in dataset as example. 
```{#loading_data .R}
data(producers)
head(producers)
```

This synthetic dataset contains information on various sources of turnover from producers that are labeled with an economic activity classification (`sbi`) and a size class (`size`, 0-9).

We wish to find a group mean by `sbi x size`. However, we demand that each group has at least five records; otherwise we combine the size classes of a single `sbi` group. This can be done as follows.
```{#first_example .R}
a <- accumulate(producers
              , collapse = sbi*size ~ sbi
              , test = min_records(5)
              , fun = mean, na.rm=TRUE)
head(round(a))
```

The `accumulate` function does the following:

- For each combination of `sbi` and `size` occurring in the data, it checks whether
  `test` is satisfied. Here, it tests whether there are at least five records.
    - If the test is satisfied, the mean is computed for each non-grouping variable
      in the data. The output column `level` is set to 0 (no collapsing took place).
    - If the test is _not_ satisfied, it will only use `sbi` as grouping variable
      for the current combination of `sbi` and `size`. Then, if there are enough
      records, the mean is computed for each variable and the output variable `level`
      is set to 1 (first level of collapsing has been used). 
    - If the test is still not satisfied, no computation is possible
      and all outputs are `NA` for the current `sbi` and `size` combination.

Explicitly, for this example we see that for `(sbi,size)==(2752,5)` no
satisfactory group of records was found under the current collapsing scheme.
Therefore the `level` variable equals `NA` and all aggregated variables are
missing as well.  For `(sbi,size)==(2840,7)` there are sufficient records, and
since `level=0` no collapsing was necessary. For the group
`(sbi,size)==(3410,8)` there were not enough records to compute a mean, but
taking all records in `sbi==3410` gave enough records. This is signified by
`level=1`, meaning that one collapsing step has taken place (from `sbi x size`
to `sbi`).
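
The fallback logic described above can be sketched in base R for a single target group. This is only an illustration under stated assumptions: the data frame `toy`, its values, and the function `collapse_mean()` are hypothetical, and the sketch is not how the package is implemented.

```r
# Toy data with hypothetical values; not the 'producers' dataset.
toy <- data.frame(
  sbi   = c("A", "A", "A", "A", "A", "A", "B", "B"),
  size  = c(  1,   1,   1,   1,   1,   2,   1,   2),
  value = c( 10,  12,  14,  16,  18,  40,   5,   7)
)

# Aggregate one target group (s, z): try sbi x size first, then sbi only.
# collapse_mean() is a hypothetical helper written for this sketch.
collapse_mean <- function(d, s, z, n_min = 5) {
  candidates <- list(
    d[d$sbi == s & d$size == z, ],  # level 0: sbi x size
    d[d$sbi == s, ]                 # level 1: sbi
  )
  for (i in seq_along(candidates)) {
    if (nrow(candidates[[i]]) >= n_min) {
      return(data.frame(sbi = s, size = z, level = i - 1,
                        value = mean(candidates[[i]]$value, na.rm = TRUE)))
    }
  }
  # No candidate passed the test: level and aggregate stay missing.
  data.frame(sbi = s, size = z, level = NA_real_, value = NA_real_)
}

r0 <- collapse_mean(toy, "A", 1)  # five records in (A, 1): level 0
r1 <- collapse_mean(toy, "A", 2)  # one record in (A, 2), six in A: level 1
r2 <- collapse_mean(toy, "B", 1)  # only two records in B: no result
rbind(r0, r1, r2)
```

The three calls correspond to the three situations in the text: no collapsing, one collapsing step, and no satisfactory group at all.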


Let us see how we specified this call to `accumulate`:

- The first argument is the data to be aggregated.
- The second argument is a formula of the form `target groups ~ collapsing scheme`.
  The output is always at the level of the target groups. The collapsing scheme determines
  which records are used to compute a value for the target groups if the `test` is not
  satisfied.
- The third argument, called `test`, is a function that should accept any subset of
  records of `producers` and return `TRUE` or `FALSE`. In this case we used the convenience
  function `min_records(5)` provided by `accumulate`. The function `min_records()` creates
  a testing function that we can pass as the `test` argument.
- Finally, the argument `fun` is the aggregation function that will be applied to each
  group. 
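
The contract for `test` (take a data frame of records, return a single `TRUE` or `FALSE`) can be illustrated with a small function factory in base R. The name `at_least()` is hypothetical; it only mimics the idea behind `min_records()` and is not the package's code.

```r
# Hypothetical factory: returns a test function that checks the record count.
at_least <- function(n) {
  function(d) nrow(d) >= n
}

test5 <- at_least(5)

# 'mtcars' is a built-in dataset with 32 rows, used here only as example input.
few  <- test5(head(mtcars, 3))  # 3 records: test fails
many <- test5(mtcars)           # 32 records: test passes
c(few = few, many = many)
```

Writing the test as a closure keeps the threshold out of the function's argument list, so the result can be passed anywhere a one-argument test function is expected.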

Observe that the `accumulate` function is similar to R's built-in `aggregate` function (this is
by design). There is a second function called `cumulate` that has an interface that
is similar to `dplyr::summarise`.

```{#cumulate_formula .R}
a <- cumulate(producers, collapse = sbi*size ~ sbi
      , test = function(d) nrow(d) >= 5
      , mu_industrial = mean(industrial, na.rm=TRUE)
      , sd_industrial = sd(industrial, na.rm=TRUE))

head(round(a))
```

Notice that here, we wrote our own test function.

### Exercises

  1. How many combinations of `(sbi, size)` could not be computed, even when collapsing to `sbi`? (You need to run the code and investigate the output.)
  2. Compute the trimmed mean of all numeric variables, trimming 5% from each side of the distribution. See `?mean` on how to compute trimmed means.

## The formula interface for specifying collapsing schemes

A collapsing scheme can be defined in a data frame or with a formula of the form

`target grouping ~ collapse1 + collapse2 + ... + collapseN`

Here, the target grouping is a variable or product of variables. Each collapse term is also a variable or product of variables. Each subsequent term defines the next collapsing step. Let us show the idea with a more involved example.

The `sbi` variable in the `producers` dataset encodes a hierarchical classification where longer digit sequences indicate a higher level of detail. Hence we can collapse to lower levels of detail by deleting digits at the end. Let us enrich the `producers` dataset with extra grouping levels.

```{#derive_sbi_levels .R}
producers$sbi3 <- substr(producers$sbi,1,3)
producers$sbi2 <- substr(producers$sbi,1,2)
head(producers,3)
```


We can now use a more involved collapsing scheme as follows.
```{#accumulate_formula .R}
a <- accumulate(producers, collapse = sbi*size ~ sbi + sbi3 + sbi2
               , test = min_records(5), fun = mean, na.rm=TRUE)
head(round(a))
```

For `(sbi,size) == (2752,5)` we have 2 levels of collapsing. In other words, for that aggregate, all records in `sbi3 == 275` were used.
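
To see how each term of the formula widens the set of records, the steps can be counted by hand for one target group. The sketch below uses base R only; the data frame `toy`, its codes, and the threshold of three records are assumptions chosen to keep the example small.

```r
# Toy records with hypothetical codes; not the 'producers' dataset.
toy <- data.frame(
  sbi  = c("27520", "27520", "27510", "27400", "27400", "27400"),
  size = c(5, 6, 5, 5, 7, 8)
)
toy$sbi3 <- substr(toy$sbi, 1, 3)
toy$sbi2 <- substr(toy$sbi, 1, 2)

# Records available at each step for the target (sbi, size) = ("27520", 5),
# following the order of the terms in sbi*size ~ sbi + sbi3 + sbi2.
n_at_level <- c(
  level0 = sum(toy$sbi == "27520" & toy$size == 5),  # sbi x size
  level1 = sum(toy$sbi == "27520"),                  # sbi
  level2 = sum(toy$sbi3 == "275"),                   # sbi3
  level3 = sum(toy$sbi2 == "27")                     # sbi2
)
n_at_level

# First level that has at least 3 records (threshold lowered for the toy data).
first_ok <- unname(which(n_at_level >= 3)[1]) - 1
first_ok
```

Each later term can only add records for this target, because the codes are nested: every record with a given `sbi` also shares its `sbi3` and `sbi2` prefix.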

### Exercises

  1. Compute the standard deviation for `trade` and `total` using the `cumulate` function under the same collapsing scheme as defined above.
  2. What is the maximum collapsing level in the collapsing scheme above?
  3. Find out how many combinations of `(sbi,size)` have been collapsed to level 0, 1, 2, or 3. Tabulate them.
  4. Define a collapsing scheme that ends with a single-digit `sbi` code and compute the means of all variables.

## The data frame interface for defining collapsing schemes

Collapsing schemes can be represented in data frames that have the form

[target group, parent of target group, parent of parent of target group,...].

The package comes with a helper function that creates such a scheme from hierarchical classifications that are encoded as digits.

For the `sbi` example we can do the following to derive a collapsing scheme.
```{#dataframe_construction .R}
sbi <- unique(producers$sbi)
csh <- csh_from_digits(sbi)
names(csh)[1] <- "sbi"
head(csh)
```

Here, the column `sbi` denotes the original codes of at most 5 digits,
`A1` the 4-digit codes, and so on. It is important that the name of
the first column matches a column in the data to be aggregated.
Both `cumulate` and `accumulate` accept such a data frame as an argument.
Here is an example with `cumulate`.

```{#dataframe_cumulate .R}
a <- cumulate(producers, collapse = csh, test = function(d) nrow(d) >= 5
       , mu_total = mean(total, na.rm=TRUE)
       , sd_total = sd(total, na.rm=TRUE))
head(a)
```

In this representation it is not possible to use multiple grouping variables, unless you combine multiple grouping variables into a single one, for example by pasting them together.
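
Pasting grouping variables together could look as follows in base R. The data frame `toy`, the column name `sbi_size`, and the separator are assumptions made for this sketch; whether the resulting scheme suits a particular analysis should be checked against the package documentation.

```r
# Toy data with hypothetical codes; not the 'producers' dataset.
toy <- data.frame(
  sbi  = c("101", "101", "202"),
  size = c(1, 2, 1)
)

# Combine the two grouping variables into a single key.
toy$sbi_size <- paste(toy$sbi, toy$size, sep = "_")

# One row per target group: the combined key first, then its parent (sbi).
scheme <- unique(data.frame(sbi_size = toy$sbi_size, sbi = toy$sbi))
scheme
```

The first column carries the name of the combined variable, in line with the requirement that it matches a column in the data to be aggregated.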

The advantage of this representation is that it allows users to externally define a (manually edited) collapsing scheme.
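
As an illustration of an externally defined scheme, the sketch below reads a small hand-written table from text, as one might export it from a spreadsheet. The codes and the column names `parent1` and `parent2` are hypothetical. That such a table can be passed as `collapse`, provided its first column name matches a column in the data, is an assumption based on the description above.

```r
# A manually edited collapsing scheme, kept as CSV text for this example.
csv_text <- "sbi,parent1,parent2
10110,1011,101
10120,1012,101
20100,2010,201"

# Read all columns as character so that codes are not converted to numbers.
my_scheme <- read.csv(text = csv_text, colClasses = "character")
my_scheme

# Sanity check: each parent code should be a prefix of its child code.
prefix_ok <- startsWith(my_scheme$sbi, my_scheme$parent1) &
             startsWith(my_scheme$parent1, my_scheme$parent2)
all(prefix_ok)
```

A manual scheme does not have to follow the digit structure at all; the prefix check is only meaningful when the hierarchy is encoded in the codes.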

### Exercises

  1. Use `csh` to compute the median of all numerical variables of the `producers` dataset with `accumulate` (hint: you need to remove the `size` variable).

## Convenience functions to define tests

There are several options to define tests on groups of records:

  1. Use one of the built-in functions to specify common test conditions: `min_records()`, `min_complete()`, or `frac_complete()`.
  2. Use a ruleset defined with the `validate` package, with the `from_validator()` function.
  3. Write your own custom test function.

Let us look at a small example for each case. For comparison we will always test that there are a minimum of five records.

```{#helpers .R}
# load the data again to lose columns 'sbi2' and 'sbi3' and work
# with the original data.
data(producers)

# 1. using a helper function
a <- accumulate(producers, collapse = sbi*size ~ sbi
              , test = min_records(5)
              , fun = mean, na.rm=TRUE)

# 2. using a 'validator' object
rules <- validate::validator(nrow(.) >= 5)
a <- accumulate(producers, collapse = sbi*size ~ sbi
              , test = from_validator(rules)
              , fun = mean, na.rm=TRUE)

# 3. using a custom function
a <- accumulate(producers, collapse=sbi*size ~ sbi
              , test = function(d) nrow(d) >= 5
              , fun = mean, na.rm=TRUE)
```
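
The three variants above all count records. A test on completeness follows the same pattern. The base R sketch below assumes that such a test counts the records without missing values in a set of variables; `my_min_complete()` is a hypothetical stand-in written for illustration, not the package's `min_complete()`.

```r
# Hypothetical factory: require at least n records that are complete in 'vars'.
my_min_complete <- function(n, vars) {
  function(d) sum(complete.cases(d[, vars, drop = FALSE])) >= n
}

# Toy data with hypothetical values.
d <- data.frame(
  other_income = c(1, NA, 3, 4, 5, 6),
  trade        = c(2,  2, NA, 4, 5, 6)
)

test <- my_min_complete(4, c("other_income", "trade"))
all_rows   <- test(d)          # rows 1, 4, 5, 6 are complete: four records
first_four <- test(d[1:4, ])   # only rows 1 and 4 are complete: two records
c(all_rows = all_rows, first_four = first_four)
```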


## Complex aggregates

An aggregate may be something more complex than a scalar. The `accumulate`
package also supports complex aggregates such as linear models.

```{#complex .R}
a <- cumulate(producers, collapse = sbi*size ~ sbi
                       , test = min_complete(5, c("other_income","trade"))
                       , model = lm(other_income ~ trade)
                       , mean_other = mean(other_income, na.rm=TRUE))

head(a)
```

Here, we demand that there are at least five records available for estimating the model.

The linear models are stored in a list of type `object_list`. Subsets or individual elements can be accessed as usual with data frames.
```{#objlist .R}
a$model[[1]]
a$model[[2]]
```
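
Working with a column of model objects can be tried out without the package. The sketch below builds a comparable list column by hand from the built-in `mtcars` data and extracts one coefficient per model. The data, the grouping by `cyl`, and the names `out` and `slopes` are assumptions made for this illustration; it assumes that a column of type `object_list` can be indexed like an ordinary list, as the text suggests.

```r
# Fit one linear model per group of a built-in dataset.
fits <- lapply(split(mtcars, mtcars$cyl), function(d) lm(mpg ~ wt, data = d))

# Store the models in a list column next to the group labels.
out <- data.frame(cyl = names(fits))
out$model <- fits

# Access a single model, as with a$model[[1]] in the text.
coef(out$model[[1]])

# Extract the slope of 'wt' from every model in the column.
slopes <- sapply(out$model, function(m) coef(m)[["wt"]])
slopes
```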






### Smoke-testing your test function

If you write your own test function from scratch, it is easy to overlook some
edge cases like the occurrence of missing data, a column that is completely
`NA`, or receiving zero records. The function `smoke_test()` accepts a data set
and a test function and runs the test function on several common edge cases
based on the dataset. It does _not_ check whether the test function works as
expected, but it checks that the output is `TRUE` or `FALSE` in all cases and
reports errors, warnings and messages if they occur.
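
To get a feel for such edge cases, a few of them can be constructed by hand in base R. This sketch is not how `smoke_test()` works internally; the column `x`, the two test functions, and the helper `is_ok()` are hypothetical.

```r
# Toy data with one missing value.
d <- data.frame(x = c(0, 3, NA, 7, 1))

# Edge cases derived from the data: as is, zero records, a column that is all NA.
edge_cases <- list(
  full      = d,
  zero_rows = d[0, , drop = FALSE],
  all_na    = transform(d, x = NA_real_)
)

# A fragile test: returns NA when the mean cannot be computed.
fragile_test <- function(d) mean(d$x) > 1

# A more defensive version: anything other than a clear TRUE counts as FALSE.
robust_test <- function(d) isTRUE(mean(d$x, na.rm = TRUE) > 1)

# A test function behaves well if it returns exactly TRUE or FALSE.
is_ok <- function(f, d) isTRUE(f(d)) || isFALSE(f(d))

ok_fragile <- sapply(edge_cases, is_ok, f = fragile_test)
ok_robust  <- sapply(edge_cases, is_ok, f = robust_test)
rbind(ok_fragile, ok_robust)
```

The fragile version yields `NA` in all three cases, which is neither `TRUE` nor `FALSE`, while the defensive version always returns a usable answer.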


As an example we construct a test function that checks whether one
of the variables has sufficient non-zero values.
```{#smoketest1 .R}
my_test <- function(d) sum(other != 0) > 3
smoke_test(producers, my_test)
```

Oops, we forgot to refer to the data set. Let's try it again.
```{#smoketest2 .R}
my_test <- function(d) sum(d$other != 0) > 3
smoke_test(producers, my_test)
```

Our function is not robust against occurrence of `NA`. Here's a third attempt.
```{#smoketest3 .R}
my_test <- function(d) sum(d$other != 0,na.rm=TRUE) > 3
smoke_test(producers, my_test)
```

### Exercises

  1. Compute the mean of all variables using `sbi*size ~ sbi1 + sbi2` as collapsing scheme. Make sure there are at least 10 records in each group.
  2. Compute the mean of the ratio between `industrial` and `total`, but demand that there are not more than 20% zeros in `other`. Use `csh` as collapsing scheme.



accumulate documentation built on April 4, 2025, 2:13 a.m.