collap: Advanced Data Aggregation
In SebKrantz/collapse: Advanced and Fast Data Transformation

collap

R Documentation

Advanced Data Aggregation

Description

collap is a fast and versatile multi-purpose data aggregation command.

It performs simple and weighted aggregations, multi-type aggregations automatically applying different functions to numeric and categorical columns, multi-function aggregations applying multiple functions to each column, and fully custom aggregations where the user passes a list mapping functions to columns.

Usage

# Main function: allows formula and data input to `by` and `w` arguments
collap(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
       custom = NULL, ..., keep.by = TRUE, keep.w = TRUE, keep.col.order = TRUE,
       sort = .op[["sort"]], decreasing = FALSE, na.last = TRUE, return.order = sort,
       method = "auto", parallel = FALSE, mc.cores = 2L,
       return = c("wide","list","long","long_dupl"), give.names = "auto")

# Programmer function: allows column names and indices input to `by` and `w` arguments
collapv(X, by, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
        custom = NULL, ..., keep.by = TRUE, keep.w = TRUE, keep.col.order = TRUE,
        sort = .op[["sort"]], decreasing = FALSE, na.last = TRUE, return.order = sort,
        method = "auto", parallel = FALSE, mc.cores = 2L,
        return = c("wide","list","long","long_dupl"), give.names = "auto")

# Auxiliary function: for grouped data ('grouped_df') input + non-standard evaluation
collapg(X, FUN = fmean, catFUN = fmode, cols = NULL, w = NULL, wFUN = fsum,
        custom = NULL, keep.group_vars = TRUE, ...)

Arguments

`X`	a data frame, or an object coercible to data frame using `qDF`.
`by`	for `collap`: a one-or two sided formula, i.e. `~ group1` or `var1 + var2 ~ group1 + group2`, or a atomic vector, list of vectors or `GRP` object used to group `X`. For `collapv`: names or indices of grouping columns, or a logical vector or selector function such as `is_categorical` selecting grouping columns.
`FUN`	a function, list of functions (i.e. `list(fsum, fmean, fsd)` or `list(sd = fsd, myfun1 = function(x)..)`), or a character vector of function names, which are automatically applied only to numeric variables.
`catFUN`	same as `FUN`, but applied only to categorical (non-numeric) typed columns (`is_categorical`).
`cols`	select columns to aggregate using a function, column names, indices or logical vector. Note: `cols` is ignored if a two-sided formula is passed to `by`.
`w`	weights. Can be passed as numeric vector or alternatively as formula i.e. `~ weightvar` in `collap` or column name / index etc. i.e. `"weightvar"` in `collapv`. `collapg` supports non-standard evaluations so `weightvar` can be indicated without quotes.
`wFUN`	same as `FUN`: Function(s) to aggregate weight variable if `keep.w = TRUE`. By default the sum of the weights is computed in each group.
`custom`	a named list specifying a fully customized aggregation task. The names of the list are function names and the content columns to aggregate using this function (same input as `cols`). For example `custom = list(fmean = 1:6, fsd = 7:9, fmode = 10:11)` tells `collap` to aggregate columns 1-6 of `X` using the mean, columns 7-9 using the standard deviation etc. Notes: `custom` lets `collap` ignore any inputs passed to `FUN`, `catFUN` or `cols`. Since v1.6.0 you can also rename columns e.g. `custom = list(fmean = c(newname = "col1", "col2"), fmode = c(newname = 3))`.
`keep.by`, `keep.group_vars`	logical. `FALSE` will omit grouping variables from the output. `TRUE` keeps the variables, even if passed externally in a list or vector (unlike other collapse functions).
`keep.w`	logical. `FALSE` will omit weight variable from the output i.e. no aggregation of the weights. `TRUE` aggregates and adds weights, even if passed externally as a vector (unlike other collapse functions).
`keep.col.order`	logical. Retain original column order post-aggregation.
`sort`, `decreasing`, `na.last`, `return.order`, `method`	logical / character. Arguments passed to `GRP.default` and affecting the row-order in the aggregated data frame and the grouping algorithm.
`parallel`	logical. Use `mclapply` instead of `lapply` to parallelize the computation at the column level. Not available for Windows.
`mc.cores`	integer. Argument to `mclapply` setting the number of cores to use, default is 2.
`return`	character. Control the output format when aggregating with multiple functions or performing custom aggregation. "wide" (default) returns a wider data frame with added columns for each additional function. "list" returns a list of data frames - one for each function. "long" adds a column "Function" and row-binds the results from different functions using `data.table::rbindlist`. "long_dupl" is a special option for aggregating multi-type data using multiple `FUN` but only one `catFUN` or vice-versa. In that case the format is long and data aggregated using only one function is duplicated. See Examples.
`give.names`	logical. Create unique names of aggregated columns by adding a prefix 'FUN.var'. `'auto'` will automatically create such prefixes whenever multiple functions are applied to a column.
`...`	additional arguments passed to all functions supplied to `FUN`, `catFUN`, `wFUN` or `custom`. Since v1.9.0 these are also split by groups for non-Fast Statistical Functions. The behavior of Fast Statistical Functions with unused arguments is regulated by `option("collapse_unused_arg_action")` and defaults to `"warning"`. `collapg` also allows other arguments to `collap` except for `sort, decreasing, na.last, return.order, method` and `keep.by`.

Details

collap automatically checks each function passed to it whether it is a Fast Statistical Function (i.e. whether the function name is contained in .FAST_STAT_FUN). If the function is a fast statistical function, collap only does the grouping and then calls the function to carry out the grouped computations (vectorized in C/C++), resulting in high aggregation speeds, even with weights. If the function is not one of .FAST_STAT_FUN, BY is called internally to perform the computation. The resulting computations from each function are put into a list and recombined to produce the desired output format as controlled by the return argument. This is substantially slower, particularly with many groups.

When setting parallel = TRUE on a non-windows computer, aggregations will efficiently be parallelized at the column level using mclapply utilizing mc.cores cores. Some Fast Statistical Function support multithreading i.e. have an nthreads argument that can be passed to collap. Using C-level multithreading is much more effective than R-level parallelism, and also works on Windows, but the two should never be combined.

When the w argument is used, the weights are passed to all functions except for wFUN. This may be undesirable in settings like collap(data, ~ id, custom = list(fsum = ..., fmean = ...), w = ~ weights) where we wish to aggregate some columns using the weighted mean, and others using a simple sum or another unweighted statistic. Therefore it is possible to append Fast Statistical Functions by _uw to yield an unweighted computation. So for the above example one can specify: collap(data, ~ id, custom = list(fsum_uw = ..., fmean = ...), w = ~ weights) to get the weighted mean and the simple sum. Note that the _uw functions are not available for use outside collap. Thus one also needs to quote them when passing to the FUN or catFUN arguments, e.g. use collap(data, ~ id, fmean, "fmode_uw", w = ~ weights).

Value

X aggregated. If X is not a data frame it is coerced to one using qDF and then aggregated.

Examples

## A Simple Introduction --------------------------------------
head(iris)
collap(iris, ~ Species)                                        # Default: FUN = fmean for numeric
collapv(iris, 5)                                               # Same using collapv
collap(iris, ~ Species, fmedian)                               # Using the median
collap(iris, ~ Species, fmedian, keep.col.order = FALSE)       # Groups in-front
collap(iris, Sepal.Width + Petal.Width ~ Species, fmedian)     # Only '.Width' columns
collapv(iris, 5, cols = c(2, 4))                               # Same using collapv
collap(iris, ~ Species, list(fmean, fmedian))                  # Two functions
collap(iris, ~ Species, list(fmean, fmedian), return = "long") # Long format
collapv(iris, 5, custom = list(fmean = 1:2, fmedian = 3:4))    # Custom aggregation
collapv(iris, 5, custom = list(fmean = 1:2, fmedian = 3:4),    # Raw output, no column reordering
        return = "list")
collapv(iris, 5, custom = list(fmean = 1:2, fmedian = 3:4),    # A strange choice..
        return = "long")
collap(iris, ~ Species, w = ~ Sepal.Length)                    # Using Sepal.Length as weights, ..
weights <- abs(rnorm(fnrow(iris)))
collap(iris, ~ Species, w = weights)                           # Some random weights..
collap(iris, iris$Species, w = weights)                        # Note this behavior..
collap(iris, iris$Species, w = weights,
       keep.by = FALSE, keep.w = FALSE)



## Multi-Type Aggregation --------------------------------------
head(wlddev)                                                    # World Development Panel Data
head(collap(wlddev, ~ country + decade))                        # Aggregate by country and decade
head(collap(wlddev, ~ country + decade, fmedian, ffirst))       # Different functions
head(collap(wlddev, ~ country + decade, cols = is.numeric))     # Aggregate only numeric columns
head(collap(wlddev, ~ country + decade, cols = 9:13))           # Only the 5 series
head(collap(wlddev, PCGDP + LIFEEX ~ country + decade))         # Only GDP and life-expactancy
head(collap(wlddev, PCGDP + LIFEEX ~ country + decade, fsum))   # Using the sum instead
head(collap(wlddev, PCGDP + LIFEEX ~ country + decade, sum,     # Same using base::sum -> slower!
            na.rm = TRUE))
head(collap(wlddev, wlddev[c("country","decade")], fsum,        # Same, exploring different inputs
            cols = 9:10))
head(collap(wlddev[9:10], wlddev[c("country","decade")], fsum))
head(collapv(wlddev, c("country","decade"), fsum))              # ..names/indices with collapv
head(collapv(wlddev, c(1,5), fsum))

g <- GRP(wlddev, ~ country + decade)                            # Precomputing the grouping
head(collap(wlddev, g, keep.by = FALSE))                        # This is slightly faster now
# Aggregate categorical data using not the mode but the last element
head(collap(wlddev, ~ country + decade, fmean, flast))
head(collap(wlddev, ~ country + decade, catFUN = flast,         # Aggregate only categorical data
            cols = is_categorical))


## Weighted Aggregation ----------------------------------------
# We aggregate to region level using population weights
head(collap(wlddev, ~ region + year, w = ~ POP))                # Takes weighted mean for numeric..
# ..and weighted mode for categorical data. The weight vector is aggregated using fsum

head(collap(wlddev, ~ region + year, w = ~ POP,                 # Aggregating weights using sum
            wFUN = list(sum = fsum, max = fmax)))               # and max (corresponding to mode)


## Multi-Function Aggregation ----------------------------------
head(collap(wlddev, ~ country + decade, list(mean = fmean, N = fnobs),  # Saving mean and Nobs
            cols = 9:13))

head(collap(wlddev, ~ country + decade,                         # Same using base R -> slower
            list(mean = mean,
                 N = function(x, ...) sum(!is.na(x))),
            cols = 9:13, na.rm = TRUE))

lapply(collap(wlddev, ~ country + decade,                       # List output format
       list(mean = fmean, N = fnobs), cols = 9:13, return = "list"), head)

head(collap(wlddev, ~ country + decade,                         # Long output format
     list(mean = fmean, N = fnobs), cols = 9:13, return = "long"))

head(collap(wlddev, ~ country + decade,                         # Also aggregating categorical data,
     list(mean = fmean, N = fnobs), return = "long_dupl"))      # and duplicating it 2 times

head(collap(wlddev, ~ country + decade,                         # Now also using 2 functions on
     list(mean = fmean, N = fnobs), list(mode = fmode, last = flast),   # categorical data
            keep.col.order = FALSE))

head(collap(wlddev, ~ country + decade,                         # More functions, string input,
            c("fmean","fsum","fnobs","fsd","fvar"),             # parallelized execution
            c("fmode","ffirst","flast","fndistinct"),           # (choose more than 1 cores,
            parallel = TRUE, mc.cores = 1L,                     # depending on your machine)
            keep.col.order = FALSE))


## Custom Aggregation ------------------------------------------
head(collap(wlddev, ~ country + decade,                         # Custom aggregation
            custom = list(fmean = 11:13, fsd = 9:10, fmode = 7:8)))

head(collap(wlddev, ~ country + decade,                         # Using column names
            custom = list(fmean = "PCGDP", fsd = c("LIFEEX","GINI"),
                          flast = "date")))

head(collap(wlddev, ~ country + decade,                         # Weighted parallelized custom
            custom = list(fmean = 9:12, fsd = 9:10,             # aggregation
                          fmode = 7:8), w = ~ POP,
            wFUN = list(fsum, fmax),
            parallel = TRUE, mc.cores = 1L))

head(collap(wlddev, ~ country + decade,                         # No column reordering
            custom = list(fmean = 9:12, fsd = 9:10,
                          fmode = 7:8), w = ~ POP,
            wFUN = list(fsum, fmax),
            parallel = TRUE, mc.cores = 1L, keep.col.order = FALSE))

## Piped Use --------------------------------------------------
iris |> fgroup_by(Species) |> collapg()
wlddev |> fgroup_by(country, decade) |> collapg() |> head()
wlddev |> fgroup_by(region, year) |> collapg(w = POP) |> head()
wlddev |> fgroup_by(country, decade) |> collapg(fmedian, flast) |> head()
wlddev |> fgroup_by(country, decade) |>
  collapg(custom = list(fmean = 9:12, fmode = 5:7, flast = 3)) |> head()

SebKrantz/collapse documentation built on June 12, 2025, 1:44 a.m.