# sum_by: Efficient by-group (weighted) summation

In gustave: A User-Oriented Statistical Toolkit for Analytical Variance Estimation

## Description

`sum_by` performs an efficient and optionally weighted by-group summation by using linear algebra and the Matrix package capabilities. The by-group summation is performed through matrix cross-product of the `y` parameter (coerced to a matrix if needed) with a (very) sparse matrix built up using the `by` and the (optional) `w` parameters.
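The cross-product idea described above can be sketched as follows. This is a minimal illustration under stated assumptions, not gustave's actual implementation: the helper name `sum_by_sketch` is hypothetical, it only handles a dense numeric matrix `y`, and it does no `NA` handling.

```r
library(Matrix)

# Hypothetical helper (not part of gustave): by-group weighted sums
# computed through a sparse indicator matrix.
sum_by_sketch <- function(y, by, w = NULL) {
  by <- as.factor(by)
  if (is.null(w)) w <- rep(1, length(by))
  # n x H sparse matrix: row i holds a single non-zero entry, w[i],
  # in the column of the group that observation i belongs to
  S <- sparseMatrix(
    i = seq_along(by), j = as.integer(by), x = w,
    dims = c(length(by), nlevels(by)),
    dimnames = list(NULL, levels(by))
  )
  # t(S) %*% y is an H x p matrix of (weighted) group sums
  as.matrix(crossprod(S, y))
}

y <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2, dimnames = list(NULL, c("v1", "v2")))
by <- c("a", "b", "a")
res <- sum_by_sketch(y, by, w = c(2, 1, 1))
res
```

Each row of the sparse matrix has exactly one non-zero entry, so its storage grows with the number of observations rather than with observations times groups.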

Compared to base R, dplyr or data.table alternatives, this implementation aims to be easier to use in a matrix-oriented context and can yield efficiency gains when `y` has many columns.
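For reference, one base R route to the same kind of result is `rowsum()`. The source does not name the alternatives it has in mind, so treating `rowsum()` as the comparison point is an assumption; the data below are made up for illustration.

```r
# Small illustrative data set (not from the package documentation)
y <- matrix(c(1, 2, 3, 4, 5, 6), ncol = 2, dimnames = list(NULL, c("v1", "v2")))
by <- c("a", "b", "a")
w <- c(2, 1, 1)

# Unweighted by-group sums
unweighted <- rowsum(y, by)

# Weighted by-group sums: w is recycled down each column, so row i of y
# is multiplied by w[i] before summing
weighted <- rowsum(y * w, by)

unweighted
weighted
```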

## Usage

```r
sum_by(y, by, w = NULL, na_rm = TRUE, keep_sparse = FALSE)
```

## Arguments

- `y`: A (sparse) vector, a (sparse) matrix or a data.frame. The object to perform by-group summation on.
- `by`: The factor variable defining the by-groups. Character variables are coerced to factors.
- `w`: The optional row weights to be used in the summation.
- `na_rm`: Should `NA` values in `y` be removed (i.e. treated as 0 in the summation)? Similar to the `na.rm` argument of `sum`, but `TRUE` by default. If `FALSE`, `NA` values in `y` produce `NA` values in the result.
- `keep_sparse`: When `y` is a sparse vector or a sparse matrix, should the result also be sparse? `FALSE` by default. As `sparseVector-class` does not have a names attribute, when `y` is a sparseVector the result does not have any names (and a warning is issued).
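The two behaviours described for `na_rm` can be mimicked in base R as below. This only illustrates the documented semantics (an `NA` treated as 0 versus an `NA` propagating to its group's sum); it uses `rowsum()` as a stand-in and is not the package's code.

```r
y <- c(1, NA, 3, 4)
by <- c("a", "a", "b", "b")

# Analogue of na_rm = TRUE: NA values are treated as 0 before summing
y_zeroed <- y
y_zeroed[is.na(y_zeroed)] <- 0
removed <- rowsum(y_zeroed, by)

# Analogue of na_rm = FALSE: the NA propagates to the sum of its group only
kept <- rowsum(y, by)

removed
kept
```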

## Value

A vector, a matrix or a data.frame depending on the type of `y`. If `y` is sparse and `keep_sparse = TRUE`, then the result is also sparse (without names when it is a sparse vector, see the `keep_sparse` argument for details).

## Author(s)

Martin Chevalier

## Examples

```r
# Data generation
set.seed(1)
n <- 100
p <- 10
H <- 3
y <- matrix(rnorm(n*p), ncol = p, dimnames = list(NULL, paste0("var", 1:10)))
y[1, 1] <- NA
by <- letters[sample.int(H, n, replace = TRUE)]
w <- rep(1, n)
w[by == "a"] <- 2

# Standard use
sum_by(y, by)

# Keeping the NAs
sum_by(y, by, na_rm = FALSE)

# With a weight
sum_by(y, by, w = w)
```

gustave documentation built on Nov. 10, 2021, 5:08 p.m.