# stream-stats: Streaming Summary Statistics

From the matter package: A framework for rapid prototyping with file-based data structures

## Description

These functions allow calculation of streaming statistics. They are useful, for example, for calculating summary statistics on small chunks of a larger dataset, and then combining them to calculate the summary statistics for the whole dataset.

This is not particularly interesting for simple statistics like `sum()`, whose per-chunk results combine trivially, but it is useful for statistics like `sd()` or `var()`, where the per-chunk values alone are not enough to recover the result for the whole dataset.
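The difference can be seen with base R alone. The following is a small illustration that does not use this package's functions; the variable names and the simulated data are assumptions chosen for the example.

```r
set.seed(42)
x <- rnorm(50)            # first chunk
y <- rnorm(30, mean = 5)  # second chunk, with a different mean

# Sums of the pieces add up to the sum of the whole.
all.equal(sum(x) + sum(y), sum(c(x, y)))

# Variances of the pieces do not combine this way: averaging them
# ignores the difference between the chunk means.
pooled_var   <- var(c(x, y))
averaged_var <- mean(c(var(x), var(y)))
c(pooled = pooled_var, averaged = averaged_var)
```

Combining variances correctly also requires the number of observations and the mean of each chunk, which is the bookkeeping a streaming statistic has to carry along.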

## Usage

```r
# calculate streaming univariate statistics
s_range(x, ..., na.rm = FALSE)

s_min(x, ..., na.rm = FALSE)

s_max(x, ..., na.rm = FALSE)

s_prod(x, ..., na.rm = FALSE)

s_sum(x, ..., na.rm = FALSE)

s_mean(x, ..., na.rm = FALSE)

s_var(x, ..., na.rm = FALSE)

s_sd(x, ..., na.rm = FALSE)

s_any(x, ..., na.rm = FALSE)

s_all(x, ..., na.rm = FALSE)

s_nnzero(x, ..., na.rm = FALSE)

# calculate streaming matrix statistics
colstreamStats(x, stat, na.rm = FALSE, ...)

rowstreamStats(x, stat, na.rm = FALSE, ...)

# calculate combined summary statistics
stat_c(x, y, ...)
```

## Arguments

| Argument | Description |
|---|---|
| `x, y, ...` | Object(s) on which to calculate a summary statistic, or a summary statistic to combine. |
| `stat` | The name of a summary statistic to compute over the rows or columns of a matrix. Allowable values include: `"range"`, `"min"`, `"max"`, `"prod"`, `"sum"`, `"mean"`, `"var"`, `"sd"`, `"any"`, `"all"`, and `"nnzero"`. |
| `na.rm` | If `TRUE`, remove `NA` values before summarizing. |

## Details

These summary statistic methods are intended to be applied to chunks of a larger dataset. The per-chunk results can then be combined, either with the individual summary statistic functions or with `stat_c()`, to produce the summary statistic for the full dataset. This is most useful for calculating running variances and standard deviations iteratively when computing them on the full dataset at once would be difficult or impossible, for example because the data do not fit in memory.
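The chunk-then-combine idea can be sketched in base R. This is not the package's implementation: `chunk_summary()` and `merge_summary()` are hypothetical helpers written for this illustration, and the merge step uses the standard pairwise formula for combining counts, means, and sums of squared deviations.

```r
# Summarize one chunk as (n, mean, M2), where M2 is the sum of
# squared deviations from the chunk mean.
chunk_summary <- function(x) {
  m <- mean(x)
  list(n = length(x), mean = m, M2 = sum((x - m)^2))
}

# Merge two chunk summaries without revisiting the raw data.
merge_summary <- function(a, b) {
  n <- a$n + b$n
  delta <- b$mean - a$mean
  list(
    n    = n,
    mean = a$mean + delta * b$n / n,
    M2   = a$M2 + b$M2 + delta^2 * a$n * b$n / n
  )
}

set.seed(1)
x <- rnorm(1000)
chunks <- split(x, rep(1:10, each = 100))  # ten chunks of 100 values

# Summarize each chunk, then fold the summaries together.
total <- Reduce(merge_summary, lapply(chunks, chunk_summary))
running_var <- total$M2 / (total$n - 1)

all.equal(running_var, var(x))
```

Only three numbers per chunk need to be kept, so each chunk can be discarded as soon as it has been summarized.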

The variances and standard deviations are calculated using running sum-of-squares formulas, which can be updated iteratively and remain accurate for large floating-point datasets (see References).
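To illustrate why the choice of formula matters, here is a minimal base R sketch of a Welford-style update (one observation at a time) compared with the textbook one-pass formula. `welford_update()` is a hypothetical helper written for this example, not a function from this package, and the package's internal formulas may differ in detail.

```r
# Welford-style update: fold one observation into (n, mean, M2),
# where M2 is the running sum of squared deviations from the mean.
welford_update <- function(state, value) {
  n <- state$n + 1
  delta <- value - state$mean
  new_mean <- state$mean + delta / n
  list(n = n, mean = new_mean,
       M2 = state$M2 + delta * (value - new_mean))
}

# Data with a large offset relative to its spread.
set.seed(3)
x <- 1e9 + rnorm(1000)

state <- Reduce(welford_update, x, list(n = 0, mean = 0, M2 = 0))
welford_var <- state$M2 / (state$n - 1)

# Textbook one-pass formula: subtracts two huge, nearly equal numbers,
# so it loses most of its significant digits on data like this.
naive_var <- (sum(x^2) - sum(x)^2 / length(x)) / (length(x) - 1)

c(reference = var(x), welford = welford_var, naive = naive_var)
```

In double precision the running update should stay close to `var(x)`, while the naive formula is expected to be far off because of cancellation.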

## Value

For all univariate functions except `s_range()`, a single number giving the summary statistic. For `s_range()`, two numbers giving the minimum and the maximum values.

For `colstreamStats()` and `rowstreamStats()`, a vector of summary statistics.

## Author(s)

Kylie A. Bemis

## References

B. P. Welford, “Note on a Method for Calculating Corrected Sums of Squares and Products,” Technometrics, vol. 4, no. 3, pp. 419-420, Aug. 1962.

B. O'Neill, “Some Useful Moment Results in Sampling Problems,” The American Statistician, vol. 68, no. 4, pp. 282-296, Sep. 2014.

## See Also

`Summary`

## Examples

```r
set.seed(1)
x <- sample(1:100, size=10)
y <- sample(1:100, size=10)

sx <- s_var(x)
sy <- s_var(y)

var(c(x, y))
stat_c(sx, sy) # should be the same

sxy <- stat_c(sx, sy)

# calculate with 1 new observation
var(c(x, y, 99))
stat_c(sxy, 99)

# calculate over rows of a matrix
set.seed(2)
A <- matrix(rnorm(100), nrow=10)
B <- matrix(rnorm(100), nrow=10)

sx <- rowstreamStats(A, "var")
sy <- rowstreamStats(B, "var")

apply(cbind(A, B), 1, var)
stat_c(sx, sy) # should be the same
```