stream_stat: Streaming Summary Statistics

stream_statR Documentation

Streaming Summary Statistics

Description

These functions allow calculation of streaming statistics. They are useful, for example, for calculating summary statistics on small chunks of a larger dataset, and then combining them to calculate the summary statistics for the whole dataset.

This is not particularly interesting for simpler, commutative statistics like sum(), but it is useful for calculating non-commutative statistics like running sd() or var() on pieces of a larger dataset.

Usage

# calculate streaming univariate statistics
s_stat(x, stat, group, na.rm = FALSE, ...)

s_range(x, ..., na.rm = FALSE)

s_min(x, ..., na.rm = FALSE)

s_max(x, ..., na.rm = FALSE)

s_prod(x, ..., na.rm = FALSE)

s_sum(x, ..., na.rm = FALSE)

s_mean(x, ..., na.rm = FALSE)

s_var(x, ..., na.rm = FALSE)

s_sd(x, ..., na.rm = FALSE)

s_any(x, ..., na.rm = FALSE)

s_all(x, ..., na.rm = FALSE)

s_nnzero(x, ..., na.rm = FALSE)

# calculate streaming matrix statistics
s_rowstats(x, stat, group, na.rm = FALSE, ...)

s_colstats(x, stat, group, na.rm = FALSE, ...)

# calculate combined summary statistics
stat_c(x, y, ...)

Arguments

x, y, ...

Object(s) on which to calculate a summary statistic, or a summary statistic to combine.

stat

The name of a summary statistic to compute over the rows or columns of a matrix. Allowable values include: "range", "min", "max", "prod", "sum", "mean", "var", "sd", "any", "all", and "nnzero".

group

A factor or vector giving the grouping. If not provided, no grouping will be used.

na.rm

If TRUE, remove NA values before summarizing.

Details

These summary statistics methods are intended to be applied to chunks of a larger dataset. They can then be combined either with the individual summary statistic functions, or with stat_c(), to produce the combined summary statistic for the full dataset. This is most useful for calculating running variances and standard deviations iteratively, which would be difficult or impossible to calculate on the full dataset.

The variances and standard deviations are calculated using running sum of squares formulas which can be calculated iteratively and are accurate for large floating-point datasets (see reference).

Value

For all univariate functions except s_range(), a single number giving the summary statistic. For s_range(), two numbers giving the minimum and the maximum values.

For s_rowstats() and s_colstats(), a vector of summary statistics.

Author(s)

Kylie A. Bemis

References

B. P. Welford. “Note on a Method for Calculating Corrected Sums of Squares and Products.” Technometrics, vol. 4, no. 3, pp. 1-3, Aug. 1962.

B. O'Neill. “Some Useful Moment Results in Sampling Problems.” The American Statistician, vol. 68, no. 4, pp. 282-296, Sep. 2014.

See Also

Summary

Examples

set.seed(1)
x <- sample(1:100, size=10)
y <- sample(1:100, size=10)

sx <- s_var(x)
sy <- s_var(y)

var(c(x, y))
stat_c(sx, sy) # should be the same

sxy <- stat_c(sx, sy)

# calculate with 1 new observation
var(c(x, y, 99))
stat_c(sxy, 99)

# calculate over rows of a matrix
set.seed(2)
A <- matrix(rnorm(100), nrow=10)
B <- matrix(rnorm(100), nrow=10)

sx <- s_rowstats(A, "var")
sy <- s_rowstats(B, "var")

apply(cbind(A, B), 1, var)
stat_c(sx, sy) # should be the same

kuwisdelu/matter documentation built on Dec. 8, 2024, 8:09 p.m.