fast-statistical-functions  R Documentation
With fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fnobs and fndistinct, collapse presents a coherent set of extremely fast and flexible statistical functions (S3 generics) to perform column-wise, grouped and weighted computations on vectors, matrices and data frames, with special support for grouped data frames / tibbles (dplyr) and data.table's.
## All functions (FUN) follow a common syntax in 4 methods:
FUN(x, ...)

## Default S3 method:
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, [nthreads = 1L,] ...)

## S3 method for class 'matrix'
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, drop = TRUE, [nthreads = 1L,] ...)

## S3 method for class 'data.frame'
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, drop = TRUE, [nthreads = 1L,] ...)

## S3 method for class 'grouped_df'
FUN(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = FALSE, keep.group_vars = TRUE,
    [keep.w = TRUE,] [stub = TRUE,] [nthreads = 1L,] ...)

x  a vector, matrix, data frame or grouped data frame (class 'grouped_df').  
g  a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

w  a numeric vector of (non-negative) weights, may contain missing values. Supported by fsum, fprod, fmean, fmedian, fnth, fvar, fsd and fmode.

TRA  an integer or quoted operator indicating the transformation to perform:
0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See TRA.

na.rm  logical. Skip missing values in x. Defaults to TRUE in all functions and implemented at very little computational cost. Not available for fnobs.

use.g.names  logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's.

nthreads  integer. The number of threads to utilize. Supported by fsum, fmean, fmedian, fnth, fmode and fndistinct.

drop  matrix and data.frame methods: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.

keep.group_vars  grouped_df method: Logical. FALSE removes grouping variables after computation. By default grouping variables are added, even if not present in the grouped_df. 

keep.w  grouped_df method: Logical. TRUE (default) also aggregates weights and saves them in a column, FALSE removes the weighting variable after computation (if contained in the grouped_df).

stub  grouped_df method: Character. If keep.w = TRUE and stub = TRUE (default), the aggregated weights column is prefixed by the name of the aggregation function (mostly "sum."). Users can specify a different prefix through this argument, or set it to FALSE to avoid prefixing.

...  arguments to be passed to or from other methods. If TRA is used, passing set = TRUE will transform data by reference and return the result invisibly (except for the grouped_df method which always returns visible output). 
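
As a rough illustration of the w and TRA arguments described above, the following sketch compares a few calls with base R expressions on the built-in mtcars data. The base R comparisons (weighted.mean, ave) are chosen here for illustration and are not part of collapse; they only hold in this form because the data contain no missing values.

```r
library(collapse)

mpg <- mtcars$mpg
cyl <- mtcars$cyl
hp  <- mtcars$hp

# Weighted mean: expected to agree with stats::weighted.mean
wm <- fmean(mpg, w = hp)
all.equal(wm, weighted.mean(mpg, hp))

# TRA = "-" subtracts the group statistic from each observation,
# comparable to centering with ave() in base R
centered <- fmean(mpg, cyl, TRA = "-")
all.equal(centered, mpg - ave(mpg, cyl))

# TRA = "/" divides each observation by the group statistic,
# here giving each car's share of its group's total mpg
shares <- fsum(mpg, cyl, TRA = "/")
all.equal(shares, mpg / ave(mpg, cyl, FUN = sum))

# TRA = "replace" returns the group statistic expanded to the length of x
expanded <- fmean(mpg, cyl, TRA = "replace")
all.equal(expanded, ave(mpg, cyl))
```

In each case the transformed result has the same length as the input, whereas calls without TRA return one value per group.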

Please see the documentation of individual functions.
x suitably aggregated or transformed. Data frame column-attributes and overall attributes are generally preserved if the output is of the same data type.
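
To make the possible output shapes more concrete, here is a small sketch using fsum and fmean on mtcars. The object names (v, d, gsum, tr, gw, gnow) are illustrative only, and the exact layout may differ across collapse versions.

```r
library(collapse)

# Ungrouped with drop = TRUE (default): a named atomic vector,
# one element per column
v <- fsum(mtcars)
is.atomic(v)

# Ungrouped with drop = FALSE: the result stays a one-row data frame
d <- fsum(mtcars, drop = FALSE)
is.data.frame(d)

# Grouped: one row per group, with group names stored as row names
gsum <- fsum(mtcars, mtcars$cyl)
rownames(gsum)

# Transformed (TRA): the output has the same dimensions as the input
tr <- fsum(mtcars, mtcars$cyl, TRA = "/")
dim(tr)

# grouped_df method with weights: the aggregated weights are kept in a
# prefixed column (keep.w = TRUE, stub = TRUE) ...
gw <- fmean(fgroup_by(mtcars, cyl), hp)
names(gw)

# ... or dropped from the output with keep.w = FALSE
gnow <- fmean(fgroup_by(mtcars, cyl), hp, keep.w = FALSE)
names(gnow)
```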
Functions fquantile and frange are for atomic vectors.
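
A minimal sketch of these two functions on a plain numeric vector, assuming a collapse version that exports fquantile (it was added later than the other functions on this page). The comparison with base R's range and quantile is for illustration and assumes both use their default quantile type.

```r
library(collapse)

x <- mtcars$mpg

# frange returns the minimum and maximum of the vector together
r <- frange(x)
all.equal(r, range(x))

# fquantile computes sample quantiles at the requested probabilities
q <- fquantile(x, probs = c(0.25, 0.5, 0.75))
all.equal(unname(q), unname(quantile(x, c(0.25, 0.5, 0.75))))
```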
Panel-decomposed (i.e. between and within) statistics as well as grouped and weighted skewness and kurtosis are implemented in qsu.
The vector-valued functions and operators fcumsum, fscale/STD, fbetween/B, fhdbetween/HDB, fwithin/W, fhdwithin/HDW, flag/L/F, fdiff/D/Dlog and fgrowth/G are grouped under Data Transformations and Time Series and Panel Series. These functions also support indexed data (plm).
library(collapse)

## default vector method
mpg <- mtcars$mpg
fsum(mpg)                                  # Simple sum
fsum(mpg, TRA = "/")                       # Simple transformation: divide all values by the sum
fsum(mpg, mtcars$cyl)                      # Grouped sum
fmean(mpg, mtcars$cyl)                     # Grouped mean
fmean(mpg, w = mtcars$hp)                  # Weighted mean, weighted by hp
fmean(mpg, mtcars$cyl, mtcars$hp)          # Grouped mean, weighted by hp
fsum(mpg, mtcars$cyl, TRA = "/")           # Proportions / division by group sums
fmean(mpg, mtcars$cyl, mtcars$hp,          # Subtract weighted group means, see also ?fwithin
      TRA = "-")

## data.frame method
fsum(mtcars)
fsum(mtcars, TRA = "%")                    # This computes percentages
fsum(mtcars, mtcars[c(2,8:9)])             # Grouped column sum
g <- GRP(mtcars, ~ cyl + vs + am)          # Here precomputing the groups!
fsum(mtcars, g)                            # Faster !!
fmean(mtcars, g, mtcars$hp)
fmean(mtcars, g, mtcars$hp, "-")           # Demeaning by weighted group means..
fmean(fgroup_by(mtcars, cyl, vs, am), hp, "-")  # Another way of doing it..

fmode(wlddev, drop = FALSE)                # Compute statistical modes of variables in this data
fmode(wlddev, wlddev$income)               # Grouped statistical modes ..

## matrix method
m <- qM(mtcars)
fsum(m)
fsum(m, g)                                 # ..

## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
mtcars |> group_by(cyl,vs,am) |> select(mpg,carb) |> fsum()
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg,carb) |> fsum()  # equivalent and faster !!
mtcars |> fgroup_by(cyl,vs,am) |> fsum(TRA = "%")
mtcars |> fgroup_by(cyl,vs,am) |> fmean(hp)  # weighted grouped mean, save sum of weights
mtcars |> fgroup_by(cyl,vs,am) |> fmean(hp, keep.group_vars = FALSE)

## This compares fsum with data.table (2 threads) and base::rowsum
# Starting with small data
mtcDT <- qDT(mtcars)
f <- qF(mtcars$cyl)

library(microbenchmark)
microbenchmark(mtcDT[, lapply(.SD, sum), by = f],
               rowsum(mtcDT, f, reorder = FALSE),
               fsum(mtcDT, f, na.rm = FALSE), unit = "relative")

#                              expr        min         lq      mean    median        uq       max neval cld
# mtcDT[, lapply(.SD, sum), by = f] 145.436928 123.542134 88.681111 98.336378 71.880479 85.217726   100   c
# rowsum(mtcDT, f, reorder = FALSE)   2.833333   2.798203  2.489064  2.937889  2.425724  2.181173   100  b
#     fsum(mtcDT, f, na.rm = FALSE)   1.000000   1.000000  1.000000  1.000000  1.000000  1.000000   100 a

# Now larger data
tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE))  # 100 columns with 100.000 obs
f <- qF(sample.int(1e4, 1e5, TRUE))                         # A factor with 10.000 groups

microbenchmark(tdata[, lapply(.SD, sum), by = f],
               rowsum(tdata, f, reorder = FALSE),
               fsum(tdata, f, na.rm = FALSE), unit = "relative")

#                              expr      min       lq     mean   median       uq       max neval cld
# tdata[, lapply(.SD, sum), by = f] 2.646992 2.975489 2.834771 3.081313 3.120070 1.2766475   100   c
# rowsum(tdata, f, reorder = FALSE) 1.747567 1.753313 1.629036 1.758043 1.839348 0.2720937   100  b
#     fsum(tdata, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100 a

Collapse Overview, Data Transformations, Time Series and Panel Series