fast-statistical-functions | R Documentation |
With fsum, fprod, fmean, fmedian, fmode, fvar, fsd, fmin, fmax, fnth, ffirst, flast, fnobs and fndistinct, collapse presents a coherent set of extremely fast and flexible statistical functions (S3 generics) to perform column-wise, grouped and weighted computations on vectors, matrices and data frames, with special support for grouped data frames / tibbles (dplyr) and data.tables.
## All functions (FUN) follow a common syntax in 4 methods:
FUN(x, ...)

## Default S3 method:
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, [nthreads = 1L,] ...)

## S3 method for class 'matrix'
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, drop = TRUE, [nthreads = 1L,] ...)

## S3 method for class 'data.frame'
FUN(x, g = NULL, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = TRUE, drop = TRUE, [nthreads = 1L,] ...)

## S3 method for class 'grouped_df'
FUN(x, [w = NULL,] TRA = NULL, [na.rm = TRUE,]
    use.g.names = FALSE, keep.group_vars = TRUE,
    [keep.w = TRUE,] [stub = TRUE,] [nthreads = 1L,] ...)
x | a vector, matrix, data frame or grouped data frame (class 'grouped_df').
g | a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.
w | a numeric vector of (non-negative) weights, may contain missing values. Supported by fsum, fprod, fmean, fmedian, fnth, fvar, fsd and fmode.
TRA | an integer or quoted operator indicating the transformation to perform: 0 - "na", 1 - "fill", 2 - "replace", 3 - "-", 4 - "-+", 5 - "/", 6 - "%", 7 - "+", 8 - "*", 9 - "%%", 10 - "-%%". See TRA.
na.rm | logical. Skip missing values in x. Defaults to TRUE in all functions and is implemented at very little computational cost. Not available for fnobs.
use.g.names | logical. Generate group names and add them to the result as names (default method) or row names (matrix and data frame methods). No row names are generated for data.tables.
nthreads | integer. The number of threads to utilize. Supported by fsum, fmean, fmedian, fnth, fmode and fndistinct.
drop | matrix and data.frame methods: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.
keep.group_vars | grouped_df method: Logical. FALSE removes grouping variables after computation. By default grouping variables are added, even if not present in the grouped_df.
keep.w | grouped_df method: Logical. TRUE (default) also aggregates weights and saves them in a column, FALSE removes the weighting variable after computation (if contained in grouped_df).
stub | grouped_df method: Character. If keep.w = TRUE and stub = TRUE (default), the aggregated weights column is prefixed by the name of the aggregation function (mostly "sum."). Users can specify a different prefix through this argument, or set it to FALSE to avoid prefixing.
... | arguments to be passed to or from other methods. If TRA is used, passing set = TRUE will transform data by reference and return the result invisibly (except for the grouped_df method which always returns visible output).
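As a rough illustration of the g, w and TRA arguments, the sketch below checks that the integer and quoted forms of TRA select the same transformation and compares a few results with base R. It assumes collapse is installed and uses the built-in mtcars data; the base R comparisons (sum, ave, weighted.mean) are added here for illustration only and are not part of the package documentation.

```r
library(collapse)

mpg <- mtcars$mpg
cyl <- mtcars$cyl

# Integer code 5 and the quoted operator "/" should select the same transformation
by_code <- fsum(mpg, TRA = 5L)
by_name <- fsum(mpg, TRA = "/")
identical(by_code, by_name)

# Without groups, "/" divides every value by the overall sum
all.equal(by_name, mpg / sum(mpg))

# With groups, "-" subtracts each observation's group mean
all.equal(fmean(mpg, cyl, TRA = "-"), mpg - ave(mpg, cyl))

# A weighted mean via w should agree with base R's weighted.mean
all.equal(fmean(mpg, w = mtcars$hp), weighted.mean(mpg, mtcars$hp))
```

Each comparison is expected to evaluate to TRUE; all.equal is used rather than identical for the numeric checks because the summation order may differ from base R.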
Please see the documentation of individual functions.
x suitably aggregated or transformed. Data frame column-attributes and overall attributes are generally preserved if the output is of the same data type.
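To make the shape of the return value more concrete, here is a small sketch of how drop and use.g.names affect the output. It assumes collapse is installed and uses the built-in mtcars data; the object names v, d, s and s0 are placeholders chosen for this illustration.

```r
library(collapse)

# Data frame method without groups: drop = TRUE (the default) returns an atomic vector
v <- fsum(mtcars)
is.atomic(v)
length(v) == ncol(mtcars)

# drop = FALSE keeps the result as a one-row data frame
d <- fsum(mtcars, drop = FALSE)
is.data.frame(d)
nrow(d)

# Default (vector) method with groups: group names become the names of the result
s <- fsum(mtcars$mpg, mtcars$cyl)
names(s)

# use.g.names = FALSE skips the group names
s0 <- fsum(mtcars$mpg, mtcars$cyl, use.g.names = FALSE)
is.null(names(s0))
```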
Functions fquantile and frange are for atomic vectors.
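A minimal sketch of these two functions on an atomic vector follows. It assumes a collapse version recent enough to export both fquantile and frange, and the comparison against base R's range and quantile (default type 7) is an illustration added here, not a statement from the package documentation.

```r
library(collapse)

mpg <- mtcars$mpg

# frange returns the minimum and maximum of an atomic vector
all.equal(frange(mpg), range(mpg))

# fquantile with explicit probabilities, compared with stats::quantile
p <- c(0.25, 0.5, 0.75)
all.equal(unname(fquantile(mpg, p)), unname(quantile(mpg, p)))
```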
Panel-decomposed (i.e. between and within) statistics as well as grouped and weighted skewness and kurtosis are implemented in qsu.
The vector-valued functions and operators fcumsum, fscale/STD, fbetween/B, fhdbetween/HDB, fwithin/W, fhdwithin/HDW, flag/L/F, fdiff/D/Dlog and fgrowth/G are grouped under Data Transformations and Time Series and Panel Series. These functions also support indexed data (plm).
## default vector method
mpg <- mtcars$mpg
fsum(mpg)                                       # Simple sum
fsum(mpg, TRA = "/")                            # Simple transformation: divide all values by the sum
fsum(mpg, mtcars$cyl)                           # Grouped sum
fmean(mpg, mtcars$cyl)                          # Grouped mean
fmean(mpg, w = mtcars$hp)                       # Weighted mean, weighted by hp
fmean(mpg, mtcars$cyl, mtcars$hp)               # Grouped mean, weighted by hp
fsum(mpg, mtcars$cyl, TRA = "/")                # Proportions / division by group sums
fmean(mpg, mtcars$cyl, mtcars$hp,               # Subtract weighted group means, see also ?fwithin
      TRA = "-")

## data.frame method
fsum(mtcars)
fsum(mtcars, TRA = "%")                         # This computes percentages
fsum(mtcars, mtcars[c(2,8:9)])                  # Grouped column sum
g <- GRP(mtcars, ~ cyl + vs + am)               # Here precomputing the groups!
fsum(mtcars, g)                                 # Faster !!
fmean(mtcars, g, mtcars$hp)
fmean(mtcars, g, mtcars$hp, "-")                # Demeaning by weighted group means..
fmean(fgroup_by(mtcars, cyl, vs, am), hp, "-")  # Another way of doing it..

fmode(wlddev, drop = FALSE)                     # Compute statistical modes of variables in this data
fmode(wlddev, wlddev$income)                    # Grouped statistical modes ..

## matrix method
m <- qM(mtcars)
fsum(m)
fsum(m, g)                                      # ..

## method for grouped data frames - created with dplyr::group_by or fgroup_by
library(dplyr)
mtcars |> group_by(cyl,vs,am) |> select(mpg,carb) |> fsum()
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg,carb) |> fsum()   # equivalent and faster !!
mtcars |> fgroup_by(cyl,vs,am) |> fsum(TRA = "%")
mtcars |> fgroup_by(cyl,vs,am) |> fmean(hp)                     # weighted grouped mean, save sum of weights
mtcars |> fgroup_by(cyl,vs,am) |> fmean(hp, keep.group_vars = FALSE)
## This compares fsum with data.table (2 threads) and base::rowsum
library(data.table)       # needed for the mtcDT[, ..., by = f] syntax below
library(microbenchmark)

# Starting with small data
mtcDT <- qDT(mtcars)
f <- qF(mtcars$cyl)

microbenchmark(mtcDT[, lapply(.SD, sum), by = f],
               rowsum(mtcDT, f, reorder = FALSE),
               fsum(mtcDT, f, na.rm = FALSE), unit = "relative")

#                              expr        min         lq      mean    median        uq       max neval cld
# mtcDT[, lapply(.SD, sum), by = f] 145.436928 123.542134 88.681111 98.336378 71.880479 85.217726   100   c
# rowsum(mtcDT, f, reorder = FALSE)   2.833333   2.798203  2.489064  2.937889  2.425724  2.181173   100   b
#     fsum(mtcDT, f, na.rm = FALSE)   1.000000   1.000000  1.000000  1.000000  1.000000  1.000000   100   a

# Now larger data
tdata <- qDT(replicate(100, rnorm(1e5), simplify = FALSE))  # 100 columns with 100.000 obs
f <- qF(sample.int(1e4, 1e5, TRUE))                         # A factor with 10.000 groups

microbenchmark(tdata[, lapply(.SD, sum), by = f],
               rowsum(tdata, f, reorder = FALSE),
               fsum(tdata, f, na.rm = FALSE), unit = "relative")

#                              expr      min       lq     mean   median       uq       max neval cld
# tdata[, lapply(.SD, sum), by = f] 2.646992 2.975489 2.834771 3.081313 3.120070 1.2766475   100   c
# rowsum(tdata, f, reorder = FALSE) 1.747567 1.753313 1.629036 1.758043 1.839348 0.2720937   100   b
#     fsum(tdata, f, na.rm = FALSE) 1.000000 1.000000 1.000000 1.000000 1.000000 1.0000000   100   a
Collapse Overview, Data Transformations, Time Series and Panel Series