collapse: Advanced and Fast Data Transformation

fnth-fmedian

R Documentation

Fast (Grouped, Weighted) N'th Element/Quantile for Matrix-Like Objects

Description

fnth (column-wise) returns the n'th smallest element from a set of unsorted elements x corresponding to an integer index (n), or to a probability between 0 and 1. If n is passed as a probability, ties can be resolved using the lower, upper, or average of the possible elements, or (default) continuous quantile estimation. For n > 1, the lower element is always returned (as in sort(x, partial = n)[n]). See Details.

fmedian is a simple wrapper around fnth, which fixes n = 0.5 and (default) ties = "mean", i.e., it averages eligible elements. See Details.

Usage

fnth(x, n = 0.5, ...)
fmedian(x, ...)

## Default S3 method:
fnth(x, n = 0.5, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, ties = "q7", nthreads = .op[["nthreads"]],
     o = NULL, check.o = is.null(attr(o, "sorted")), ...)
## Default S3 method:
fmedian(x, ..., ties = "mean")

## S3 method for class 'matrix'
fnth(x, n = 0.5, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, drop = TRUE, ties = "q7", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'matrix'
fmedian(x, ..., ties = "mean")

## S3 method for class 'data.frame'
fnth(x, n = 0.5, g = NULL, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = TRUE, drop = TRUE, ties = "q7", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'data.frame'
fmedian(x, ..., ties = "mean")

## S3 method for class 'grouped_df'
fnth(x, n = 0.5, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
     use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, stub = .op[["stub"]],
     ties = "q7", nthreads = .op[["nthreads"]], ...)
## S3 method for class 'grouped_df'
fmedian(x, w = NULL, TRA = NULL, na.rm = .op[["na.rm"]],
        use.g.names = FALSE, keep.group_vars = TRUE, keep.w = TRUE, stub = .op[["stub"]],
        ties = "mean", nthreads = .op[["nthreads"]], ...)

Arguments

x

a numeric vector, matrix, data frame or grouped data frame (class 'grouped_df').

n

the element to return using a single integer index such that 1 < n < NROW(x), or a probability 0 < n < 1. See Details.

g

a factor, GRP object, atomic vector (internally converted to factor) or a list of vectors / factors (internally converted to a GRP object) used to group x.

w

a numeric vector of (non-negative) weights, may contain missing values only where x is also missing.

TRA

an integer or quoted operator indicating the transformation to perform: 0 - "na" | 1 - "fill" | 2 - "replace" | 3 - "-" | 4 - "-+" | 5 - "/" | 6 - "%" | 7 - "+" | 8 - "*" | 9 - "%%" | 10 - "-%%". See TRA.

na.rm

logical. Skip missing values in x. Defaults to TRUE and implemented at very little computational cost. If na.rm = FALSE a NA is returned when encountered.

use.g.names

logical. Make group-names and add to the result as names (default method) or row-names (matrix and data frame methods). No row-names are generated for data.table's.

ties

an integer or character string specifying the method to resolve ties between adjacent qualifying elements:

Int.	String	Description
1	"mean"	take the arithmetic mean of all qualifying elements.
2	"min"	take the smallest of the elements.
3	"max"	take the largest of the elements.
4-9	"qn"	continuous quantile types 4-9, see `fquantile`.

nthreads

integer. The number of threads to utilize. Parallelism is across groups for grouped computations on vectors and data frames, and at the column-level otherwise. See Details.

o

integer. A valid ordering of x, e.g. radixorder(x). With groups, the grouping needs to be accounted e.g. radixorder(g, x).

check.o

logical. TRUE checks that each element of o is within [1, length(x)]. The default uses the fact that orderings from radixorder have a "sorted" attribute which let's fnth infer that the ordering is valid. The length and data type of o is always checked, regardless of check.o.

drop

matrix and data.frame method: Logical. TRUE drops dimensions and returns an atomic vector if g = NULL and TRA = NULL.

keep.group_vars

grouped_df method: Logical. FALSE removes grouping variables after computation.

keep.w

grouped_df method: Logical. Retain sum of weighting variable after computation (if contained in grouped_df).

stub

character. If keep.w = TRUE and stub = TRUE (default), the summed weights column is prefixed by "sum.". Users can specify a different prefix through this argument, or set it to FALSE to avoid prefixing.

...

for fmedian: further arguments passed to fnth (apart from n). If TRA is used, passing set = TRUE will transform data by reference and return the result invisibly.

Details

fnth uses a combination of quickselect, quicksort, and radixsort algorithms, combined with several (weighted) quantile estimation methods and, where possible, OpenMP multithreading:

without weights, quickselect is used to determine a (lower) order statistic. If ties %!in% c("min", "max") a second order statistic is found by taking the max of the upper part of the partitioned array, and the two statistics are averaged using a simple mean (ties = "mean"), or weighted average according to a quantile method (ties = "q4"-"q9"). For n = 0.5, all supported quantile methods give the sample median. With matrices, multithreading is always across columns, for vectors and data frames it is across groups unless is.null(g) for data frames.
with weights and no groups (is.null(g)), radixorder is called internally (on each column of x). The ordering is used to sum the weights in order of x and determine weighted order statistics or quantiles. See details below. Multithreading is disabled as radixorder cannot be called concurrently on the same memory stack.
with weights and groups (!is.null(g)), R's quicksort algorithm is used to sort the data in each group and return an index which can be used to sum the weights in order and proceed as before. This is multithreaded across columns for matrices, and across groups otherwise.
in fnth.default, an ordering of x can be supplied to 'o' e.g. fnth(x, 0.75, o = radixorder(x)). This dramatically speeds up the estimation both with and without weights, and is useful if fnth is to be invoked repeatedly on the same data. With groups, o needs to also account for the grouping e.g. fnth(x, 0.75, g, o = radixorder(g, x)). Multithreading is possible across groups. See Examples.

If n > 1, the result is equivalent to (column-wise) sort(x, partial = n)[n]. Internally, n is converted to a probability using p = (n-1)/(NROW(x)-1), and that probability is applied to the set of non-missing elements to find the as.integer(p*(fnobs(x)-1))+1L'th element (which corresponds to option ties = "min"). When using grouped computations with n > 1, n is transformed to a probability p = (n-1)/(NROW(x)/ng-1) (where ng contains the number of unique groups in g).

If weights are used and ties = "q4"-"q9", weighted continuous quantile estimation is done as described in fquantile.

For ties %in% c("mean", "min", "max"), a target partial sum of weights p*sum(w) is calculated, and the weighted n'th element is the element k such that all elements smaller than k have a sum of weights <= p*sum(w), and all elements larger than k have a sum of weights <= (1 - p)*sum(w). If the partial-sum of weights (p*sum(w)) is reached exactly for some element k, then (summing from the lower end) both k and k+1 would qualify as the weighted n'th element. If the weight of element k+1 is zero, k, k+1 and k+2 would qualify... . If n > 1, k is chosen (consistent with the unweighted behavior). If 0 < n < 1, the ties option regulates how to resolve such conflicts, yielding lower (ties = "min": k), upper (ties = "max": k+2) or average weighted (ties = "mean": mean(k, k+1, k+2)) n'th elements.

Thus, in the presence of zero weights, the weighted median (default ties = "mean") can be an arithmetic average of >2 qualifying elements.

For data frames, column-attributes and overall attributes are preserved if g is used or drop = FALSE.

Value

The (w weighted) n'th element/quantile of x, grouped by g, or (if TRA is used) x transformed by its (grouped, weighted) n'th element/quantile.

Examples

## default vector method
mpg <- mtcars$mpg
fnth(mpg)                         # Simple nth element: Median (same as fmedian(mpg))
fnth(mpg, 5)                      # 5th smallest element
sort(mpg, partial = 5)[5]         # Same using base R, fnth is 2x faster.
fnth(mpg, 0.75)                   # Third quartile
fnth(mpg, 0.75, w = mtcars$hp)    # Weighted third quartile: Weighted by hp
fnth(mpg, 0.75, TRA = "-")        # Simple transformation: Subtract third quartile
fnth(mpg, 0.75, mtcars$cyl)             # Grouped third quartile
fnth(mpg, 0.75, mtcars[c(2,8:9)])       # More groups..
g <- GRP(mtcars, ~ cyl + vs + am)       # Precomputing groups gives more speed !
fnth(mpg, 0.75, g)
fnth(mpg, 0.75, g, mtcars$hp)           # Grouped weighted third quartile
fnth(mpg, 0.75, g, TRA = "-")           # Groupwise subtract third quartile
fnth(mpg, 0.75, g, mtcars$hp, "-")      # Groupwise subtract weighted third quartile

## data.frame method
fnth(mtcars, 0.75)
head(fnth(mtcars, 0.75, TRA = "-"))
fnth(mtcars, 0.75, g)
fnth(fgroup_by(mtcars, cyl, vs, am), 0.75)   # Another way of doing it..
fnth(mtcars, 0.75, g, use.g.names = FALSE)   # No row-names generated

## matrix method
m <- qM(mtcars)
fnth(m, 0.75)
head(fnth(m, 0.75, TRA = "-"))
fnth(m, 0.75, g) # etc..

## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fnth(0.75)
mtcars |> fgroup_by(cyl,vs,am) |> fnth(0.75, hp)         # Weighted
mtcars |> fgroup_by(cyl,vs,am) |> fnth(0.75, TRA = "/")  # Divide by third quartile
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg, hp) |>    # Faster selecting
      fnth(0.75, hp, "/")  # Divide mpg by its third weighted group-quartile, using hp as weights

# Efficient grouped estimation of multiple quantiles
mtcars |> fgroup_by(cyl,vs,am) |>
    fmutate(o = radixorder(GRPid(), mpg)) |>
    fsummarise(mpg_Q1 = fnth(mpg, 0.25, o = o),
               mpg_median = fmedian(mpg, o = o),
               mpg_Q3 = fnth(mpg, 0.75, o = o))

## fmedian()
fmedian(mpg)                         # Simple median value
fmedian(mpg, w = mtcars$hp)          # Weighted median: Weighted by hp
fmedian(mpg, TRA = "-")              # Simple transformation: Subtract median value
fmedian(mpg, mtcars$cyl)             # Grouped median value
fmedian(mpg, mtcars[c(2,8:9)])       # More groups..
fmedian(mpg, g)
fmedian(mpg, g, mtcars$hp)           # Grouped weighted median
fmedian(mpg, g, TRA = "-")           # Groupwise subtract median value
fmedian(mpg, g, mtcars$hp, "-")      # Groupwise subtract weighted median value

## data.frame method
fmedian(mtcars)
head(fmedian(mtcars, TRA = "-"))
fmedian(mtcars, g)
fmedian(fgroup_by(mtcars, cyl, vs, am))   # Another way of doing it..
fmedian(mtcars, g, use.g.names = FALSE)   # No row-names generated

## matrix method
fmedian(m)
head(fmedian(m, TRA = "-"))
fmedian(m, g) # etc..

## method for grouped data frames - created with dplyr::group_by or fgroup_by
mtcars |> fgroup_by(cyl,vs,am) |> fmedian()
mtcars |> fgroup_by(cyl,vs,am) |> fmedian(hp)           # Weighted
mtcars |> fgroup_by(cyl,vs,am) |> fmedian(TRA = "-")    # De-median
mtcars |> fgroup_by(cyl,vs,am) |> fselect(mpg, hp) |>   # Faster selecting
      fmedian(hp, "-")  # Weighted de-median mpg, using hp as weights

SebKrantz/collapse documentation built on June 12, 2025, 1:44 a.m.