fquantile: Fast (Weighted) Sample Quantiles and Range

View source: R/my_RcppExports.R

fquantileR Documentation

Fast (Weighted) Sample Quantiles and Range

Description

A faster alternative to quantile (written fully in C), that supports sampling weights, and can also quickly compute quantiles from an ordering vector (e.g. order(x)). frange provides a fast alternative to range.

Usage

fquantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), w = NULL,
          o = if(length(x) > 1e5L && length(probs) > log(length(x)))
              radixorder(x) else NULL,
          na.rm = .op[["na.rm"]], type = 7L, names = TRUE,
          check.o = is.null(attr(o, "sorted")))

# Programmers version: no names, intelligent defaults, or checks
.quantile(x, probs = c(0, 0.25, 0.5, 0.75, 1), w = NULL, o = NULL,
          na.rm = TRUE, type = 7L, names = FALSE, check.o = FALSE)

# Fast range (min and max)
frange(x, na.rm = .op[["na.rm"]], finite = FALSE)
.range(x, na.rm = TRUE, finite = FALSE)

Arguments

x

a numeric or integer vector.

probs

numeric vector of probabilities with values in [0,1].

w

a numeric vector of strictly positive sampling weights. Missing weights are only supported if x is also missing.

o

integer. An vector giving the ordering of the elements in x, such that identical(x[o], sort(x)). If available this considerably speeds up the estimation.

na.rm

logical. Remove missing values, default TRUE.

finite

logical. Omit all non-finite values.

type

integer. Quantile types 4-9. See quantile. Further details are provided in Hyndman and Fan (1996) who recommended type 8. The default method is type 7.

names

logical. Generates names of the form paste0(round(probs * 100, 1), "%") (in C). Set to FALSE for speedup.

check.o

logical. If o is supplied, TRUE runs through o once and checks that it is valid, i.e. that each element is in [1, length(x)]. Set to FALSE for significant speedup if o is known to be valid.

Details

fquantile is implemented using a quickselect algorithm in C, inspired by data.table's gmedian. The algorithm is applied incrementally to different sections of the array to find individual quantiles. If many quantile probabilities are requested, sorting the whole array with the fast radixorder algorithm is more efficient. The default threshold for this (length(x) > 1e5L && length(probs) > log(length(x))) is conservative, given that quickselect is generally more efficient on longitudinal data with similar values repeated by groups. With random data, my investigations yield that a threshold of length(probs) > log10(length(x)) would be more appropriate.

frange is considerably more efficient than range, requiring only one pass through the data instead of two. For probabilities 0 and 1, fquantile internally calls frange.

Following Hyndman and Fan (1996), the quantile type-i quantile function of the sample X can be written as a weighted average of two order statistics:

\hat{Q}_{X,i}(p) = (1 - \gamma) X_{(j)} + \gamma X_{(j + 1)}

where j = \lfloor pn + m \rfloor,\ m \in \mathbb{R} and \gamma = pn + m - j,\ 0 \le \gamma \le 1, with m differing by quantile type (i). For example, the default type 7 quantile estimator uses m = 1 - p, see quantile.

For weighted data with normalized weights w = \{w_1, ..., w_n\}, where w_k > 0 and \sum_k w_k = 1, let \{w_{(1)}, ..., w_{(n)}\} be the weights for each order statistic and W_{(k)} = \operatorname{Weight}[X_j \le X_{(k)}] = \sum_{j=1}^k w_{(j)} the cumulative weight for each order statistic.

We can then first find the largest value l such that the cumulative normalized weight W_{(l)} \leq p, and replace pn with l + (p - W_{(l)})/w_{(l+1)}, where w_{(l+1)} is the weight of the next observation. This gives:

j = \lfloor l + \frac{p - W_{(l)}}{w_{(l+1)}} + m \rfloor

\gamma = l + \frac{p - W_{(l)}}{w_{(l+1)}} + m - j

For a more detailed exposition see these excellent notes by Matthew Kay. See also the R implementation of weighted quantiles type 7 in the Examples below.

Value

A vector of quantiles. If names = TRUE, fquantile generates names as paste0(round(probs * 100, 1), "%") (in C).

Note

The new weighted quantile algorithm from v2.1.0 does not skip zero weights anymore as this is technically very difficult (it is not clear if j hits a zero weight element whether one should move forward or backward to find an alternative). Thus, all non-missing elements are considered and weights should be strictly positive.

Author(s)

Sebastian Krantz based on notes by Matthew Kay.

References

Hyndman, R. J. and Fan, Y. (1996) Sample quantiles in statistical packages, American Statistician 50, 361–365. doi:10.2307/2684934.

Wicklin, R. (2017) Sample quantiles: A comparison of 9 definitions; SAS Blog. https://blogs.sas.com/content/iml/2017/05/24/definitions-sample-quantiles.html

Wikipedia: https://en.wikipedia.org/wiki/Quantile#Estimating_quantiles_from_a_sample

Weighted Quantiles by Matthew Kay: https://htmlpreview.github.io/?https://github.com/mjskay/uncertainty-examples/blob/master/weighted-quantiles.html

See Also

fnth, Fast Statistical Functions, Collapse Overview

Examples

## Basic range and quantiles
frange(mtcars$mpg)
fquantile(mtcars$mpg)

## Checking computational equivalence to stats::quantile()
w = alloc(abs(rnorm(1)), 32)
o = radixorder(mtcars$mpg)
for (i in 5:9) print(all_obj_equal(fquantile(mtcars$mpg, type = i),
                                   fquantile(mtcars$mpg, type = i, w = w),
                                   fquantile(mtcars$mpg, type = i, o = o),
                                   fquantile(mtcars$mpg, type = i, w = w, o = o),
                                    quantile(mtcars$mpg, type = i)))

## Demonstaration: weighted quantiles type 7 in R
wquantile7R <- function(x, w, probs = c(0.25, 0.5, 0.75), na.rm = TRUE, names = TRUE) {
  if(na.rm && anyNA(x)) {             # Removing missing values (only in x)
    cc = whichNA(x, invert = TRUE)    # The C code first calls radixorder(x), which places
    x = x[cc]; w = w[cc]              # missing values last, so removing = early termination
  }
  o = radixorder(x)                   # Ordering
  wo = proportions(w[o])
  Wo = cumsum(wo)                     # Cumulative sum
  res = sapply(probs, function(p) {
    l = which.max(Wo > p) - 1L        # Lower order statistic
    s = l + (p - Wo[l])/wo[l+1L] + 1 - p
    j = floor(s)
    gamma = s - j
    (1 - gamma) * x[o[j]] + gamma * x[o[j+1L]]  # Weighted quantile
  })
  if(names) names(res) = paste0(as.integer(probs * 100), "%")
  res
} # Note: doesn't work for min and max.

wquantile7R(mtcars$mpg, mtcars$wt)

all.equal(wquantile7R(mtcars$mpg, mtcars$wt),
          fquantile(mtcars$mpg, c(0.25, 0.5, 0.75), mtcars$wt))

## Efficient grouped quantile estimation: use .quantile for less call overhead
BY(mtcars$mpg, mtcars$cyl, .quantile, names = TRUE, expand.wide = TRUE)
BY(mtcars, mtcars$cyl, .quantile, names = TRUE)
mtcars |> fgroup_by(cyl) |> BY(.quantile)

## With weights
BY(mtcars$mpg, mtcars$cyl, .quantile, w = mtcars$wt, names = TRUE, expand.wide = TRUE)
BY(mtcars, mtcars$cyl, .quantile, w = mtcars$wt, names = TRUE)
mtcars |> fgroup_by(cyl) |> fselect(-wt) |> BY(.quantile, w = mtcars$wt)
mtcars |> fgroup_by(cyl) |> fsummarise(across(-wt, .quantile, w = wt))


collapse documentation built on April 15, 2025, 1:10 a.m.