The `fqr` package makes quantile regression fast and scalable using accelerated gradient descent. This project began as an attempt to build a smoothed version of the quantile objective function, based on Nesterov's 2005 paper, for use in the `quantspace` package on problems too large for interior point methods to handle. The `conquer` package takes a similar and very efficient approach to gradient descent, based on kernel approximations of the quantile loss function, along with an extremely useful step-size selection method. Initially, the relevant code in this package was drawn from the `conquer` implementation and paper. This package now uses a backtracking line search with acceleration rather than `conquer`'s Barzilai-Borwein step sizes, but `conquer` is an incredible piece of work you should also check out (it's often slightly faster than this package).
`fqr` can handle quantile regression problems on the order of 10 million rows and 100 columns in less than a minute, and it can exactly match existing implementations on small problems.
While the quantile loss function isn't differentiable, you can get an arbitrarily close smooth approximation by replacing the "check" function with an appropriately tilted least-squares approximation in a small neighborhood around the origin. As the size of that window goes to zero, you have your check function back!
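For intuition, here's one way to build that kind of approximation: a tilted, Huber-style quadratic patch of half-width `h` around zero. This is just an illustration (the exact smoother and window used inside `fqr` may differ), but it shows why the trick works: the smoothed loss has a derivative everywhere, so gradient-based methods apply.

```r
# Illustration only: a smoothed check loss with a quadratic patch of
# half-width h around the origin. Outside the window it equals the usual
# check loss, and as h -> 0 it converges to the check loss itself.
check_loss <- function(u, tau) {
  u * (tau - (u < 0))
}

smooth_check_loss <- function(u, tau, h = 0.05) {
  ifelse(
    abs(u) <= h,
    u^2 / (4 * h) + (tau - 0.5) * u + h / 4,  # tilted quadratic near zero
    check_loss(u, tau)                        # exact check loss elsewhere
  )
}

# unlike the raw check loss, the smoothed version is differentiable everywhere
smooth_check_grad <- function(u, tau, h = 0.05) {
  ifelse(abs(u) <= h, u / (2 * h) + (tau - 0.5), tau - (u < 0))
}
```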
The package uses two stopping rules to assess convergence: the maximum absolute value of the gradient vector (with respect to the quantile regression coefficients) and the relative change in the loss function (scaled by the step size).
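To make the moving parts concrete, here's a toy version of the whole procedure in plain R: Nesterov-style acceleration plus a backtracking line search on the smoothed loss, with both stopping rules. It reuses `smooth_check_loss()` and `smooth_check_grad()` from the sketch above. The tolerances, the momentum schedule, and the exact form of the loss-based rule are all illustrative; `fqr`'s actual solver differs in the details.

```r
# Illustrative only: smoothed quantile regression via accelerated gradient
# descent with backtracking. fqr's real implementation differs in many details.
fit_smoothed_qr <- function(X, y, tau = 0.5, h = 0.05,
                            grad_tol = 1e-4, loss_tol = 1e-8,
                            max_iter = 1000L) {
  n <- nrow(X)
  loss <- function(b) mean(smooth_check_loss(y - X %*% b, tau, h))
  grad <- function(b) drop(-crossprod(X, smooth_check_grad(y - X %*% b, tau, h))) / n

  beta <- beta_prev <- rep(0, ncol(X))
  step <- 1

  for (k in seq_len(max_iter)) {
    # Nesterov-style momentum: evaluate the gradient at an extrapolated point
    z <- beta + (k - 1) / (k + 2) * (beta - beta_prev)
    g <- grad(z)
    f_z <- loss(z)

    # backtracking line search: halve the step until we see sufficient decrease
    step <- step * 2
    while (loss(z - step * g) > f_z - 0.5 * step * sum(g^2) && step > 1e-12) {
      step <- step / 2
    }
    beta_new <- z - step * g

    # stopping rule 1: the gradient is (element-wise) small
    # stopping rule 2: the change in the loss, scaled by the step size, is small
    if (max(abs(g)) < grad_tol ||
        abs(loss(beta_new) - loss(beta)) / step < loss_tol) {
      beta <- beta_new
      break
    }
    beta_prev <- beta
    beta <- beta_new
  }
  beta
}
```

Again, this is only meant to show the structure of the algorithm, not to stand in for the package's implementation.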
You can install the `fqr` package from GitHub by running:
```r
# get remotes if needed:
# install.packages("remotes")
remotes::install_github("be-green/fqr")
```
The `fqr` package uses the same basic formula interface that `lm` does, with standard errors calculated based on subsampling.
```r
library(fqr)

data(rock)
fqr(area ~ peri, data = rock, tau = c(0.25, 0.5, 0.75))
```
To turn off standard errors (and just get point estimates), you can set `se = F`. You can also control the number of subsamples used for the standard errors with `nsubsamples`.
```r
fqr(area ~ peri, data = rock, tau = 0.25, se = F)
fqr(area ~ peri, data = rock, tau = 0.25, nsubsamples = 100)
```
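As a rough mental model for the subsampled standard errors: fit the model on many small subsamples drawn without replacement, then rescale the spread of those estimates to the full sample size. The sketch below is purely illustrative; the subsample size, the rescaling, and the other details of what `fqr` does internally are assumptions here, not a description of the package.

```r
# Purely illustrative: subsample-based standard errors by hand. The subsample
# size m and the sqrt(m / n) rescaling are assumptions, not fqr's internals.
subsample_se <- function(formula, data, tau, nsubsamples = 100,
                         m = floor(nrow(data)^0.7)) {
  n <- nrow(data)
  coefs <- replicate(nsubsamples, {
    idx <- sample.int(n, m)
    fit <- fqr(formula, data = data[idx, , drop = FALSE], tau = tau, se = F)
    as.numeric(fit$coefficients)
  })
  # each subsample estimate is noisier by a factor of roughly sqrt(n / m),
  # so scale the subsample spread back down to the full-sample level
  apply(coefs, 1, sd) * sqrt(m / n)
}

subsample_se(area ~ peri, data = rock, tau = 0.25)
```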
OK, but how fast is this approach? Let's take some point estimates and see how it goes, comparing against benchmarks from the `sfn` and `pfn` algorithms, which are currently the fastest methods in the `quantreg` package.
```r
p <- 50
n <- 1e6
beta <- rnorm(p + 1)

x <- cbind(1, matrix(rnorm(p * n), ncol = p, nrow = n))
y <- x %*% beta + exp(rnorm(n, sd = 2))

# let's take a look at what this looks like
hist(y)
```
OK, so we have some very skewed data! Perfect for median regression.
```r
# lower-level interface that takes the design matrix directly
start <- proc.time()
fit <- fit_fqr(x, y, tau = 0.5, se = F)
end <- proc.time()
end - start
```
I attempted to run the same thing with the `quantreg` package, with the method advised for large datasets, like so:
```r
# Newton interior point method w/ pre-processing
start <- proc.time()
fit_pfn <- quantreg::rq.fit.pfn(x, y, tau = 0.5)
end <- proc.time()
end - start
```
So for a reasonably sized problem, `fqr` is roughly twice as fast.
Let's benchmark with a bigger set of columns & rows.
```r
p <- 100
n <- 3e6
beta <- rnorm(p + 1)

x <- cbind(1, matrix(rnorm(p * n), ncol = p, nrow = n))
y <- 10 + x %*% beta + exp(rnorm(n, sd = 2))
```
```r
start <- proc.time()
fit <- fit_fqr(x, y, tau = 0.5, se = F)
end <- proc.time()
end - start
```
I'm not going to run the `quantreg` `pfn` algorithm here, since it was so slow on the last problem; the timings above from `proc.time()` are reported in seconds.
Let's try a more manageable set of dimensions, with lots of observations.
```r
p <- 10
n <- 1e8
beta <- rnorm(p + 1)

x <- cbind(1, matrix(rnorm(p * n), ncol = p, nrow = n))
y <- x %*% beta + exp(rnorm(n, sd = 2))

start <- proc.time()
fit <- fit_fqr(X = x, y = y, tau = 0.5, se = F)
end <- proc.time()
end - start
```
I attempted to do the comparable thing for the `pfn` algorithm:
```r
start <- proc.time()
fit <- quantreg::rq.fit.pfn(x = x, y = y, tau = 0.5)
end <- proc.time()
end - start
```
...but I killed the process after 15 minutes or so.
OK, so we haven't been able to run `quantreg` on these larger datasets. Let's see how it does on a medium-scale problem, using the same DGP.
```r
p <- 10
n <- 1e5
beta <- rnorm(p + 1)

x <- cbind(1, matrix(rnorm(p * n), ncol = p, nrow = n))
y <- x %*% beta + exp(rnorm(n, sd = 2))

start <- proc.time()
fit <- fit_fqr(X = x, y = y, tau = 0.5, se = F)
end <- proc.time()
end - start
```
```r
start <- proc.time()
fit_pfn <- quantreg::rq.fit.pfn(x = x, y = y, tau = 0.5)
end <- proc.time()
end - start
```
The coefficients match out to the 4th or 5th decimal place:
```r
max(abs(fit$coefficients - fit_pfn$coefficients))
min(abs(fit$coefficients - fit_pfn$coefficients))
```
It can also be faster for small problems, and with conservative tolerance parameters it will come extremely close to the default `quantreg` outputs. Here's an example:
```r
# simulate some data, 3 x 10,000
p <- 3
n <- 10000
beta <- rnorm(p)

x <- cbind(matrix(rnorm(p * n), ncol = p, nrow = n))
y <- 10 + x %*% beta + exp(rnorm(n, sd = 2))

microbenchmark::microbenchmark(
  fqr_fit <- fqr(y ~ ., se = F, data = data.frame(y = y, x)),
  br_fit <- quantreg::rq(y ~ ., tau = 0.5, data = data.frame(y = y, x),
                         method = "br"),
  times = 100
)
```
The coefficients match out to 3 decimal places:
```r
fqr_fit$coefficients - br_fit$coefficients
```
And the check loss is nearly identical:
```r
check <- function(x, tau = 0.5) {
  sum(x * (tau - (x < 0)))
}

check(fqr_fit$residuals) - check(br_fit$residuals)
```
You can get even more accurate results (at the cost of some speed) by lowering the tolerance for the gradient and loss function.
Still, the speed gains are most noticeable once N and P are "medium"-sized or larger; for small problems you're probably fine just using `quantreg`.