knitr::opts_chunk$set(
  collapse = TRUE,
  comment = "#>",
  fig.path = "man/figures/README-",
  out.width = "100%"
)

fqr: Fast (and Accurate) Quantile Regression

Codecov test coverage R-CMD-check

The fqr package makes quantile regression fast and scaleable using accelerated gradient descent. This project began as an attempt at a smoothed version of the quantile objective function based on Nesterov's 2005 paper for use in the quantspace package for problems too large for interior point methods to handle. The conquer package has a similar, and really efficient approach to gradient descent based on kernel approximations of the quantile loss function, and an extremely useful step-size selection approach. Initially, the relevant code in this package was drawn from their implementation in the conquer package and paper. Now this package uses a backtracking line search method with acceleration rather than the Barzilai-Borwein approach from conquer, but the conquer package is a really incredible piece of work you should also check out (it's often slightly faster than this pacakge).

fqrcan handle quantile regression problems on the order of 10 million rows and 100 columns in less than a minute, and can exactly match existing implementations on small problems.

While the quantile loss function isn’t differentiable, you can get an arbitrarily close smooth approximation by replacing the “check” function with an appropriately tilted least squares approximation for a small neighborhood around the origin. As the size of that window goes to zero, you have your check function back!

The package uses 2 stopping rules to assess convergence: the maximum value of the gradient vector (for the coefficients of the quantile regression) and the relative change in the loss function (scaled by the step size).

Installation

You can install the fqr package from github by running

# get remotes if needed:
# install.packages("remotes")

remotes::install_github("be-green/fqr")

Basic Use

The fqr package uses the same basic formula interface that lm does, with standard errors calculated based on subsampling.

library(fqr)
data(rocks)

fqr(area ~ peri, data = rock, tau = c(0.25, 0.5, 0.75))

To turn off standard errors (and just get point predictions), you can set se = F.

fqr(area ~ peri, data = rock, tau = 0.25, nsubsamples = 100)

Benchmarks

Ok, but how fast is this approach? Let's just take some point estimates and see how it goes.

Medium N, Medium P

But with all of this done, let's compare to some benchmarks from the sfn and pfn algorithms, which are currently the fastest in the quantreg package.

p <- 50
n <- 1e6
beta <- rnorm(p + 1)

x <- cbind(1, matrix(rnorm(p * n), ncol = p, nrow = n))
y <- x %*% beta + exp(rnorm(n, sd = 2))

# let's take a look at what this looks like
hist(y)

Ok so we have some very skewed data! Perfect for median regression.

# remove the intercept since it's implied in the formula
start = proc.time()
# lower level version that just takes design matrix
fit <- fit_fqr(x, y, tau = 0.5, se = F)
end = proc.time()
end - start

I attempted to run the same thing with the quantreg package, with the method advised for large datasets, like so:

# newton interior point method w/ pre-processing
start <- proc.time()
fit_pfn <- quantreg::rq.fit.pfn(x, y, tau = 0.5)
end <- proc.time()
end - start

So for a reasonably sized problem, fqr is ~2 times as fast?

Big N, Big P

Let's benchmark with a bigger set of columns & rows.

p <- 100
n <- 3e6
beta <- rnorm(p + 1)

x <- cbind(1, matrix(rnorm(p * n), ncol = p, nrow = n))
y <- 10 + x %*% beta + exp(rnorm(n, sd = 2))
start = proc.time()
fit <- fit_fqr(x, y, tau = 0.5, se = F)
end = proc.time()
end - start

I'm not going to run the quantreg pfn algorithm since it was so slow for the last problem. seconds.

Big N, Small P

Let's try a more manageable set of dimensions, with lots of observations.

p <- 10
n <- 1e8
beta <- rnorm(p + 1)

x <- cbind(1, matrix(rnorm(p * n), ncol = p, nrow = n))
y <- x %*% beta + exp(rnorm(n, sd = 2))

start = proc.time()
fit <- fit_fqr(X = x, y = y, tau = 0.5, se = F)
end = proc.time()
end - start

I attempted to do the comparable thing for the pfn algorithm:

start = proc.time()
fit <- quantreg::rq.fit.pfn(x = x, y = y, tau = 0.5)
end = proc.time()
end - start

...but I killed the process after 15 minutes or so.

Medium-scale Problem

Ok, so we haven't been able to run quantreg on these datasets, let's see how it does with a sort of medium-scale problem. Let's use the same DGP.

p <- 10
n <- 1e5
beta <- rnorm(p + 1)

x <- cbind(1, matrix(rnorm(p * n), ncol = p, nrow = n))
y <- x %*% beta + exp(rnorm(n, sd = 2))

start = proc.time()
fit <- fit_fqr(X = x, y = y, tau = 0.5, se = F)
end = proc.time()
end - start
start = proc.time()
fit_pfn <- quantreg::rq.fit.pfn(x = x, y = y, tau = 0.5)
end = proc.time()
end - start

The coefficients match out to the 4th or 5th decimal place:

max(abs(fit$coefficients - fit_pfn$coefficients))
min(abs(fit$coefficients - fit_pfn$coefficients))

Small Problems

It can also be faster for small problems, and with conservative tolerance parameters will come extremely close to the default quantreg outputs. Here's an example:

# simulate some data, 3 x 10,000
p <- 3
n <- 10000
beta <- rnorm(p)
x <- cbind(matrix(rnorm(p * n), ncol = p, nrow = n))
y <- 10 + x %*% beta + exp(rnorm(n, sd = 2))

microbenchmark::microbenchmark(
  fqr_fit <- fqr(y ~ ., se = F,
                 data = data.frame(y = y, x)),
  br_fit <- quantreg::rq(y ~ ., tau = 0.5, 
                    data = data.frame(y = y, x), method = "br"),
                    times = 100
)

The coefficients match out to 3 decimal places:

fqr_fit$coefficients - br_fit$coefficients

And the check loss is nearly identical:

check <- function (x, tau = 0.5) {
  sum(x * (tau - (x < 0)))
}

check(fqr_fit$residuals) -  check(br_fit$residuals)

You can get even more accurate results (at the cost of some speed) by lowering the tolerance for the gradient and loss function.

Still, though, the speed gains are most noticeable once N and P are "medium" or larger, otherwise probably just use quantreg.



be-green/fqr documentation built on Dec. 19, 2021, 7:41 a.m.