title: "Qtools: A Collection of Models and Tools for Quantile Inference" output: rmarkdown::html_vignette: toc: true bibliography: Qtools.bib author: Marco Geraci, University of South Carolina vignette: > %\VignetteIndexEntry{Qtools: A Collection of Models and Tools for Quantile Inference} %\VignetteEngine{knitr::rmarkdown} %\usepackage[utf8]{inputenc}
library(knitr) #Determine the output format of the document outputFormat = opts_knit$get("rmarkdown.pandoc.to") #Figure and Table Caption Numbering, for HTML do it manually capTabNo = 1; capFigNo = 1; #Function to add the Table Number capTab = function(x){ if(outputFormat == 'html'){ x = paste0("Table ",capTabNo,". ",x) capTabNo <<- capTabNo + 1 }; x } #Function to add the Figure Number capFig = function(x){ if(outputFormat == 'html'){ x = paste0("Figure ",capFigNo,". ",x) capFigNo <<- capFigNo + 1 }; x }
Quantiles play a fundamental role in statistics. The quantile function defines the distribution of a random variable and, thus, provides a way to describe the data that is specular but equivalent to that given by the corresponding cumulative distribution function. There are many advantages in working with quantiles, starting from their properties. The renewed interest in their usage seen in the last years is due to the theoretical, methodological, and software contributions that have broadened their applicability. This paper presents the R package Qtools, a collection of utilities for unconditional and conditional quantiles.
Quantiles have a long history in applied statistics, especially the median. The analysis of astronomical data by Galileo Galilei in 1632 [@hald_2003, p.149] and geodic measurements by Roger Boscovich in 1757 [@koenker_2005, p.2] are presumably the earliest examples of application of the least absolute deviation ($L_1$) estimator in its, respectively, unconditional and conditional forms. The theoretical studies on quantiles of continuous random variables started to appear in the statistical literature of the 20th century. In the case of discrete data, studies have somewhat lagged behind most probably because of the analytical drawbacks surrounding the discontinuities that characterise discrete quantile functions. Some forms of approximation to continuity have been recently proposed to study the large sample behavior of quantile estimators. For example, @ma_etal_2011 have demonstrated the asymptotic normality of unconditional sample quantiles based on the definition of the mid-distribution function [@parzen_2004]. @machado_silva_2005 proposed inferential approaches to the estimation of conditional quantiles for counts based on data jittering.
Functions implementing quantile methods can be found in common statistical software. A considerable number of R packages that provide such functions are available on the Comprehensive R Archive Network (CRAN). The base package stats contains basic functions to estimate sample quantiles or compute quantiles of common parametric distributions. The quantreg package [@quantreg] is arguably a benchmark for distribution-free estimation of linear quantile regression models, as well as the base for other packages which make use of linear programming (LP) algorithms [@koenker_dorey;@koenker_park_1996]. Other contributions to the modelling of conditional quantile functions include packages for Bayesian regression, e.g. bayesQR [@benoit_etal] and BSquare [@smith_reich], and the lqmm package [@geraci_bottai_2014;@geraci_2104] for random-effects regression.
The focus of this paper is on the R package Qtools, a collection of models and tools for quantile inference. These include commands for
Let $Y$ be a random variable with cumulative distribution function (CDF) $F_{Y}$ and support $S_{Y}$. The CDF calculated at $y \in S_{Y}$ returns the probability $F_{Y}(y) \equiv p = \Pr\left(Y \leq y\right)$. The quantile function (QF) is defined as $Q(p) = \inf_{y}{F_{Y}(y) \geq p}$, for $0 < p < 1$. (Some authors consider $0 \leq p \leq 1$. For practical purposes, it is simpler to exclude the endpoints 0 and 1.) When $F_{Y}$ is continuous and strictly monotone (hence, $f_{Y}(y) \equiv F_{Y}'(y) > 0$ for all $y \in S_{Y}$), the quantile function is simply the inverse of $F_{Y}$. In other cases, the quantile $p$ is defined, by convention, as the smallest value $y$ such that $F_{Y}(y)$ is at least $p$.
Quantiles enjoy a number of properties. An excellent overview is given by @gilchirst_2000. In particular, the Q-tranformation rule [@gilchirst_2000] or equivariance to monotone transformations states that if $h(\cdot)$ is a non-decreasing function on $\mathbb{R}$, then $Q_{h(Y)}(p) = h\left{Q_{Y}(p)\right}$. Hence $Q_{Y}(p) = h^{-1}\left{Q_{h(Y)}(p)\right}$. Note that this property does not generally hold for the expected value.
Sample quantiles for a random variable $Y$ can be calculated in a number of ways, depending on how they are defined [@hyndman_fan_1996]. For example, the function quantile in the base package stats provides nine different sample quantile estimators, which are based on the sample order statistics or the inverse of the empirical CDF. These estimators are distribution-free as they do not depend on any parametric assumption about $F$ (or $Q$).
Let $Y_{1}, Y_{2}, \ldots, Y_{n}$ be a sample of $n$ independent and identically distributed (iid) observations from the population $F_{Y}$. Let $\xi_p$ denote the $p$th population quantile and $\hat{\xi}p$ the corresponding sample quantile. (The subscripts will be dropped occasionally to ease notation, e.g.\ $F$ will be used in place of $F{Y}$ or $\xi$ in place of $\xi_{p}$.) In the continuous case, it is well known that $\sqrt{n}\left(\hat{\xi}{p} - \xi{p}\right)$ is approximately normal with mean zero and variance \begin{equation}\label{eq:1} \tag{1} \omega^2 = \frac{p(1-p)}{{f_{Y}(\xi_p)}^2}. \end{equation} A more general result is obtained when the $Y_{i}$'s, $i = 1, \ldots, n$, are independent but not identically distributed (nid). The density evaluated at the $p$th quantile, $f(\xi_p)$, is called density-quantile function by @parzen_1979. Its reciprocal, $s(p) \equiv 1/f(\xi_p)$, is called sparsity function [@tukey] or quantile-density function [@parzen_1979].
As mentioned previously, the discontinuities of $F_{Y}$ when $Y$ is discrete represent a mathematical inconvenience. @ma_etal_2011 derived the asymptotic distribution of the sample mid-quantiles, that is, the sample quantiles based on the mid-distribution function (mid-CDF). The latter is defined as $F^{mid}{Y}(y) = F{Y}(y) - 0.5p_{Y}(y)$, where $p_{Y}(y)$ denotes the probability mass function [@parzen_2004]. In particular, they showed that, as $n$ becomes large, $\sqrt{n}\left(\hat{\xi}^{mid}{p} - \xi{p}\right)$ is approximately normal with mean $0$. Under iid assumptions, the expression for the sampling variance is similar to that in (1); see @ma_etal_2011 for details.
The package Qtools provides the functions midecdf and midquantile, which return objects of class "midecdf" or "midquantile", respectively, containing: the values or the probabilities at which mid-cumulative probabilities or mid-quantiles are calculated ($x$), the mid-cumulative probabilities or the mid-quantiles ($y$), and the functions that linearly interpolate those coordinates ($fn$). An example is shown below using data simulated from a Poisson distribution.
> library("Qtools") > set.seed(467) > y <- rpois(1000, 4) > pmid <- midecdf(y) > xmid <- midquantile(y, probs = pmid$y) > pmid Empirical mid-ECDF Call: midecdf(x = y) > xmid Empirical mid-ECDF Call: midquantile(x = y, probs = pmid$y)
A confidence interval for sample mid-quantiles can be obtained using confint.midquantile. This function is applied to the output of midquantile and returns an object of class "data.frame" containing sample mid-quantiles, lower and upper bounds of the confidence intervals of a given level ($95\%$ by default), along with standard errors as an attribute named 'stderr'. This is shown below using the sample $y$ generated in the previous example.
> xmid <- midquantile(y, probs = 1:3/4) > x <- confint(xmid, level = 0.95) > x midquantile lower upper 25% 2.540000 2.416462 2.663538 50% 3.822816 3.693724 3.951907 75% 5.254902 5.072858 5.436946 > attr(x, "stderr") [1] 0.06295447 0.06578432 0.09276875
> par(mfrow = c(1,2)) > plot(pmid, xlab = "y", ylab = "CDF", jumps = TRUE) > points(pmid$x, pmid$y, pch = 15) > plot(xmid, xlab = "p", ylab = "Quantile", jumps = TRUE) > points(xmid$x, xmid$y, pch = 15)
library(Qtools) cx.lab <- .75 cx.ax <- .75 lw <- 1 cx.p <- 1 set.seed(467) y <- rpois(1000, 4) pmid <- midecdf(y); xmid <- midquantile(y, probs = pmid$y) par(mfrow = c(1,2), mar = c(4.1, 4.1, 2.1, 2.1)) plot(pmid, cex.main = cx.lab, cex.axis = cx.ax, cex.lab = cx.lab, xlab = "y", ylab = "CDF", main = "", cex.points = cx.p, jumps = TRUE) points(pmid$x, pmid$y, pch = 15, cex = cx.p, col = 1) mtext("(a)", 3, line = 0.5, cex = cx.lab) plot(xmid, cex.main = cx.lab, cex.axis = cx.ax, cex.lab = cx.lab, xlab = "p", ylab = "Quantile", main = "", cex.points = cx.p, jumps = TRUE) points(xmid$x, xmid$y, pch = 15, cex = cx.p, col = 1) mtext("(b)", 3, line = 0.5, cex = cx.lab)
Finally, a plot method is available for both midecdf and midquantile objects. An illustration is given in Figure 1. The mid-distribution and mid-quantile functions are discrete and their values are marked by filled squares. The piecewise linear functions connecting the filled squares represent continuous versions of the CDF and QF which interpolate between the steps of, respectively, the ordinary CDF and quantile functions. Note that the argument jumps is a logical value indicating whether values at jumps should be marked.
Since the cumulative distribution and quantile functions are two sides of the same coin, the location, scale, and shape (LSS) of a distribution can be examined using one or the other. Well-known quantile-based measures of location and scale are the median and inter-quartile range (IQR), respectively. Similarly, there are also a number of quantile-based measures for skewness and kurtosis [@groeneveld_meeden_1984;@groeneveld_1998;@jones_etal_2011].
Define the 'central' portion of the distribution as that delimited by the quantiles $Q(p)$ and $Q(1-p)$, $0 < p < 0.5$, and define the 'tail' portion as that lying outside these quantiles. Let $IPR(p) = Q(1-p) - Q(p)$ denote the inter-quantile range at level $p$. Building on the results by @horn_1983 and @ruppert_1987, @staudte_2014 considered the following identity: \begin{equation}\label{eq:2} \tag{2} \underbrace{\frac{IPR(p)}{IPR(r)}}\text{kurtosis} = \underbrace{\frac{IPR(p)}{IPR(q)}}\text{tail-weight} \, \cdot \, \underbrace{\frac{IPR(q)}{IPR(r)}}_\text{peakedness}, \end{equation} where $0 < p < q < r < 0.5$. These quantile-based measures of shape are sign, location and scale invariant. As compared to moment-based indices, they are also more robust to outliers and easier to interpret [@groeneveld_1998;@jones_etal_2011].
It is easy to verify that a quantile function can be written as \begin{equation}\label{eq:3} \tag{3} Q(p) = \underbrace{Q(0.5)}\text{median}\,\, + \,\, \frac{1}{2}\underbrace{IPR(0.25)}\text{IQR} \, \cdot \, \underbrace{\frac{IPR(p)}{IPR(0.25)}}\text{shape index} \, \cdot \, \bigg(\underbrace{\frac{Q(p) + Q(1-p) - 2Q(0.5)}{IPR(p)}}\text{skewness index} - 1\bigg). \end{equation} This identity establishes a relationship between the location (median), scale (IQR) and shape of a distribution. (This identity appears in @gilchirst_2000 [p.74] with an error of sign. See also @benjamini_krieger_1996 [eq.1].) The quantity $IPR(p)/IPR(0.25)$ in (3) is loosely defined as the 'shape index' [@gilchirst_2000, p.72], although it can be seen as the tail-weight measure given in (2) when $p < 0.25$. For symmetric distributions, the contribution of the skewness index vanishes. Note that the skewness index not only is location and scale invariant, but is also bounded between $-1$ and $1$ (as opposed to the Pearson's third standardised moment which can be infinite or even undefined). When this index is near the bounds $-1$ or $1$, then $Q(1-p) \approx Q(0.5)$ or $Q(p) \approx Q(0.5)$, respectively.
The function qlss provides a quantile-based LSS summary with the indices defined in (3) of either a theoretical or an empirical distribution. It returns an object of class "qlss", which is a list containing measures of location (median), scale (IQR and IPR), and shape (skewness and shape indices) for each of the probabilities specified in the argument probs (by default, probs = 0.1). The quantile-based LSS summary of the normal distribution is given in the example below for $p =0.1$. The argument fun can take any quantile function whose probability argument is named \samp{p} (this is the case for many standard quantile functions in R, e.g. qt, qchisq, qf, etc.).
> qlss(fun = "qnorm", probs = 0.1) call: qlss.default(fun = "qnorm", probs = 0.1) Unconditional Quantile-Based Location, Scale, and Shape ** Location ** Median [1] 0 ** Scale ** Inter-quartile range (IQR) [1] 1.34898 Inter-quantile range (IPR) 0.1 2.563103 ** Shape ** Skewness index 0.1 0 Shape index 0.1 1.900031
An empirical example is now illustrated using the faithful data set, which contains $272$ observations on waiting time (minutes) between eruptions and the duration (minutes) of the eruption for the Old Faithful geyser in Yellowstone National Park, Wyoming, USA. Summary statistics are given in Table 1.
Table 1: Minimum, maximum and three quartiles (Q1, Q2, Q3) for waiting time and duration in the Old Faithful Geyser data set.Minimum | Q1 | Q2 | Q3 | Maximum | |
Wating time | 43.0 | 58.0 | 76.0 | 82.0 | 96.0 |
Duration | 1.6 | 2.2 | 4.0 | 4.5 | 5.1 |
library(Qtools) gg_color_hue <- function(n) { hues = seq(15, 375, length=n+1) hcl(h=hues, l=65, c=100)[1:n] } y <- faithful$waiting xl <- c(0,1) yl <- c(40,100) cx.lab <- .75 cx.ax <- .75 lw <- 1 cx.p <- 1 cc <- gg_color_hue(2) p <- 1:19/20 xb <- "Waiting time (min)" xmid <- midquantile(y, probs = p) par(mfrow = c(1,2), mar = c(4.1, 4.1, 2.1, 2.1)) plot(density(y), main = "", xlab = xb, cex.lab = cx.lab, cex.axis = cx.ax, lwd = lw) mtext("(a)", 3, line = 0.5, cex = cx.lab) plot(xmid, xlim = xl, ylim = yl, main = "", xlab = "Probability p", ylab = xb, cex.lab = cx.lab, cex.axis = cx.ax, cex = cx.p, pch = 15, lwd = lw, jumps = FALSE) mtext("(b)", 3, line = 0.5, cex = cx.lab)
Suppose the interest is in describing the distribution of waiting times. The density is plotted in Figure 2, along with the mid-quantile function. The distribution is bimodal with peaks at around $54$ and $80$ minutes. Note that the arguments of the base function quantile, including the argument type, can be passed on to qlss.
> y <- faithful$waiting > par(mfrow = c(1,2)) > plot(density(y)) > plot(midquantile(y, probs = p), jumps = FALSE) > qlss(y, probs = c(0.05, 0.1, 0.25), type = 7) call: qlss.numeric(x = y, probs = c(0.05, 0.1, 0.25), type = 7) Unconditional Quantile-Based Location, Scale, and Shape ** Location ** Median [1] 76 ** Scale ** Inter-quartile range (IQR) [1] 24 Inter-quantile range (IPR) 0.05 0.1 0.25 41 35 24 ** Shape ** Skewness index 0.05 0.1 0.25 -0.3658537 -0.4285714 -0.5000000 Shape index 0.05 0.1 0.25 1.708333 1.458333 1.000000
At $p = 0.1$, the skewness index is approximately $-0.43$, which denotes a rather strong left asymmetry. As for the shape index, which is equal to $1.46$, one could say that the tails of this distribution weigh less than those of a normal distribution ($1.90$), though of course a comparison between unimodal and bimodal distributions is not meaningful.
In general, the $p$th linear QR model is of the form \begin{equation}\label{eq:4} \tag{4} Q_{Y|X}(p) = \mathbf{x}^{\top}\boldsymbol \beta(p) \end{equation} where $\mathbf{x}$ is a $k$-dimensional vector of covariates (including $1$ as first element) and $\boldsymbol \beta(p) = [\beta_{0}(p), \beta_{1}(p),$ $\ldots, \beta_{k-1}(p)]^{\top}$ is a vector of coefficients. The 'slopes' $\beta_{j}(p)$, $j = 1,\ldots, k-1$, have the usual interpretation of partial derivatives. For example, in case of the simple model $Q_{Y|X}(p) = \beta_{0}(p) + \beta_{1}(p)x$, one obtains [ \frac{\partial Q_{Y|X}(p)}{\partial x} = \beta_{1}(p).\ ] If $x$ is a dummy variable, then $\beta_{1}(p) = Q_{Y|X = 1}(p) - Q_{Y|X=0}(p)$, i.e.\ the so-called 'quantile treatment effect' [@doksum;@lehmann;@koenker_xiao]. Estimation can be carried out using LP algorithms which, given a sample $(\mathbf{x}{i},y{i})$, $i=1,\dots,n$, solve [ \min_{b \in \mathbb{R}^{k}} \sum_{i=1}^{n} \kappa_{p}\left(y_{i} - \mathbf{x}{i}^{\top}\mathbf{b}\right), ] where $\kappa{p}(u) = u(p - I(u < 0))$, $0 < p < 1$, is the check loss function. Large-$n$ approximation of standard errors can be obtained from the sampling distribution of the linear quantile estimators [@koenker_bassett_1978].
library(Qtools) gg_color_hue <- function(n) { hues = seq(15, 375, length=n+1) hcl(h=hues, l=65, c=100)[1:n] } cutoff <- 3 x <- faithful$eruptions y <- faithful$waiting xl <- c(0,1) yl <- c(40,100) cx.lab <- .75 cx.ax <- .75 lw <- 1 cx.p <- 1 cc <- gg_color_hue(2) p <- 1:19/20 f <- faithful$eruptions >= cutoff xb <- "Waiting time (min)" pmid0 <- midecdf(y[!f]) pmid1 <- midecdf(y[f]) xmid <- midquantile(y, probs = pmid$y) cc <- gg_color_hue(2) par(mfrow = c(1,2), mar = c(4.1, 4.1, 2.1, 2.1)) plot(waiting ~ eruptions, faithful, pch = 1, cex.lab = cx.lab, cex.axis = cx.ax, cex = cx.p, ylab = xb, xlab = "Duration (min)") mtext("(a)", 3, line = 0.5, cex = cx.lab) points(x[!f], y[!f], col = cc[1], cex = cx.p) points(x[f], y[f], col = cc[2], cex = cx.p) abline(v = cutoff, lty = 2, col = grey(5/10)) plot(pmid0, xlim = yl, ylim = xl, xlab = xb, ylab = "CDF", main = "", cex.lab = cx.lab, cex.axis = cx.ax, cex = cx.p, col.midline = cc[1], lty.midline = 1, lwd = lw) plot(pmid1, xlim = yl, ylim = xl, xlab = xb, ylab = "CDF", cex.lab = cx.lab, cex.axis = cx.ax, cex = cx.p, col.midline = cc[2], lwd = lw, add = TRUE) mtext("(b)", 3, line = 0.5, cex = cx.lab)
Waiting times between eruptions are plotted against the durations of the eruptions in Figure 3. Two clusters of observations can be defined for durations below and above 3 minutes [see also @azzalini_bowman_1990]. The distribution shows a strong bimodality as already illustrated in Figure 2. A dummy variable for durations equal to or longer than $3$ minutes is created to define the two distributions and included as covariate $X$ in a model as the one specified in (4). The latter is then fitted to the Old Faithful Geyser data using the function rq in the package quantreg for $p \in {0.1, 0.25, 0.5, 0.75, 0.9}$.
> require("quantreg") > y <- faithful$waiting > x <- as.numeric(faithful$eruptions >= 3) > fit <- rq(formula = y ~ x, tau = c(0.1, 0.25, 0.5, 0.75, 0.9)) > fit Call: rq(formula = y ~ x, tau = c(0.1, 0.25, 0.5, 0.75, 0.9)) Coefficients: tau= 0.10 tau= 0.25 tau= 0.50 tau= 0.75 tau= 0.90 (Intercept) 47 50 54 59 63 x 26 26 26 25 25 Degrees of freedom: 272 total; 270 residual
From the output above, it is quite evident that the distribution of waiting times is shifted by an approximately constant amount at all considered values of $p$. The location-shift hypothesis can be tested by using the Khmaladze test. The null hypothesis is that two distributions, say $F_{0}$ and $F_{1}$, differ by a pure location shift [@koenker_xiao], that is [ H_{0}: \, F^{-1}{1}(p) = F^{-1}{0}(p) + \delta_{0}, ] where $\delta_{0}$ is the quantile treatment effect, constant over $p$. The location--scale-shift specification of the test considers [ H_{0}: \, F^{-1}{1}(p) = \delta{1}F^{-1}{0}(p) + \delta{0}. ] The alternative hypothesis is that the model is more complex than the one specified in the null hypothesis. The Khmaladze test is implemented in quantreg (see ?quantreg::KhmaladzeTest for further details). The critical values of the test and corresponding significance levels [@koenker_2005] are not readily available in the same package. These have been hardcoded in the Qtools function KhmaladzeFormat which can be applied to "KhmaladzeTest" objects. For the Old Faithful Geyser data, the result of the test is not statistically significant at the $10\%$ level.
> kt <- KhmaladzeTest(formula = y ~ x, taus = seq(.05, .95, by = .01), > KhmaladzeFormat(kt, 0.05) Khmaladze test for the location-shift hypothesis Joint test is not significant at 10% level Test(s) for individual slopes: not significant at 10% level
Distribution-free quantile regression does not require introducing an assumption on the functional form of the error distribution [@koenker_bassett_1978], but only weaker quantile restrictions [@powell_1994]. Comparatively, the linear specification of the conditional quantile function in Equation 4 is a much stronger assumption and thus plays an important role for inferential purposes.
The problem of assessing the goodness of fit (GOF) is rather neglected in applications of QR. Although some approaches to GOF have been proposed [@zheng_1998;@koenker_machado_1999;@he_zhu_2003;@khmaladze_etal_2004], there is currently a shortage of software code available to users. The function GOFTest implements a test based on the cusum process of the gradient vector [@he_zhu_2003]. Briefly, the test statistic is given by the largest eigenvalue of [ n^{-1}\sum_{i}^{n} \mathbf{R}{n}(\mathbf{x}{i})\mathbf{R}^{\top}{n}(\mathbf{x}{i}) ] where $\mathbf{R}{n}(\mathbf{t}) = n^{-1/2} \sum{j=1}^{n} \psi_{p}(r_{j})\mathbf{x}{j} I(\mathbf{x}{j} \leq \mathbf{t})$ is the residual cusum (RC) process and $\psi_{p}(r_{j})$ is the derivative of the loss function $\kappa_{p}$ calculated for residual $r_{j} = y_{j} - \mathbf{x}_{j}^{\top}\boldsymbol \beta(p)$. The sampling distribution of this test statistic is non-normal [@he_zhu_2003] and a resampling approach is used to obtain the $p$-value under the null hypothesis.
An example is provided further below using the New York Air Quality data set, which contains $111$ complete observations on daily mean ozone (parts per billion -- ppb) and solar radiation (Langleys -- Ly). For simplicity, wind speed and maximum daily temperature, also included in the data set, are not analysed here.
Suppose that the model of interest is \begin{equation}\label{eq:5} \tag{5} Q_{\text{ozone}}(p) = \beta_{0}(p) + \beta_{1}(p) \cdot \text{Solar.R}. \end{equation} Three conditional quantiles ($p \in {0.1,0.5,0.9}$) are estimated and plotted using the following code:
> dd <- airquality[complete.cases(airquality), ] > dd <- dd[order(dd$Solar.R), ] > fit.rq <- rq(Ozone ~ Solar.R, tau = c(.1, .5, .9), data = dd) > x <- seq(min(dd$Solar.R), max(dd$Solar.R), length = 200) > yhat <- predict(fit.rq, newdata = data.frame(Solar.R = x)) > plot(Ozone ~ Solar.R, data = dd) > apply(yhat, 2, function(y,x) lines(x,y), x = x)
library(Qtools) library(quantreg) cx.lab <- .75 cx.ax <- .75 lw <- 1 cx.p <- 1 dd <- airquality[complete.cases(airquality),] dd <- dd[order(dd$Solar.R),] x <- seq(min(dd$Solar.R), max(dd$Solar.R), length = 200) yhat <- predict(rq(Ozone ~ Solar.R , tau = c(.1,.5,.9), data = dd), newdata = data.frame(Solar.R = x)) par(mar = c(4.1, 4.1, 2.1, 2.1)) plot(Ozone ~ Solar.R, data = dd, xlab = "Solar radiation (lang)", ylab = "Ozone (ppb)", cex.lab = cx.lab, cex.axis = cx.ax, cex = cx.p) for(i in 1:3){ lines(x[-1], yhat[-1,i], lty = c(1,2,4)[i], lwd = lw) }
As a function of solar radiation, the median of the ozone daily averages increases by $0.09$ ppb for each Ly increase in solar radiation (Figure 4). The 90th centile of conditional ozone shows a steeper slope at $0.39$ ppb/Ly, about nine times larger than the slope of the conditional $10$th centile at $0.04$ ppb/Ly.
The RC test applied to the the object fit.rq provides evidence of lack of fit for all quantiles considered, particularly for $p = 0.1$ and $p = 0.5$. Therefore the straight-line model in Equation 5 for these three conditional quantiles does not seem to be appropriate. The New York Air Quality data set will be analysed again in the next section, where a transformation-based approach to nonlinear modelling is discussed.
> gof.rq <- GOFTest(fit.rq, alpha = 0.05, B = 1000, seed = 987) > gof.rq Goodness-of-fit test for quantile regression based on the cusum process Quantile 0.1: Test statistic = 0.1057; p-value = 0.001 Quantile 0.5: Test statistic = 0.2191; p-value = 0 Quantile 0.9: Test statistic = 0.0457; p-value = 0.018
Complex dynamics may result in nonlinear effects in the relationship between the covariates and the response variable. For instance, in kinesiology, pharmacokinetics, and enzyme kinetics, the study of the dynamics of an agent in a system involves the estimation of nonlinear models; phenomena like human growth, certain disease mechanisms and the effects of harmful environmental substances such as lead and mercury, may show strong nonlinearities over time. In this section, the linear model is abandoned in favor of a more general model of the type \begin{equation}\label{eq:6} \tag{6} Q_{Y|X}(p) = g\left{\mathbf{x}^{\top}\boldsymbol \beta(p)\right}, \end{equation} for some real-valued function $g$. If $g$ is nonlinear, the alternative approaches to conditional quantile modelling are
The focus of this section is on the third approach. More specifically the functions available in Qtools refer to the methods for transformation-based QR models developed by @powell_1991, @chamberlain_1994, @mu_he, @dehbi_etal and @geraci_jones_2015. Examples of approaches to nonlinear QR based on parametric models or splines can be found in @koenker_park_1996 and @yu_jones, respectively.
The goal of transformation-based QR is to fit the model \begin{equation}\label{eq:7} \tag{7} Q_{h\left(Y;\lambda_{p}\right)}(p) = \mathbf{x}^{\top}\boldsymbol \beta(p). \end{equation} The assumption is that the transformation $h$ is the inverse of $g$, $h(Y; \lambda_{p}) \equiv g^{-1}(Y)$, so that the $p$th quantile function of the transformed response variable is linear. (In practice, it is satisfactory to achieve approximate linearity.) The parameter $\lambda_{p}$ is a low-dimensional parameter that gives some flexibility to the shape of the transformation and is estimated from the data. In general, the interest is on predicting $Q_{Y|X}(p)$ and estimating the effects of the covariates on $Q_{Y|X}(p)$. If $h$ is a non-decreasing function on $\mathbb{R}$ (as is the case for all transformations considered here), predictions can be easily obtained from (7) by virtue of the equivariance property of quantiles, \begin{equation}\label{eq:8} \tag{8} Q_{Y|X}(p) = h^{-1}\left{\mathbf{x}^{\top}\boldsymbol \beta(p); \lambda_{p}\right}. \end{equation} The marginal effect of the $j$th covariate $x_{j}$ can be obtained by differentiating the quantile function $Q_{Y|X}(p)$ with respect to $x_{j}$. This can be written as the derivative of the composition $Q \circ \eta$, i.e.\ \begin{equation}\label{eq:9} \tag{9} \frac{\partial Q(p)}{\partial x_{j}} = \frac{\partial Q(p)}{\partial \eta(p)} \cdot \frac{\partial \eta(p)}{\partial x_{j}}, \end{equation} $\eta(p) = \mathbf{x}^{\top}\boldsymbol \beta(p)$. Once the estimates $\hat{\boldsymbol{\beta}}(p)$ and $\hat{\lambda}_{p}$ are obtained, these can be plugged in Equations 8 and 9.
The package Qtools provides several transformation families, namely the Box--Cox [@box_cox], Aranda-Ordaz [@aranda_1981], and Jones [@jones_2007;@geraci_jones_2015] transformations. A distinction between these families is made in terms of the support of the response variable to which the transformation is applied and the number of transformation parameters. The Box--Cox model is a one-parameter family of transformations which applies to singly bounded variables, $y > 0$. The Aranda-Ordaz symmetric and asymmetric transformations too have one parameter and are used when responses are bounded on the unit interval, $0 < y < 1$ (doubly bounded). @geraci_jones_2015 developed two families of transformations which can be applied to either singly or doubly bounded responses:
Originally, @box_cox proposed using power transformations to address lack of linearity, homoscedasticity and normality of the residuals in mean regression modelling. \citeauthor{sakia} (\citeyear[][p.175]{sakia}) reported that ``seldom does this transformation fulfil the basic assumptions of linearity, normality and homoscedasticity simultaneously as originally suggested by Box \& Cox (1964). The Box-Cox transformation has found more practical utility in the empirical determination of functional relationships in a variety of fields, especially in econometrics''.
Indeed, the practical utility of power transformations has been long recognised in QR modelling [@powell_1991;@buchinsky_1995;@chamberlain_1994;@mu_he]. Model 7 is the Box--Cox QR model if \begin{equation}\label{eq:10} \tag{10} h_{BC}\left(Y;\lambda_{p}\right) = \begin{cases} \dfrac{Y^{\lambda_{p}} - 1}{\lambda_{p}} & \text{if $\lambda_{p} \neq 0$}\[.5cm] \log Y & \text{if $\lambda_{p} = 0$}. \end{cases} \end{equation} Note that when $\lambda_{p} \neq 0$, the range of this transformation is not $\mathbb{R}$ but the singly bounded interval $(-1/\lambda_{p},\infty)$. This implies that the inversion in (8) is defined only for $\lambda_{p} \mathbf{x}^{\top}\boldsymbol \beta(p) + 1 > 0$.
The 'symmetric' Aranda-Ordaz transformation is given by \begin{equation}\label{eq:11} \tag{11} h_{AOs}\left(Y;\lambda_{p}\right) = \begin{cases} \dfrac{2}{\lambda_{p}} \quad \dfrac{Y^{\lambda_{p}} - \left(1-Y\right)^{\lambda_{p}}}{Y^{\lambda_{p}} + \left(1-Y\right)^{\lambda_{p}}}& \text{if $\lambda_{p} \neq 0$},\[.5cm] \log\left(\dfrac{Y}{1-Y}\right) & \text{if $\lambda_{p} = 0$}. \end{cases} \end{equation} (The symmetry here is that $h_{AOs}(\theta;\lambda_p) = -h_{AOs}(1-\theta;\lambda_p)=h_{AOs}(\theta;-\lambda_p)$.) There is a range problem with this transformation too since, for all $\lambda_p \neq 0$, the range of $h_{AOs}$ is not $\mathbb{R}$, but $(-2/|\lambda_{p}|, 2/|\lambda_{p}|)$. The 'asymmetric' Aranda-Ordaz transformation is given by \begin{equation}\label{eq:12} \tag{12} h_{AOa}\left(Y;\lambda_{p}\right) = \begin{cases} \log \left{\dfrac{\left(1-Y\right)^{-\lambda_{p}} - 1}{\lambda_{p}} \right}& \text{if $\lambda_{p} \neq 0$},\[.5cm] \log \left{-\log\left(1 - Y\right)\right} & \text{if $\lambda_{p} = 0$}. \end{cases} \end{equation} For $\lambda_{p} = 0$, this is equivalent to the complementary log-log. The asymmetric Aranda-Ordaz transformation does have range $\mathbb{R}$. Note that $h_{AOa}(Y;1) = \log (Y/(1-Y))$, i.e.\ the transformation is symmetric.
To overcome range problems, which give rise to computational difficulties, @geraci_jones_2015 proposed to use instead one-parameter transformations with range $\mathbb{R}$. Proposal I is written in terms of the variable (say) $W$, where \begin{equation}\label{eq:13} \tag{13} h_I\left(W;\lambda_{p}\right) = \begin{cases} \dfrac{1}{2\lambda_{p}}\left(W^{\lambda_{p}} - \dfrac{1}{W^{\lambda_{p}}}\right)& \text{if $\lambda_{p} \neq 0$}\[.5cm] \log W & \text{if $\lambda_{p} = 0$}, \end{cases} \end{equation} which takes on four forms depending on the relationship of $W$ to $Y$, as described in Table 2. For each of domains $(0,\infty)$ and $(0,1)$, there are symmetric and asymmetric forms.
Table 2: Choices of $W$ and corresponding notation for transformations based on (13).Support of $Y$ | Symmetric | Asymmetric |
$(0,\infty)$ | $W= Y$ | $W= \log(1+Y)$ |
$h_{Is}(Y;\lambda_p)$ | $h_{Ia}(Y;\lambda_p)$ | |
$(0,1)$ | $W= Y/(1-Y)$ | $W= -\log(1-Y)$ |
$h_{Is}(Y;\lambda_p)$ | $h_{Ia}(Y;\lambda_p)$ |
Since the transformation in (13) has range $\mathbb{R}$ for all $\lambda_{p}$, it admits an explicit inverse transformation. In addition, in the case of a single covariate, every estimated quantile that results will be monotone increasing, decreasing or constant, although different estimated quantiles can have different shapes from this collection. @geraci_jones_2015 also proposed a transformation that unifies the symmetric and asymmetric versions of $h_{I}$ into a single two-parameter transformation, namely \begin{equation}\label{eq:14} \tag{14} h_{II}(W;\lambda_p) = h_I(W_{\delta_p};\lambda_p), \end{equation} where $h_I$ is given in (13) and \begin{equation} W_{\delta_{p}} = h_{BC}(1+W;\delta_p) = \begin{cases} \dfrac{(1+W)^{\delta_{p}} - 1}{\delta_{p}} & \text{if $\delta_{p} > 0$}\[.5cm] \log (1+W) & \text{if $\delta_{p} = 0$}, \end{cases} \end{equation} with $W=Y$, if $Y > 0$, and $W=Y/(1-Y)$, if $Y \in (0,1)$. The additional parameter $\delta_{p}$ controls the asymmetry: symmetric forms of $h_{I}$ correspond to $\delta_p = 1$ while asymmetric forms of $h_{I}$ to $\delta_p = 0$.
All transformation models discussed above can be fitted using a two-stage (TS) estimator [@chamberlain_1994;@buchinsky_1995] whereby $\boldsymbol \beta(p)$ is estimated conditionally on a fine grid of values for the transformation parameter(s). Alternatively, point estimation can be approached using the RC process [@mu_he], which is akin to the process that leads to the RC test introduced in the previous section. The RC estimator avoids the troublesome inversion of the Box-Cox and Aranda-Ordaz transformations, but it is computationally more intensive than the TS estimator.
There are several methods for interval estimation, including those based on large-$n$ approximations and the ubiquitous bootstrap. Both the TS and RC estimators have an asymptotic normal distribution. The large-sample properties of the TS estimator for monotonic quantile regression models have been studied by @powell_1991 (see also @chamberlain_1994;@machado_mata). Under regularity conditions, it can be shown that the TS estimator is unbiased and will converge to a normal distribution with a sandwich-type limiting covariance matrix which is easy to calculate. In contrast, the form of the covariance matrix of the sampling distribution for the RC estimator is rather complicated and its estimation requires resampling [@mu_he]. Finally, if the transformation parameter is assumed to be known, then conditional inference is apposite. In this case, the estimation procedures simplify to those for standard quantile regression problems.
In Qtools, model fitting for one-parameter transformation models can be carried out using the function tsrq. The formula argument specifies the model for the linear predictor as in (7), while the argument tsf provides the desired transformation $h$ as specified in Equations 10-13: bc for the Box--Cox model, ao for Aranda-Ordaz families, and mcjI for proposal I transformations. Additional arguments in the function tsrq include
There are other functions to fit transformation models. The function rcrq fits one-parameter transformation models using the RC estimator. The functions tsrq2 and nlrq2 are specific to Geraci and Jones's (2015) Proposal II transformations. The former employs a two-way grid search while the latter is based on Nelder-Mead optimization as implemented in optim. Simulation studies in @geraci_jones_2015 suggest that, although computationally slower, a two grid search is numerically more stable than the derivative-free approach.
A summary of the basic differences between all fitting functions is given in Table 3. The table also shows the available methods in summary.rqt to estimate standard errors and confidence intervals for the model's parameters. Unconditional inference is carried out jointly on $\boldsymbol \beta(p)$ and the transformation parameter by means of bootstrap using the package boot [@boot;@davison_hinkley]. Large-$n$ approximations [@powell_1991;@chamberlain_1994;@machado_mata] are also available for the one-parameter TS estimator under iid or nid assumptions.
When summary.rqt is executed with the argument conditional = TRUE, confidence interval estimation for $\boldsymbol \beta_{p}$ is performed with one of the several methods developed for linear quantile regression estimators \citep[][p.110]{koenker_2005} (see options rank, iid, nid, ker, and boot in quantreg::summary.rq).
Function name | Transformation parameters | Estimation | Standard errors or confidence intervals | ||||||
Unconditional | Conditional | ||||||||
tsrq | 1 | Two-stage | iid, nid, boot | All types | |||||
rcrq | 1 | Residual cusum process | boot | All types | |||||
tsrq2 | 2 | Two-stage | boot | All types | |||||
nlrq2 | 2 | Nelder--Mead | boot | -- | Table 3: Transformation-based quantile regression in package **Qtools**. 'All types' consists of options rank, iid, nid, ker, and boot as provided by function summary in package **quantreg**. |
In the New York Air Quality data example, a linear model was found unsuitable to describe the relationship between ozone and solar radiation. At closer inspection, Figure 4 reveals that the conditional distribution of ozone may in fact be nonlinearly associated with solar radiation, at least for some of the conditional quantiles. The model \begin{equation}\label{eq:15} \tag{15} Q_{h_{Is}{\text{ozone}}}(p) = \beta_{0}(p) + \beta_{1}(p) \cdot \text{Solar.R}, \end{equation} where $h_{Is}$ denotes the symmetric version of (13) for a singly bounded response variable, is fitted for the quantiles $p \in {0.1,0.5,0.9}$ using the following code:
> system.time(fit.rqt <- tsrq(Ozone ~ Solar.R, data = dd, tsf = "mcjI", + symm = TRUE, dbounded = FALSE, lambda = seq(1, 3, by = 0.005), + conditional = FALSE, tau = c(.1, .5, .9))) user system elapsed 0.5 0.0 0.5 > fit.rqt call: tsrq(formula = Ozone ~ Solar.R, data = dd, tsf = "mcjI", symm = TRUE, dbounded = FALSE, lambda = seq(1, 3, by = 0.005), conditional = FALSE, tau = c(0.1, 0.5, 0.9)) Proposal I symmetric transformation (singly bounded response) Optimal transformation parameter: tau = 0.1 tau = 0.5 tau = 0.9 2.210 2.475 1.500 Coefficients linear model (transformed scale): tau = 0.1 tau = 0.5 tau = 0.9 (Intercept) -3.3357578 -48.737341 16.557327 Solar.R 0.4169697 6.092168 1.443407 Degrees of freedom: 111 total; 109 residual
The TS estimator makes a search for $\lambda_{p}$ over the grid $1.000, 1.005, \ldots, 2.995, 3.000$. The choice of the search interval usually results from a compromise between accuracy and performance: the coarser the grid, the faster the computation but the less accurate the estimate. A reasonable approach would be to start with a coarse, wide-ranging grid (e.g. seq(-5, 5, by = 0.5)), then center the interval about the resulting estimate using a finer grid, and re-fit the model.
The output above reports the estimates $\boldsymbol{\hat{\beta}}(p)$ and $\hat{\lambda}_p$ for each quantile level specified in tau. Here, the quantities of interest are the predictions on the ozone scale and the marginal effect of solar radiation, which can obtained using the function predict.rqt.
> x <- seq(9, 334, length = 200) > qhat <- predict(fit.rqt, newdata = data.frame(Solar.R = x), + type = "response") > dqhat <- predict(fit.rqt, newdata = data.frame(Solar.R = x), + type = "maref", namevec = "Solar.R") The linear component of the marginal effect is calculated as derivative of Ozone ~ beta1 * Solar.R with respect to Solar.R
The calculations above are based on a sequence of 200 ozone values in the interval $[9,334]$ Ly, as provided via the argument newdata (if this argument is missing, the function returns the fitted values). There are three types of predictions available:
In the latter case, the argument namevec is used to specify the name of the covariate with respect to which the marginal effect has to be calculated. The function maref.rqt computes derivatives symbolically using the stats function deriv and these are subsequently evaluated numerically. While the nonlinear component of the marginal effect in Equation 9 (i.e.\ $\partial Q(p)/\partial \eta(p)$) is rather straightforward to derive for any of the transformations (10)-(13), the derivative of the linear predictor (i.e.\ $\partial \eta(p)/\partial x_{j}$) requires parsing the formula argument in order to obtain an expression suitable for deriv. The function maref.rqt can handle simple expressions with common functions like log, exp, etc., interaction terms, and `as is' terms (i.e. I()). However, using functions that are not recognised by deriv will trigger an error.
library(Qtools) cx.lab <- .75 cx.ax <- .75 lw <- 1 cx.p <- 1 dd <- airquality[complete.cases(airquality),] dd <- dd[order(dd$Solar.R),] x <- seq(9, 334, length = 200) set.seed(567) par(mfrow = c(1,2), mar = c(4.1, 4.1, 2.1, 2.1)) # mcjI fit.rqt <- tsrq(Ozone ~ Solar.R, tsf = "mcjI", lambda = seq(1,3,by=0.005), tau = c(.1,.5,.9), data = dd, symm = TRUE) qhat <- predict(fit.rqt, newdata = data.frame(Solar.R = x), type = "response") plot(Ozone ~ Solar.R, data = dd, xlab = "Solar radiation (Ly)", ylab = "Ozone (ppb)", cex.lab = cx.lab, cex.axis = cx.ax, cex = cx.p, col = "gray70") for(i in 1:3){lines(x, qhat[,i], lty = c(1,2,4)[i], lwd = lw)} mtext("(a)", 3, line = 0.5, cex = cx.lab) # marginal dqhat <- predict(fit.rqt, newdata = data.frame(Solar.R = x), type = "maref", namevec = "Solar.R") plot(range(x), range(dqhat), type = "n", xlab = "Solar radiation (Ly)", ylab = "Marginal effect", main = "", cex.lab = cx.lab, cex.axis = cx.ax, cex = cx.p) for(i in 1:3){lines(x, dqhat[,i], lty = c(1,2,4)[i], lwd = lw)} mtext("(b)", 3, line = 0.5, cex = cx.lab)
The predicted quantiles of ozone and the marginal effects of solar radiation are plotted in Figure 5 using the following code:
> par(mfrow = c(1, 2)) > plot(Ozone ~ Solar.R, data = dd, xlab = "Solar radiation (lang)", + ylab = "Ozone (ppb)") > for(i in 1:3) lines(x, qhat[ ,i], lty = c(1, 2, 4)[i], lwd = 2) > plot(range(x), range(dqhat), type = "n", xlab = "Solar radiation (lang)", + ylab = "Marginal effect") > for(i in 1:3) lines(x, dqhat[ ,i], lty = c(1, 2, 4)[i], lwd = 2)
The effect of solar radiation on different quantiles of ozone levels shows a nonlinear behavior, especially at lower ranges of radiation (below $50$ Ly) and on the median ozone. It might be worth testing the goodness-of-fit of the model. In the previous analysis, it was found evidence of lack of fit for the linear specification (5). In contrast, the output reported below indicates that, in general, the goodness of fit of the quantile models based on the transformation model (15) has improved since the test statistics are now smaller at all values of $p$. However, such improvement is not yet satisfactory for the median.
> GOFTest(fit.rqt, alpha = 0.05, B = 1000, seed = 416) Goodness-of-fit test for quantile regression based on the cusum process Quantile 0.1: Test statistic = 0.0393; p-value = 0.025 Quantile 0.5: Test statistic = 0.1465; p-value = 0.005 Quantile 0.9: Test statistic = 0.0212; p-value = 0.127
The TS and RC estimators generally provide similar estimates and predictions. However, computation based on the cusum process tends to be somewhat slow, as shown further below. This is also true for the RC test provided by GOFTest.
> system.time(fit.rqt <- rcrq(Ozone ~ Solar.R, data = dd, tsf = "mcjI", + symm = TRUE, dbounded = FALSE, lambda = seq(1, 3, by = 0.005), + tau = c(.1, .5, .9))) user system elapsed 36.88 0.03 37.64
An example using doubly bounded transformations is demonstrated using the A-level Chemistry Scores data set. The latter is available from Qtools and it consists of 31022 observations of A-level scores in Chemistry for England and Wales students, 1997. The data set also includes information of prior academic achievement as assessed with General Certificate of Secondary Education (GCSE) average scores. The goal is to evaluate the ability of GCSE to predict A-level scores. The latter are based on national exams in specific subjects (e.g.\ chemistry) with grades ranging from A to F. For practical purposes, scores are converted numerically as follows: A = 10, B = 8, C = 6, D = 4, E = 2, and F = 0. The response is therefore doubly bounded between 0 ad 10. It should be noted that this variable is discrete, although, for the sake of simplicity, here it is assumed that the underlying process is continuous.
The model considered here is \begin{equation}\label{eq:16} \tag{16} Q_{h_{AOa}{\text{score}}}(p) = \beta_{0}(p) + \beta_{1}(p) \cdot \text{gcse}, \end{equation} where $h_{AOa}$ denotes the asymmetric Aranda-Ordaz transformation in (12). This model is fitted for $p = 0.9$:
> data(Chemistry) > fit.rqt <- tsrq(score ~ gcse, data = Chemistry, tsf = "ao", symm = FALSE, + lambda = seq(0,2,by=0.01), tau = 0.9)
library(Qtools) cx.lab <- .75 cx.ax <- .75 lw <- 1 cx.p <- 1 data(Chemistry) #fit.rqt <- tsrq(score ~ gcse, data = Chemistry, tsf = "ao", symm = FALSE, lambda = seq(0,2,by=0.01), tau = 0.9) x <- seq(0, 8, length = 200) #qhat <- predict(fit.rqt, newdata = data.frame(gcse = x), type = "response") #dqhat <- predict(fit.rqt, newdata = data.frame(gcse = x), type = "maref", namevec = "gcse") par(mfrow = c(1,2), mar = c(4.1, 4.1, 2.1, 2.1)) #with(Chemistry, plot(score ~ gcse, xlab = "GCSE score", ylab = "A-level Chemistry score", cex.lab = cx.lab, cex.axis = cx.ax, cex = cx.p, pch = 1, col = "gray70")) #lines(x, qhat, lty = 1, lwd = lw) #mtext("(a)", 3, line = 0.5, cex = cx.lab) # marginal #plot(x, dqhat, type = "l", xlab = "GCSE score", ylab = "Marginal effect", cex.lab = cx.lab, cex.axis = cx.ax, lwd = lw) #mtext("(b)", 3, line = 0.5, cex = cx.lab)
The predicted $90$th centile of A-level scores conditional on GCSE and the marginal effect of GCSE are plotted in Figure 6. There is clearly a positive, nonlinear association between the two scores. The nonlinearity is partly explained by the floor and ceiling effects which result from the boundedness of the measurement scale. Note, however, that the S-shaped curve is not symmetric about the inflection point. As a consequence, the marginal effect is skewed to the left. Indeed, the estimate $\hat{\lambda}_{0.9} = 0$ and the narrow confidence interval give support to a complementary log-log transformation:
> summary(fit.rqt, conditional = FALSE, se = "nid") call: summary.rqt(object = fit.rqt, se = "nid", conditional = FALSE) Aranda-Ordaz asymmetric transformation (doubly bounded response) Summary for unconditional inference tau = 0.9 Optimal transformation parameter: Value Std. Error Lower bound Upper bound 0.000000000 0.001364422 -0.002674218 0.002674218 Coefficients linear model (transformed scale): Value Std. Error Lower bound Upper bound (Intercept) -4.3520060 0.015414540 -4.3822179 -4.3217941 gcse 0.8978072 0.002917142 0.8920898 0.9035247 Degrees of freedom: 31022 total; 31020 residual
Alternatively, one can estimate the parameter $\delta_{p}$ using a two-parameter transformation:
> coef(tsrq2(score ~ gcse, data = chemsub, dbounded = TRUE, + lambda = seq(0, 2, by = 0.1), delta = seq(0, 2, by = 0.1), + tau = 0.9), all = TRUE) (Intercept) gcse lambda delta -4.1442274 0.8681246 0.0000000 0.0000000
These results confirm the asymmetric nature of the relationship since $\hat{\delta}_{0.9} = 0$. Similar results (not shown) were obtained with nlrq2.
In conclusion, the package Qtools offers several options in terms of transformations and estimation algorithms, the advantages and disadvantages of which are discussed by \cite{geraci_jones_2015}. In particular, they found that the symmetric Proposal I transformation improves considerably on the Box-Cox method and marginally on the Aranda-Ordaz transformation in terms of mean squared error of the predictions. Also, asymmetric transformations do not seem to improve sufficiently often on symmetric transformations to be especially recommendable. However, the Box-Cox and the symmetric Aranda-Ordaz transformations should not be used when individual out-of-range predictions represent a potential inconvenience as, for example, in multiple imputation (see section further below). Finally, in some situations transformation-based quantile regression may be competitive as compared to methods based on smoothing, as demonstrated by a recent application to anthropometric charts [@boghossian_etal].
Quantile-based measures of location, scale, and shape can be assessed conditionally on covariates. A simple approach is to a fit a linear model as in (4) or a transformation-based model as in (7), and then predict $\hat{Q}_{Y|X}(p)$ to obtain the conditional LSS measures in Equation 3 for specific values of $\mathbf{x}$.
Estimation of conditional LSS can be carried out by using the function qlss.formula. The conditional model is specified in the argument formula, while the probability $p$ is given in probs. (As seen in Equation 3, the other probabilities of interest to obtain the decomposition of the conditional quantiles are $1-p$, $0.25$, $0.5$, and $0.75$.) The argument type specifies the required type of regression model, more specifically rq for linear models and rqt for transformation-based models. The function qlss.formula will take any additional argument to be passed to quantreg::rq or tsrq (e.g. subset, weights, etc.).
Let's consider the New York Air Quality data example discussed in the previous section and assume that the transformation model (15) holds for the quantiles $p \in {0.05, 0.1, 0.25, 0.5, 0.75,$ $0.9, 0.95}$. Then the conditional LSS summary of the distribution of ozone conditional on solar radiation for $p = 0.05$ and $p = 0.1$ is calculated as follows:
> fit.qlss <- qlss(formula = Ozone ~ Solar.R, data = airquality, type = + "rqt", tsf = "mcjI", symm = TRUE, dbounded = FALSE, lambda = + seq(1, 3, by = 0.005), probs = c(0.05, 0.1)) > fit.qlss call: qlss.formula(formula = Ozone ~ Solar.R, probs = c(0.05, 0.1), data = airquality, type = "rqt", tsf = "mcjI", symm = TRUE, dbounded = FALSE, lambda = seq(1, 3, by = 0.005)) Conditional Quantile-Based Location, Scale, and Shape -- Values are averaged over observations -- ** Location ** Median [1] 30.2258 ** Scale ** Inter-quartile range (IQR) [1] 43.40648 Inter-quantile range (IPR) 0.05 0.1 88.02909 73.93430 **Shape** Skewness index 0.05 0.1 0.5497365 0.5180108 Shape index 0.05 0.1 1.960315 1.661648
The output, which is of class "qlss", is a named list with the same LSS measures seen in the case of unconditional quantiles. However, these are now conditional on solar radiation. By default, the predictions are the fitted values, which are averaged over observations for printing purposes. An optional data frame for predictions can be given via the argument newdata in predict.qlss. If interval = TRUE, the latter computes confidence intervals at the specified level using R bootstrap replications (it is, therefore, advisable to set the seed before calling predict.qlss). The conditional LSS measures can be conveniently plotted using the plot.qlss function as shown in the code below. The argument z is required and specifies the covariate used for plotting. Finally, the argument whichp specifies one probability (and one only) among those given in probs that should be used for plotting (e.g.\ $p = 0.1$ in the following example).
> set.seed(567) > x <- seq(9, 334, length = 200) > qhat <- predict(fit.qlss, newdata = data.frame(Solar.R = x), + interval = TRUE, level = 0.90, R = 500) > plot(qhat, z = x, whichp = 0.1, interval = TRUE, type = "l", + xlab = "Solar radiation (lang)", lwd = 2)
Figure 7 shows that both the median and the IQR of ozone increase nonlinearly with increasing solar radiation. The distribution of ozone is skewed to the right and the degree of asymmetry is highest at low values of solar radiation. This is due to the extreme curvature of the median which takes on values close to the 10th centile (Figure 5). (Recall that the index approaches 1 when $Q(p) \approx Q(0.5)$.) However, the sparsity of observations at the lower end of the observed range of solar radiation determines substantial uncertainty as reflected by the wider confidence interval (Figure 7). At $p = 0.1$, the conditional shape index is on average equal to $1.66$ and it increases monotonically from $1.32$ to about $1.85$, remaining always below the tail-weight threshold of a normal distribution ($1.90$).
library(Qtools) cx.lab <- 1 cx.ax <- 1 lw <- 1.5 cx.p <- 1 fit.qlss <- qlss(formula = Ozone ~ Solar.R, data = airquality, type = "rqt", tsf = "mcjI", symm = TRUE, dbounded = FALSE, lambda = seq(1, 3, by = 0.005), probs = c(0.05, 0.1)) set.seed(567) x <- seq(9, 334, length = 200) qhat <- predict(fit.qlss, newdata = data.frame(Solar.R = x), interval = TRUE, level = 0.90, R = 500) par(mfrow = c(1,2), mar = c(4.1, 4.1, 2.1, 2.1)) plot(qhat, z = x, whichp = 0.1, interval = TRUE, type = "l", xlab = "Solar radiation (lang)", lwd = lw, cex.lab = cx.lab, cex.axis = cx.ax)
Besides a loss of precision, high sparsity (low density) might also lead to a violation of the basic property of monotonicity of quantile functions. Quantile crossing occurs when $\mathbf{x}{i}^{\top}\hat{\boldsymbol \beta}(p) > \mathbf{x}{i}^{\top}\hat{\boldsymbol \beta}(p')$ for some $\mathbf{x}_{i}$ and $p < p'$. This problem typically occurs in the outlying regions of the design space [@koenker_2005] where also sparsity occurs more frequently. Balanced designs with larger sample sizes would then offer some assurance against quantile crossing, provided, of course, that the QR models are correctly specified. Model's misspecification, indeed, can still be a cause of crossing of the quantile curves. Restricted regression quantiles (RRQ) [@he_1997] might offer a practical solution when little can be done in terms of modelling. This approach applies to a subclass of linear models [ Y = \mathbf{x}^{\top}\beta + \epsilon ] and linear heteroscedastic models [ Y = \mathbf{x}^{\top}\beta + (\mathbf{x}^{\top}\gamma)\,\epsilon, ] where $\mathbf{x}^{\top}\gamma > 0$ and $\epsilon \sim F$. Basically, it consists in fitting a reduced regression model passing through the origin. The reader is referred to \cite{he_1997} for details. Here, it is worth stressing that when the restriction does not hold, i.e.\ if the model is more complex than a location--scale-shift model, then RRQ may yield unsatisfactory results \cite{he_1997}. See also \cite{zhao_2000} for an examination of the asymptotic properties of the restricted QR estimator. In particular, the relative efficiency of RRQ as compared to RQ depends on the error distribution. For some common unimodal distributions, \cite{zhao_2000} showed that RRQ in iid models is more efficient than RQ. This property is lost when the error is asymmetric. In contrast, the efficiency of RRQ in heteroscedastic models is comparable to that of RQ even for small samples.
The package Qtools provides the functions rrq, rrq.fit and rrq.wfit which are, respectively, the 'restricted' analogous of rq, rq.fit, and rq.wfit in quantreg. S3 methods print, coef, predict, fitted, residuals, and summary are available for objects of class "rrq". In particular, confidence intervals are obtained using the functions boot and boot.ci from package boot. Future versions of the package will develop the function summary.rrq to include asymptotic standard errors [@zhao_2000]. An application is shown below using an example discussed by \cite{zhao_2000}. The data set, available from Qtools, consists of $118$ measurements of esterase concentrations and number of bindings counted in binding experiments.
> data("esterase") > taus <- c(.1, .25, .5, .75, .9) > fit.rq <- rq(Count ~ Esterase, data = esterase, tau = taus) > yhat1 <- fitted(fit.rq) > fit.rrq <- rrq(Count ~ Esterase, data = esterase, tau = taus) > yhat2 <- fitted(fit.rrq)
The predicted 90th centile curve crosses the 50th and 75th curves at lower esterase concentrations (Figure 8). The crossing is removed in predictions based on RRQs.
library(Qtools) library(quantreg) data(esterase) cx.lab <- .75 cx.ax <- .75 lw <- 1 cx.p <- 1 par(mfrow = c(1,2), mar = c(4.1, 4.1, 2.1, 2.1)) fit <- quantreg::rq(Count ~ Esterase, data = esterase, tau = c(.1,.25,.5,.75,.9)) yhat <- fit$fitted.values fitr <- rrq(Count ~ Esterase, data = esterase, tau = c(.1,.25,.5,.75,.9)) yhat2 <- predict(fitr) # Plot regression quantiles plot(Count ~ Esterase, data = esterase, cex.lab = cx.lab, cex.axis = cx.ax, cex = cx.p, pch = 1, col = "gray70") apply(yhat, 2, function(y,x,lw) lines(x,y,lwd = lw), x = esterase$Esterase, lw = lw) mtext("(a)", 3, line = 0.5, cex = cx.lab) # Plot restricted regression quantiles plot(Count ~ Esterase, data = esterase, cex.lab = cx.lab, cex.axis = cx.ax, cex = cx.p, pch = 1, cex.main = cx.lab, col = "gray70") apply(yhat2, 2, function(y,x,lw) lines(x,y,lwd = lw), x = esterase$Esterase, lw = lw) mtext("(b)", 3, line = 0.5, cex = cx.lab)
As discussed above, the reliability of the results depends on the validity of the restriction carried by RRQ. A quick check can be performed using the location--scale-shift specification of the Khmaladze test.
> kt <- KhmaladzeTest(formula = Count ~ Esterase, data = esterase, + taus = seq(.05,.95,by = .01), nullH = "location-scale") > KhmaladzeFormat(kt, 0.05) Khmaladze test for the location-shift hypothesis Joint test is not significant at 10% level Test(s) for individual slopes: not significant at 10% level
The quantile crossing problem can be approached also by directly rearranging the fitted values $\hat{Q}_{Y|X = \mathbf{x}}(p)$ to obtain monotone (in $p$) predictions for each $\mathbf{x}$ [@chernozhukov_etal]. This method is implemented in the package Rearrangement [@rearrangement]. As compared to RRQ, this approach is more general as it is not confined to, for example, location--scale-shift models [@chernozhukov_etal]; however, in contrast to RRQ, it does not yield estimates of parameters (e.g.\ slopes) of the model underlying the final monotonised curves. Such estimates, available from rrq objects, may be of practical utility when summarising the results.
Modeling discrete response variables, such as categorical and count responses, has been traditionally approached with distribution-based methods: a parametric model $F_{Y|X}(y;\theta)$ is assumed and then fitted by means of MLE. Binomial, negative binomial, multinomial and Poisson regressions are well-known in many applied sciences. Because of the computational advantages and the asymptotic properties of MLE, these methods have long ruled among competing alternatives.
Modeling conditional functions of discrete data is less common and, on a superficial level, might even appear as an unnecessary complication. However, a deeper look at its rationale will reveal that a distribution-free analysis can provide insightful information in the discrete case as it does in the continuous case. Indeed, methods for conditional quantiles of continuous distributions can be---and have been---adapted to discrete responses.
Let $Y$ be a count variable such as, for example, the number of car accidents during a week or the number of visits of a patient to their doctor during a year. As usual, $X$ denotes a vector of covariates. Poisson regression, which belongs to the family of generalized linear models (GLMs), is a common choice for this kind of data, partly because of its availability in many statistical packages. Symbolically, $Y \sim \textrm{Pois}(\theta)$ where [ \theta \equiv \mathrm{E}(Y|X = \mathbf{x}) = h^{-1}\left(\mathbf{x}^{\top}\boldsymbol{\beta}\right) ] and $h$ is the logarithmic link function. Note that the variance also is equal to $\theta$. Indeed, moments of order higher than $2$ governing the shape of the distribution depend on the same parameter. Every component of the conditional LSS in a Poisson model is therefore controlled by $\theta$. If needed, more flexibility can be achieved using a distribution-free approach.
@machado_silva_2005 proposed the model \begin{equation}\label{eq:17} \tag{17} Q_{h\left(Z;p\right)}(p) = \mathbf{x}^{\top}\boldsymbol \beta(p), \end{equation} where $Z = Y + U$ is obtained by jittering $Y$ with a $[0,1)$-uniform noise $U$, independent of $Y$ and $X$. In principle, any monotone transformation $h$ can be considered. A natural choice for count data is a log-linear model \citep{machado_silva_2005}, i.e. \begin{equation} h\left(Z;p\right) = \begin{cases} \log\left(Z - p\right) & \text{for $Z > p$}\[.5cm] \log \zeta & \text{for $Z \leq p$}. \end{cases} \end{equation} where $0 < \zeta < p$. It follows that $Q_{Z|X}(p) = p + \exp\left(\mathbf{x}^{\top}\boldsymbol \beta(p)\right)$. (Note that the $p$th quantile of the conditional distribution of $Z$ is bounded below by $p$.) Given the continuity between counts induced by jittering, standard inference for linear quantile functions [@koenker_bassett_1978] can be applied to fit (17). In practice, a sample of $M$ jittered responses $Z$ is taken to estimate $\hat{\boldsymbol \beta}{m}(p)$, $m = 1,\ldots,M$; the noise is then averaged out, $\hat{\boldsymbol \beta}(p) = \frac{1}{M} \sum{m} \hat{\boldsymbol \beta}_{m}(p)$.
data("esterase") gg_color_hue <- function(n) { hues = seq(15, 375, length=n+1) hcl(h=hues, l=65, c=100)[1:n] } # Plot quantiles 0.25, 0.5, 0.75 cc <- gg_color_hue(2) lw <- 2 cx.lab <- .75 cx.ax <- .75 fit1 <- rq.counts(Count ~ Esterase, tau = 0.1, data = esterase, M = 50) fit2 <- rq.counts(Count ~ Esterase, tau = 0.5, data = esterase, M = 50) fit3 <- rq.counts(Count ~ Esterase, tau = 0.9, data = esterase, M = 50) fit.glm <- glm(Count ~ Esterase, data = esterase, family = poisson()) par(mar = c(4.1, 4.1, 2.1, 2.1)) plot(Count ~ Esterase, data = esterase, xlab = "Esterase", ylab = "Count", cex.lab = cx.lab, cex.axis = cx.ax, cex = 1) lambda <- fit.glm$fitted.values yhat <- mapply(qpois, c(0.1,0.5,0.9), MoreArgs = list(lambda = lambda)) lines(esterase$Esterase, yhat[,1], lwd = lw, col = cc[1]) lines(esterase$Esterase, yhat[,2], lwd = lw, col = cc[1]) lines(esterase$Esterase, yhat[,3], lwd = lw, col = cc[1]) yhat <- cbind(fit1$fitted.values, fit2$fitted.values, fit3$fitted.values) lines(esterase$Esterase, yhat[,1], lwd = lw, col = cc[2]) lines(esterase$Esterase, yhat[,2], lwd = lw, col = cc[2]) lines(esterase$Esterase, yhat[,3], lwd = lw, col = cc[2]) legend(5,1000, legend = c("Poisson","QR"), col = c(cc), lty = c(1,1), lwd = lw, cex = cx.lab)
@machado_silva_2005's methods, including large-$n$ approximations for standard errors, are implemented in the function rq.counts. The formula argument specifies a log-linear model as in (17). Prior to version 1.4, flexible transformations as described above (e.g., Box-Cox) were allowed. As of version 1.4 of the package, this option has been removed since inference under transformations other than logarithmic was found unreliable and this was likely due to coding issues.
In the example below, estimation is carried out using $M=50$ jittered samples and $\zeta = 10^{-5}$ (see @machado_silva_2005 for further details on these settings).
R> data(esterase) R> fit.rq.counts <- rq.counts(formula = Count ~ Esterase, tau = 0.1, + data = esterase, M = 50, zeta = 1e-05)
Figure 9 shows a contrast between centile curves as predicted by the Poisson and the QR models in the Esterase data set. The Poisson distribution clearly underestimates the variability in the data. An empirical modeling of the conditional quantiles seems to be preferred in this case. Of course, the assumption of log-linearity of the models would need to be carefully assessed (note that GOFTest can be applied also to rq.counts objects).
@geraci_farcomeni_2019 developed quantile regression methods for discrete responses by extending Parzen's definition of marginal mid-quantiles which we discussed previously. These methods are general as they can be applied to a large variety of discrete responses, including binary, ordinal, and count variables.
@geraci_farcomeni_2019 first defined the conditional mid-distribution function as \begin{equation}\label{eq:18} \tag{18} G_{Y|X}(y) \equiv \Pr(Y \leq y|X) - 0.5\cdot \Pr(Y = y|X). \end{equation} Then, they introduced the conditional mid-quantile function as its inverse, that is $H_{Y|X}(p) \equiv G^{-1}_{Y|X}(p)$.
Estimation is carried out in a two-step approach. In the first step, conditional mid-probabilities are obtained nonparametrically and, in the second step, regression coefficients are estimated by solving an implicit equation. When constraining the quantile index to a data-driven admissible range, the second-step estimating equation has a least-squares type, closed-form solution. Such estimator is shown to be strongly consistent and asymptotically normal. Its good performance is shown in a simulation study with data generated from different discrete response models.
data("esterase") gg_color_hue <- function(n) { hues = seq(15, 375, length=n+1) hcl(h=hues, l=65, c=100)[1:n] } # Plot quantiles 0.5, 0.9 cc <- gg_color_hue(2) lw <- 2 cx.lab <- .75 cx.ax <- .75 fit <- midrq(Count ~ Esterase, tau = c(0.5, 0.9), data = esterase, type = 1, lambda = 0) par(mar = c(4.1, 4.1, 2.1, 2.1)) plot(Count ~ Esterase, data = esterase, xlab = "Esterase", ylab = "Count", cex.lab = cx.lab, cex.axis = cx.ax, cex = 1) yhat <- predict(fit) lines(esterase$Esterase, yhat[,1], lwd = lw, col = cc[2]) lines(esterase$Esterase, yhat[,2], lwd = lw, col = cc[2])
In the example below, the function midrq is applied to the Esterase data set to obtain 10th and 90th conditional mid-quantiles (plotted in Figure 10). As opposed to rq.counts, the function midrq supports Box-Cox and Aranda-Ordaz transformations. S3 methods for midrq objects (e.g., summary, coef, predict, residuals, vcov) are available in the Qtools package. The latter also provides the function cmidecdf to fit the conditional mid-CDF. See @geraci_farcomeni_2019 for details on estimation.
R> data(esterase) R> fit.midrq <- midrq(Count ~ Esterase, tau = c(0.5, 0.9), + data = esterase, type = 1, lambda = 0)
Regression models play an important role in conditional imputation of missing values. QR can be used as an effective approach for multiple imputation (MI) when location-shift models are inadequate [@munoz_rueda_2009;@bottai_zhen_2013;@geraci_2016].
In Qtools, mice.impute.rq and mice.impute.rrq are auxiliary functions written to be used along with the functions of the R package mice [@mice]. The former is based on the standard QR estimator (rq.fit) while the latter on the restricted counterpart (rrq.fit). Both imputation functions allow for the specification of the transformation-based QR models discussed previously. The equivariance property is useful to achieve linearity of the conditional model and to ensure that imputations lie within some interval when imputed variables are bounded. An example is available from the help file \samp{?mice.impute.rq} using the nhanes data set. See also @geraci_2016 and @geraci_mclain_2018 for a thorough description of these methods.
Quantiles have long occupied an important place in statistics. The package Qtools builds on recent methodological and computational developments of quantile functions and related methods to promote their application in statistical data modelling.
This work was partially supported by an ASPIRE grant from the Office of the Vice President for Research at the University of South Carolina. Please cite the Qtools package as: Geraci, M. (2016). Qtools: A collection of models and other tools for quantile inference. R Journal, 8(2), 117-138.
Any scripts or data that you put into this service are public.
Add the following code to your website.
For more information on customizing the embed code, read Embedding Snippets.