# qqPlot: Quantile-Quantile (Q-Q) Plot In EnvStats: Package for Environmental Statistics, Including US EPA Guidance

## Description

Produces a quantile-quantile (Q-Q) plot, also called a probability plot. The qqPlot function is a modified version of the R functions qqnorm and qqplot. The EnvStats function qqPlot allows the user to specify a number of different distributions in addition to the normal distribution, and to optionally estimate the distribution parameters of the fitted distribution.

## Details

If y is not supplied, the vector x is assumed to be a sample from the probability distribution specified by the argument distribution (and param.list if estimate.params=FALSE). When plot.type="Q-Q", the quantiles of x are plotted on the y-axis against the quantiles of the assumed distribution on the x-axis.

If y is supplied and plot.type="Q-Q", the empirical quantiles of y are plotted against the empirical quantiles of x.

When plot.type="Tukey Mean-Difference Q-Q", the difference of the quantiles is plotted on the y-axis against the mean of the quantiles on the x-axis.

Special Distributions
When y is not supplied and the argument distribution specifies one of the following distributions, the function qqPlot behaves in the manner described below.

"lnorm"

Lognormal Distribution. The log-transformed quantiles are plotted against quantiles from a Normal (Gaussian) distribution.

"lnormAlt"

Lognormal Distribution (alternative parameterization). The untransformed quantiles are plotted against quantiles from a Lognormal distribution.

"lnorm3"

Three-Parameter Lognormal Distribution. The quantiles of log(x-threshold) are plotted against quantiles from a Normal (Gaussian) distribution. The value of threshold is either specified in the argument param.list, or, if estimate.params=TRUE, then it is estimated.

"zmnorm"

Zero-Modified Normal Distribution. The quantiles of the non-zero values (i.e., x[x!=0]) are plotted against quantiles from a Normal (Gaussian) distribution.

"zmlnorm"

Zero-Modified Lognormal Distribution. The quantiles of the log-transformed positive values (i.e., log(x[x>0])) are plotted against quantiles from a Normal (Gaussian) distribution.

"zmlnormAlt"

Zero-Modified Lognormal Distribution (alternative parameterization). The quantiles of the untransformed positive values (i.e., x[x>0]) are plotted against quantiles from a Lognormal distribution.
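The transformations listed above decide which values end up on the y-axis. The following base-R sketch illustrates that preprocessing on made-up data; the sample values and the threshold of 0.5 are assumptions chosen for illustration, and the code mirrors the descriptions above rather than the internals of qqPlot.

```r
# Hypothetical sample containing zeros and positive values
x <- c(0, 0, 1.2, 3.4, 0.7, 8.9, 2.1, 0, 5.6)

# "zmnorm": only the non-zero values are compared with normal quantiles
y.zmnorm <- sort(x[x != 0])

# "zmlnorm": log-transformed positive values are compared with normal quantiles
y.zmlnorm <- sort(log(x[x > 0]))

# "lnorm3": log(x - threshold) is compared with normal quantiles;
# the threshold here is an assumed value, not an estimate
x.pos <- x[x > 0]
threshold <- 0.5
y.lnorm3 <- sort(log(x.pos - threshold))

# Normal quantiles at the plotting positions returned by ppoints()
q.norm <- qnorm(ppoints(length(y.zmlnorm)))
```

Plotting y.zmlnorm against q.norm would give the kind of display described for "zmlnorm".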

Explanation of Q-Q Plots
A probability plot or quantile-quantile (Q-Q) plot is a graphical display invented by Wilk and Gnanadesikan (1968) to compare a data set to a particular probability distribution or to compare it to another data set. The idea is that if two population distributions are exactly the same, then they have the same quantiles (percentiles), so a plot of the quantiles for the first distribution vs. the quantiles for the second distribution will fall on the 0-1 line (i.e., the straight line y = x with intercept 0 and slope 1). If the two distributions have the same shape and spread but different locations, then the plot of the quantiles will fall on the line y = x + b (parallel to the 0-1 line) where b denotes the difference in locations. If the distributions have different locations and differ by a multiplicative constant m, then the plot of the quantiles will fall on the line y = mx + b (D'Agostino, 1986a, p. 25; Helsel and Hirsch, 1986, p. 42). Various kinds of differences between distributions will yield various kinds of deviations from a straight line.
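The straight-line relationship can be checked numerically with exact quantiles. In this sketch the values b = 3 and m = 2 are arbitrary choices for illustration: the quantiles of a normal distribution with mean b and standard deviation m are compared with standard normal quantiles.

```r
p <- seq(0.05, 0.95, by = 0.05)

# x-axis: standard normal quantiles
# y-axis: quantiles of a normal with mean b = 3 and standard deviation m = 2
b <- 3
m <- 2
qx <- qnorm(p)
qy <- qnorm(p, mean = b, sd = m)

# The points (qx, qy) lie on the line y = m * x + b
fit <- coef(lm(qy ~ qx))
intercept <- unname(fit[1])
slope <- unname(fit[2])
```

Here intercept recovers b and slope recovers m up to floating-point error.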

Comparing Observations to a Hypothesized Distribution
Let \underline{x} = x_1, x_2, …, x_n denote the observations in a random sample of size n from some unknown distribution with cumulative distribution function F(), and let x_{(1)}, x_{(2)}, …, x_{(n)} denote the ordered observations. Depending on the particular formula used for the empirical cdf (see ecdfPlot), the i'th order statistic is an estimate of the i/(n+1)'th, (i-0.5)/n'th, etc., quantile. For the moment, assume the i'th order statistic is an estimate of the i/(n+1)'th quantile, that is:

\hat{F}[x_{(i)}] = \hat{p}_i = \frac{i}{n+1} \;\;\;\;\;\; (1)

so

x_{(i)} \approx F^{-1}(\hat{p}_i) \;\;\;\;\;\; (2)

If we knew the form of the true cdf F, then the plot of x_{(i)} vs. F^{-1}(\hat{p}_i) would form approximately a straight line based on Equation (2) above. A probability plot is a plot of x_{(i)} vs. F_0^{-1}(\hat{p}_i), where F_0 denotes the cdf associated with the hypothesized distribution. The probability plot should fall roughly on the line y=x if F=F_0. If F and F_0 merely differ by a shift in location and scale, that is, if F(x) = F_0[(x - μ) / σ], then F^{-1}(p) = σ F_0^{-1}(p) + μ, so the plot should fall roughly on the line y = σ x + μ.
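As a minimal sketch of Equations (1) and (2), the coordinates of a normal probability plot can be built by hand in base R. The simulated data (mean 10, standard deviation 2) are an assumption made for illustration.

```r
set.seed(1)
x <- rnorm(25, mean = 10, sd = 2)   # simulated data (assumption)
n <- length(x)

x.ord  <- sort(x)                   # order statistics x_(i)
p.hat  <- seq_len(n) / (n + 1)      # plotting positions, Equation (1)
q.theo <- qnorm(p.hat)              # F_0^{-1}(p.hat) for a standard normal F_0

# Coordinates of the probability plot
qq <- list(x = q.theo, y = x.ord)

# Least-squares line through the points; its intercept and slope
# give rough estimates of the location and scale of the data
line.coef <- coef(lm(qq$y ~ qq$x))

if (interactive()) {
  plot(qq, xlab = "Quantiles of Normal(0, 1)", ylab = "Ordered observations")
  abline(line.coef)
}
```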

The quantity \hat{p}_i = i/(n+1) in Equation (1) above is called the plotting position for the probability plot. This particular formula for the plotting position is appealing because it can be shown that for any continuous distribution

E\{F[x_{(i)}]\} = \frac{i}{n+1} \;\;\;\;\;\; (3)

(Nelson, 1982, pp. 299-300; Stedinger et al., 1993). That is, the i'th plotting position defined as in Equation (1) is the expected value of the true cdf evaluated at the i'th order statistic. Many authors and practitioners, however, prefer to use a plotting position that satisfies:

F^{-1}(\hat{p}_i) = E[x_{(i)}] \;\;\;\;\;\; (4)

or one that satisfies

F^{-1}(\hat{p}_i) = M[x_{(i)}] = F^{-1}\{M[u_{(i)}]\} \;\;\;\;\;\; (5)

where M[x_{(i)}] denotes the median of the distribution of the i'th order statistic, and u_{(i)} denotes the i'th order statistic in a random sample of n uniform (0,1) random variates.

The plotting positions in Equation (4) are often approximated since the expected value of the i'th order statistic is often difficult and time-consuming to compute. Note that these plotting positions will differ for different distributions.

The plotting positions in Equation (5) were recommended by Filliben (1975) because they require computing or approximating only the medians of uniform (0,1) order statistics, no matter what the form of the assumed cdf F_0. Also, the median may be preferred as a measure of central tendency because the distributions of most order statistics are skewed.
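The quantities in Equations (3) and (5) can be computed directly, because the i'th order statistic of n uniform (0,1) variates follows a Beta(i, n - i + 1) distribution. The sketch below, with n = 20 chosen arbitrarily, compares the exact medians with the approximation obtained from the general formula \hat{p}_i = (i - a)/(n - 2a + 1) using the constant a = 0.3175 associated with the median plotting position.

```r
n <- 20
i <- seq_len(n)

# u_(i) ~ Beta(i, n - i + 1)
mean.u   <- i / (n + 1)                  # E[u_(i)], as in Equation (3)
median.u <- qbeta(0.5, i, n - i + 1)     # M[u_(i)], as in Equation (5)

# Approximation to the medians with plotting position constant a = 0.3175
a <- 0.3175
approx.median <- (i - a) / (n - 2 * a + 1)

# Largest absolute error of the approximation
max.err <- max(abs(median.u - approx.median))

# Median-based theoretical quantiles for a normal probability plot
q.median <- qnorm(median.u)
```

Only qbeta is needed for the medians, whatever the assumed distribution; the last line simply passes them through that distribution's quantile function.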

Most plotting positions can be written as:

\hat{p}_i = \frac{i - a}{n - 2a + 1} \;\;\;\;\;\; (6)

where 0 ≤ a ≤ 1 (D'Agostino, 1986a, p. 25; Stedinger et al., 1993). The quantity a is sometimes called the “plotting position constant”, and is determined by the argument plot.pos.con in the function qqPlot. The table below, adapted from Stedinger et al. (1993), displays commonly used plotting positions based on Equation (6) for several distributions.

| Name | a | Distribution Often Used With | References |
|---|---|---|---|
| Weibull | 0 | Weibull, Uniform | Weibull (1939), Stedinger et al. (1993) |
| Median | 0.3175 | Several | Filliben (1975), Vogel (1986) |
| Blom | 0.375 | Normal and Others | Blom (1958), Looney and Gulledge (1985) |
| Cunnane | 0.4 | Several | Cunnane (1978), Chowdhury et al. (1991) |
| Gringorten | 0.44 | Gumbel | Gringorten (1963), Vogel (1986) |
| Hazen | 0.5 | Several | Hazen (1914), Chambers et al. (1983), Cleveland (1993) |

For moderate and large sample sizes, there is very little difference in visual appearance of the Q-Q plot for different choices of plotting positions.
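Equation (6) is simple to compute, and doing so shows how quickly the choices of a converge as n grows. In this sketch plotPos is a hypothetical helper name, not an EnvStats function; the base R function ppoints implements the same formula.

```r
# Plotting positions from Equation (6); plotPos is a hypothetical helper
plotPos <- function(n, a) {
  stopifnot(n >= 1, a >= 0, a <= 1)
  (seq_len(n) - a) / (n - 2 * a + 1)
}

# Constants from the table above
a.values <- c(Weibull = 0, Median = 0.3175, Blom = 0.375,
              Cunnane = 0.4, Gringorten = 0.44, Hazen = 0.5)

# One column of plotting positions per constant, for n = 10
pos <- sapply(a.values, plotPos, n = 10)

# Largest gap between the Weibull (a = 0) and Hazen (a = 0.5) positions
# for increasing sample sizes
max.diff <- sapply(c(10, 100, 1000), function(n) {
  max(abs(plotPos(n, 0) - plotPos(n, 0.5)))
})

# ppoints() in base R uses the same formula
same.as.ppoints <- isTRUE(all.equal(plotPos(15, 0.375), ppoints(15, a = 0.375)))
```

The values in max.diff shrink roughly in proportion to 1/n, which is why the choice of a matters mainly for small samples.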

Comparing Two Data Sets
Let \underline{x} = x_1, x_2, …, x_n denote the observations in a random sample of size n from some unknown distribution with cumulative distribution function F(), and let x_{(1)}, x_{(2)}, …, x_{(n)} denote the ordered observations. Similarly, let \underline{y} = y_1, y_2, …, y_m denote the observations in a random sample of size m from some unknown distribution with cumulative distribution function G(), and let y_{(1)}, y_{(2)}, …, y_{(m)} denote the ordered observations. Suppose we are interested in investigating whether the shape of the distribution with cdf F is the same as the shape of the distribution with cdf G (e.g., F and G may both be normal distributions but differ in mean and standard deviation).

When n = m, we can visually explore this question by plotting y_{(i)} vs. x_{(i)}, for i = 1, 2, …, n. The values in \underline{y} are spread out in a certain way depending on the true distribution: they may be more or less symmetric about some value (the population mean or median) or they may be skewed to the right or left; they may be concentrated close to the mean or median (platykurtic) or there may be several observations “far away” from the mean or median on either side (leptokurtic). Similarly, the values in \underline{x} are spread out in a certain way. If the values in \underline{x} and \underline{y} are spread out in the same way, then the plot of y_{(i)} vs. x_{(i)} will be approximately a straight line. If the cdf F is exactly the same as the cdf G, then the plot of y_{(i)} vs. x_{(i)} will fall roughly on the straight line y = x. If F and G differ by a shift in location and scale, that is, if F[(x-μ)/σ] = G(x), then the plot will fall roughly on the line y = σ x + μ.

When n > m, a slight adjustment has to be made to produce the plot. Let \hat{p}_1, \hat{p}_2, …, \hat{p}_m denote the plotting positions corresponding to the m empirical quantiles for the y's and let \hat{p}^*_1, \hat{p}^*_2, …, \hat{p}^*_n denote the plotting positions corresponding to the n empirical quantiles for the x's. Then we plot y_{(j)} vs. x^*_{(j)} for j = 1, 2, …, m where

x^*_{(j)} = (1 - r) x_{(i)} + r x_{(i+1)} \;\;\;\;\;\; (7)

r = \frac{\hat{p}_j - \hat{p}^*_i}{\hat{p}^*_{i+1} - \hat{p}^*_i} \;\;\;\;\;\; (8)

\hat{p}^*_i ≤ \hat{p}_j ≤ \hat{p}^*_{i+1} \;\;\;\;\;\; (9)

That is, the values for the x^*_{(j)}'s are determined by linear interpolation based on the values of the plotting positions for \underline{x} and \underline{y}.

Note that the R function qqplot uses a different method than the one in Equation (7) above; it uses linear interpolation based on 1:n and m by calling the approx function.
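The two interpolation schemes can be compared in base R. This is a sketch under stated assumptions: the data are simulated, the plotting position constant a = 0.375 is an arbitrary choice, and the first computation follows Equations (7)-(9) as written above rather than the source code of qqPlot.

```r
set.seed(2)
x <- rnorm(30)             # larger sample, n = 30 (simulated)
y <- rnorm(12, mean = 1)   # smaller sample, m = 12 (simulated)
n <- length(x)
m <- length(y)
a <- 0.375                 # plotting position constant (assumption)

p.star <- (seq_len(n) - a) / (n - 2 * a + 1)   # positions for the x's
p.hat  <- (seq_len(m) - a) / (m - 2 * a + 1)   # positions for the y's

# Equations (7)-(9): interpolate the ordered x's linearly at p.hat
x.star <- approx(p.star, sort(x), xout = p.hat)$y

# qqplot-style adjustment: interpolate over the indices 1:n
# at m equally spaced points
x.index  <- approx(seq_len(n), sort(x), n = m)$y
x.qqplot <- qqplot(x, y, plot.it = FALSE)$x

if (interactive()) {
  plot(x.star, sort(y), xlab = "Adjusted x quantiles", ylab = "Ordered y")
}
```

Both x.star and x.index have length m, but the index-based version always includes the smallest and largest x, whereas the plotting-position version interpolates strictly inside the range of the x's.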

## Value

qqPlot returns a list with components x and y, giving the (x,y) coordinates of the points that have been or would have been plotted. There are four cases to consider:

1. The argument y is not supplied and plot.type="Q-Q".

   - x: the quantiles from the theoretical distribution.
   - y: the observed quantiles (order statistics) based on the data in the argument x.

2. The argument y is not supplied and plot.type="Tukey Mean-Difference Q-Q".

   - x: the averages of the observed and theoretical quantiles.
   - y: the differences between the observed quantiles (order statistics) and the theoretical quantiles.

3. The argument y is supplied and plot.type="Q-Q".

   - x: the observed quantiles based on the data in the argument x. Note that these are adjusted quantiles if the number of observations in the argument x is greater than the number of observations in the argument y.
   - y: the observed quantiles based on the data in the argument y. Note that these are adjusted quantiles if the number of observations in the argument y is greater than the number of observations in the argument x.

4. The argument y is supplied and plot.type="Tukey Mean-Difference Q-Q".

   - x: the averages of the quantiles based on the argument x and the quantiles based on the argument y.
   - y: the differences between the quantiles based on the argument x and the quantiles based on the argument y.

## Note

A quantile-quantile (Q-Q) plot, also called a probability plot, is a plot of the observed order statistics from a random sample (the empirical quantiles) against their (estimated) mean or median values based on an assumed distribution, or against the empirical quantiles of another set of data (Wilk and Gnanadesikan, 1968). Q-Q plots are used to assess whether data come from a particular distribution, or whether two datasets have the same parent distribution. If the distributions have the same shape (but not necessarily the same location or scale parameters), then the plot will fall roughly on a straight line. If the distributions are exactly the same, then the plot will fall roughly on the straight line y=x.

A Tukey mean-difference Q-Q plot, also called an m-d plot, is a modification of a Q-Q plot. Rather than plotting observed quantiles vs. theoretical quantiles or observed y-quantiles vs. observed x-quantiles, a Tukey mean-difference Q-Q plot plots the difference between the quantiles on the y-axis vs. the average of the quantiles on the x-axis (Cleveland, 1993, pp.22-23). If the two sets of quantiles come from the same parent distribution, then the points in this plot should fall roughly along the horizontal line y=0. If one set of quantiles comes from the same distribution with a shift in median, then the points in this plot should fall along a horizontal line above or below the line y=0. A Tukey mean-difference Q-Q plot enhances our perception of how the points in the Q-Q plot deviate from a straight line, because it is easier to judge deviations from a horizontal line than from a line with a non-zero slope.
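The mean-difference coordinates are a simple transformation of the Q-Q coordinates. The sketch below uses simulated data shifted by 0.5 (an assumption for illustration) and takes the difference as observed minus theoretical.

```r
set.seed(3)
x <- rnorm(40, mean = 0.5)     # simulated data with a location shift
n <- length(x)

obs  <- sort(x)                # observed quantiles
theo <- qnorm(ppoints(n))      # theoretical standard normal quantiles

# Tukey mean-difference coordinates
md <- list(x = (obs + theo) / 2,   # average of the two sets of quantiles
           y = obs - theo)         # observed minus theoretical

if (interactive()) {
  plot(md, xlab = "Mean of quantiles", ylab = "Observed - theoretical")
  abline(h = 0, lty = 2)
}
```

Because the simulated data differ from the standard normal only by a shift, the points should scatter around a horizontal line near 0.5 rather than around y = 0.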

In a Q-Q plot, the extreme points have more variability than points toward the center. A U-shaped Q-Q plot indicates that the underlying distribution for the observations on the y-axis is skewed to the right relative to the underlying distribution for the observations on the x-axis. An upside-down-U-shaped Q-Q plot indicates the y-axis distribution is skewed left relative to the x-axis distribution. An S-shaped Q-Q plot indicates the y-axis distribution has shorter tails than the x-axis distribution. Conversely, a plot that is bent down on the left and bent up on the right indicates that the y-axis distribution has longer tails than the x-axis distribution.
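These shapes can be reproduced with exact quantiles, which removes sampling noise from the picture. In this sketch the exponential distribution stands in for a right-skewed distribution and the t distribution with 3 degrees of freedom for a long-tailed one; both are illustrative choices.

```r
p <- ppoints(50)
qx <- qnorm(p)                 # x-axis: standard normal quantiles

# Right-skewed y-axis distribution: the plot is U-shaped (convex),
# so the slopes between successive points increase from left to right
qy.skew <- qexp(p)
slopes.skew <- diff(qy.skew) / diff(qx)
is.convex <- all(diff(slopes.skew) > 0)

# Long-tailed y-axis distribution: the plot is bent down on the left
# and up on the right, so the slopes are steeper at both ends
# than in the middle
qy.long <- qt(p, df = 3)
slopes.long <- diff(qy.long) / diff(qx)
steeper.ends <- slopes.long[1] > slopes.long[25] &&
  slopes.long[49] > slopes.long[25]

if (interactive()) {
  op <- par(mfrow = c(1, 2))
  plot(qx, qy.skew, main = "Right-skewed vs. normal")
  plot(qx, qy.long, main = "Long-tailed vs. normal")
  par(op)
}
```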

## Author(s)

Steven P. Millard

## References

Chambers, J.M., W.S. Cleveland, B. Kleiner, and P.A. Tukey. (1983). Graphical Methods for Data Analysis. Duxbury Press, Boston, MA, pp.11-16.

Cleveland, W.S. (1993). Visualizing Data. Hobart Press, Summit, New Jersey, 360pp.

D'Agostino, R.B. (1986a). Graphical Analysis. In: D'Agostino, R.B., and M.A. Stephens, eds. Goodness-of-Fit Techniques. Marcel Dekker, New York, Chapter 2, pp.7-62.

## See Also

ppoints, ecdfPlot, Distribution.df, qqPlotGestalt, qqPlotCensored, qqnorm.
## Examples

    # Load the package that provides qqPlot and the example data
    library(EnvStats)

    # The guidance document USEPA (1994b, pp. 6.22--6.25)
    # contains measures of 1,2,3,4-Tetrachlorobenzene (TcCB)
    # concentrations (in parts per billion) from soil samples
    # at a Reference area and a Cleanup area. These data are stored
    # in the data frame EPA.94b.tccb.df.
    #
    # Create a Q-Q plot for the reference area data first assuming a
    # normal distribution, then a lognormal distribution, then a
    # gamma distribution.

    # Assume a normal distribution
    #-----------------------------

    dev.new()
    with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"]))

    dev.new()
    with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"], add.line = TRUE))

    dev.new()
    with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"],
      plot.type = "Tukey", add.line = TRUE))

    # The Q-Q plot based on assuming a normal distribution shows a U-shape,
    # indicating the Reference area TcCB data are skewed to the right
    # compared to a normal distribution.

    # Assume a lognormal distribution
    #--------------------------------

    dev.new()
    with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"],
      dist = "lnorm", digits = 2, points.col = "blue", add.line = TRUE))

    dev.new()
    with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"],
      dist = "lnorm", digits = 2, plot.type = "Tukey",
      points.col = "blue", add.line = TRUE))

    # Alternative parameterization

    dev.new()
    with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"],
      dist = "lnormAlt", estimate.params = TRUE, digits = 2,
      points.col = "blue", add.line = TRUE))

    dev.new()
    with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"],
      dist = "lnormAlt", digits = 2, plot.type = "Tukey",
      points.col = "blue", add.line = TRUE))

    # The lognormal distribution appears to be an adequate fit.
    # Now look at a Q-Q plot assuming a gamma distribution.
    #----------------------------------------------------------

    dev.new()
    with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"],
      dist = "gamma", estimate.params = TRUE, digits = 2,
      points.col = "blue", add.line = TRUE))

    dev.new()
    with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"],
      dist = "gamma", digits = 2, plot.type = "Tukey",
      points.col = "blue", add.line = TRUE))

    # Alternative parameterization

    dev.new()
    with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"],
      dist = "gammaAlt", estimate.params = TRUE, digits = 2,
      points.col = "blue", add.line = TRUE))

    dev.new()
    with(EPA.94b.tccb.df, qqPlot(TcCB[Area == "Reference"],
      dist = "gammaAlt", digits = 2, plot.type = "Tukey",
      points.col = "blue", add.line = TRUE))

    #-------------------------------------------------------------------------------------

    # Generate 20 observations from a gamma distribution with parameters
    # shape=2 and scale=2, then create a normal (Gaussian) Q-Q plot for these data.
    # (Note: the call to set.seed simply allows you to reproduce this example.)

    set.seed(357)
    dat <- rgamma(20, shape = 2, scale = 2)

    dev.new()
    qqPlot(dat, add.line = TRUE)

    # Now assume a gamma distribution and estimate the parameters
    #------------------------------------------------------------

    dev.new()
    qqPlot(dat, dist = "gamma", estimate.params = TRUE, add.line = TRUE)

    # Clean up
    #---------
    rm(dat)
    graphics.off()